Commit Graph

17344 Commits

Author SHA1 Message Date
Sage Weil
c1ea8823be ceph: fix osd request submission race
The osd request submission path registers the request, drops and retakes
the request_mutex, then sends it to the OSD.  A racing kick_requests could
sent it during that interval, causing the same msg to be sent twice and
BUGing in the msgr.

Fix by only sending the message if it hasn't been touched by other
threads.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-09 11:58:03 -07:00
Chris Mason
ac6889cbb2 Btrfs: fix file clone ioctl for bookend extents
The file clone ioctl was incorrectly taking the offset into the
extent on disk into account when calculating the length of the
cloned extent.

The length never changes based on the offset into the physical extent.

Test case:

fallocate -l 1g image
mke2fs image
bcp image image2
e2fsck -f image2

(errors on image2)

The math bug ends up wrapping the length of the extent, and things
go wrong from there.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-09 11:29:53 -04:00
Chris Mason
e9061e2148 Btrfs: fix uninit compiler warning in cow_file_range_nocow
The extent_type variable was exposed uninit via a goto.  It should be
impossible to trigger because it is protected by a check on another
variable, but this makes sure.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-09 09:57:45 -04:00
Alexey Dobriyan
82d339d9b3 Btrfs: constify dentry_operations
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-09 09:54:36 -04:00
Yan, Zheng
94fcca9f89 Btrfs: optimize back reference update during btrfs_drop_snapshot
This patch reading level 0 tree blocks that already use full backrefs.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-09 09:25:16 -04:00
Yan, Zheng
efefb1438b Btrfs: remove negative dentry when deleting subvolumne
The use of btrfs_dentry_delete is removing dentries from the
dcache when deleting subvolumne. btrfs_dentry_delete ignores
negative dentries. This is incorrect since if we don't remove
the negative dentry, its parent dentry can't be removed.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-09 09:25:16 -04:00
Linus Torvalds
32b7a567c8 Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
  NFSv4: Kill nfs4_renewd_prepare_shutdown()
  NFSv4: Fix the referral mount code
  nfs: Avoid overrun when copying client IP address string
  NFS: Fix port initialisation in nfs_remount()
  NFS: Fix port and mountport display in /proc/self/mountinfo
  NFS: Fix a default mount regression...
2009-10-08 14:15:19 -07:00
Josef Bacik
ff782e0a13 Btrfs: optimize fsync for the single writer case
This patch optimizes the tree logging stuff so it doesn't always wait 1 jiffie
for new people to join the logging transaction if there is only ever 1 writer.
This helps a little bit with latency where we have something like RPM where it
will fdatasync every file it writes, and so waiting the 1 jiffie for every
fdatasync really starts to add up.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-08 15:30:04 -04:00
Josef Bacik
e3ccfa9897 Btrfs: async delalloc flushing under space pressure
This patch moves the delalloc flushing that occurs when we are under space
pressure off to a async thread pool.  This helps since we only free up
metadata space when we actually insert the extent item, which means it takes
quite a while for space to be free'ed up if we wait on all ordered extents.
However, if space is freed up due to inline extents being inserted, we can
wake people who are waiting up early, and they can finish their work.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-08 15:21:23 -04:00
Josef Bacik
32c00aff71 Btrfs: release delalloc reservations on extent item insertion
This patch fixes an issue with the delalloc metadata space reservation
code.  The problem is we used to free the reservation as soon as we
allocated the delalloc region.  The problem with this is if we are not
inserting an inline extent, we don't actually insert the extent item until
after the ordered extent is written out.  This patch does 3 things,

1) It moves the reservation clearing stuff into the ordered code, so when
we remove the ordered extent we remove the reservation.
2) It adds a EXTENT_DO_ACCOUNTING flag that gets passed when we clear
delalloc bits in the cases where we want to clear the metadata reservation
when we clear the delalloc extent, in the case that we do an inline extent
or we invalidate the page.
3) It adds another waitqueue to the space info so that when we start a fs
wide delalloc flush, anybody else who also hits that area will simply wait
for the flush to finish and then try to make their allocation.

This has been tested thoroughly to make sure we did not regress on
performance.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-08 15:21:10 -04:00
Chris Mason
a3429ab70b Btrfs: delay clearing EXTENT_DELALLOC for compressed extents
When compression is on, the cow_file_range code is farmed off to
worker threads.  This allows us to do significant CPU work in parallel
on SMP machines.

But it is a delicate balance around when we clear flags and how.  In
the past we cleared the delalloc flag immediately, which was safe
because the pages stayed locked.

But this is causing problems with the newest ENOSPC code, and with the
recent extent state cleanups we can now clear the delalloc bit at the
same time the uncompressed code does.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-08 15:11:50 -04:00
Chris Mason
a791e35e12 Btrfs: cleanup extent_clear_unlock_delalloc flags
extent_clear_unlock_delalloc has a growing set of ugly parameters
that is very difficult to read and maintain.

This switches to a flag field and well named flag defines.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-10-08 15:11:49 -04:00
Alex Elder
e09d39968b Merge branch 'master' into for-linus 2009-10-08 13:53:44 -05:00
Sage Weil
0656d11ba6 ceph: renew mon subscription before it expires
Be conservative: renew subscription once half the interval has expired.

Do not reuse sub expiration to control hunting.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-08 10:39:08 -07:00
Christoph Hellwig
d0800703fe xfs: stop calling filemap_fdatawait inside ->fsync
Now that the VFS actually waits for the data I/O to complete before
calling into ->fsync we can stop doing it ourselves.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-10-08 12:02:48 -05:00
Eric Sandeen
8e69ce1471 fix readahead calculations in xfs_dir2_leaf_getdents()
This is for bug #850,
http://oss.sgi.com/bugzilla/show_bug.cgi?id=850
XFS file system segfaults , repeatedly and 100% reproducable in 2.6.30 , 2.6.31

The above only showed up on a CONFIG_XFS_DEBUG=y kernel, because
xfs_bmapi() ASSERTs that it has been asked for at least one map,

and it was getting 0.

The root cause is that our guesstimated "bufsize" from xfs_file_readdir
was fairly small, and the

		bufsize -= length;

in the loop was going negative - except bufsize is a size_t, so it
was wrapping to a very large number.

Then when we did
		ra_want = howmany(bufsize + mp->m_dirblksize,
				  mp->m_sb.sb_blocksize) - 1;

with that very large number, the (int) ra_want was coming out
negative, and a subsequent compare:

		if (1 + ra_want > map_blocks ...

was coming out -true- (negative int compare w/ uint) and we went
back to xfs_bmapi() for more, even though we did not need more,
and asked for 0 maps, and hit the ASSERT.

We have kind of a type mess here, but just keeping bufsize from
going negative is probably sufficient to avoid the problem.

Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-10-08 12:02:12 -05:00
Dave Chinner
dce5065a57 xfs: make sure xfs_sync_fsdata covers the log
We want to always cover the log after writing out the superblock, and
in case of a synchronous writeout make sure we actually wait for the
log to be covered.  That way a filesystem that has been sync()ed can
be considered clean by log recovery.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-10-08 12:01:49 -05:00
Dave Chinner
932640e8ad xfs: mark inodes dirty before issuing I/O
To make sure they get properly waited on in sync when I/O is in flight and
we latter need to update the inode size.  Requires a new helper to check if an
ioend structure is beyond the current EOF.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-10-08 12:01:26 -05:00
Christoph Hellwig
69961a26b8 xfs: cleanup ->sync_fs
Sort out ->sync_fs to not perform a superblock writeback for the wait = 0 case
as that is just an optional first pass and the superblock will be written back
properly in the next call with wait = 1.  Instead perform an opportunistic
quota writeback to have less work later.  Also remove the freeze special case
as we do a proper wait = 1 call in the freeze code anyway.

Also rename the function to xfs_fs_sync_fs to match the normal naming
convention, update comments and avoid calling into the laptop_mode logic on
an error.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-10-08 12:01:03 -05:00
Dave Chinner
c90b07e8dd xfs: fix xfs_quiesce_data
We need to do a synchronous xfs_sync_fsdata to make sure the superblock
actually is on disk when we return.

Also remove SYNC_BDFLUSH flag to xfs_sync_inodes because that particular
flag is never checked.

Move xfs_filestream_flush call later to only release inodes after they
have been written out.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-10-08 12:00:36 -05:00
Christoph Hellwig
f9581b1443 xfs: implement ->dirty_inode to fix timestamp handling
This is picking up on Felix's repost of Dave's patch to implement a
.dirty_inode method.  We really need this notification because
the VFS keeps writing directly into the inode structure instead
of going through methods to update this state.  In addition to
the long-known atime issue we now also have a caller in VM code
that updates c/mtime that way for shared writeable mmaps.  And
I found another one that no one has noticed in practice in the FIFO
code.

So implement ->dirty_inode to set i_update_core whenever the
inode gets externally dirtied, and switch the c/mtime handling to
the same scheme we already use for atime (always picking up
the value from the Linux inode).

Note that this patch also removes the xfs_synchronize_atime call
in xfs_reclaim it was superflous as we already synchronize the time
when writing the inode via the log (xfs_inode_item_format) or the
normal buffers (xfs_iflush_int).

In addition also remove the I_CLEAR check before copying the Linux
timestamps - now that we always have the Linux inode available
we can always use the timestamps in it.

Also switch to just using file_update_time for regular reads/writes -
that will get us all optimization done to it for free and make
sure we notice early when it breaks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
2009-10-08 12:00:03 -05:00
Mimi Zohar
36520be8e3 ima: ecryptfs fix imbalance message
The unencrypted files are being measured.  Update the counters to get
rid of the ecryptfs imbalance message. (http://bugzilla.redhat.com/519737)

Reported-by: Sachin Garg
Cc: Eric Paris <eparis@redhat.com>
Cc: Dustin Kirkland <kirkland@canonical.com>
Cc: James Morris <jmorris@namei.org>
Cc: David Safford <safford@watson.ibm.com>
Cc: stable@kernel.org
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-10-08 11:31:38 -05:00
Tyler Hicks
ed1f21857e eCryptfs: Remove Kconfig NET dependency and select MD5
eCryptfs no longer uses a netlink interface to communicate with
ecryptfsd, so NET is not a valid dependency anymore.

MD5 is required and must be built for eCryptfs to be of any use.

Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-10-08 11:31:36 -05:00
Randy Dunlap
664fc5a4e7 ecryptfs: depends on CRYPTO
ecryptfs uses crypto APIs so it should depend on CRYPTO.
Otherwise many build errors occur. [63 lines not pasted]

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: ecryptfs-devel@lists.launchpad.net
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
2009-10-08 11:21:12 -05:00
Trond Myklebust
3050141bae NFSv4: Kill nfs4_renewd_prepare_shutdown()
The NFSv4 renew daemon is shared between all active super blocks that refer
to a particular NFS server, so it is wrong to be shutting it down in
nfs4_kill_super every time a super block is destroyed.

This patch therefore kills nfs4_renewd_prepare_shutdown altogether, and
leaves it up to nfs4_shutdown_client() to also shut down the renew daemon
by means of the existing call to nfs4_kill_renewd().

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-10-08 11:50:55 -04:00
Wu Fengguang
253fb02d62 pagemap: export KPF_HWPOISON
This flag indicates a hardware detected memory corruption on the page.
Any future access of the page data may bring down the machine.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-08 07:36:39 -07:00
Jaswinder Singh Rajput
4055e97318 fs: includecheck fix: proc, kcore.c
fix the following 'make includecheck' warning:

  fs/proc/kcore.c: linux/mm.h is included more than once.

Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-10-08 07:36:38 -07:00
Sage Weil
e251e28808 ceph: fix mdsmap decoding when multiple mds's are present
A misplaced sizeof() around namelen was throwing things off.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-07 16:38:55 -07:00
Sage Weil
b28813a61d ceph: gracefully avoid empty crush buckets
This avoids a divide by zero when the input and/or map are
malformed.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-07 10:59:34 -07:00
Sage Weil
b195befd9a ceph: include preferred_osd in file layout virtual xattr
Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-07 10:59:30 -07:00
Sage Weil
fa0b72e9e2 ceph: show meaningful version on module load
Kill the old git revision; print the ceph version and protocol
versions instead.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-07 10:59:10 -07:00
Trond Myklebust
517be09def NFSv4: Fix the referral mount code
Fix a typo which causes try_location() to use the wrong length argument
when calling nfs_parse_server_name(). This again, causes the initialisation
of the mount's sockaddr structure to fail.

Also ensure that if nfs4_pathname_string() returns an error, then we pass
that error back up the stack instead of ENOENT.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-10-06 15:42:20 -04:00
Ben Hutchings
f4373bf9e6 nfs: Avoid overrun when copying client IP address string
As seen in <http://bugs.debian.org/549002>, nfs4_init_client() can
overrun the source string when copying the client IP address from
nfs_parsed_mount_data::client_address to nfs_client::cl_ipaddr.  Since
these are both treated as null-terminated strings elsewhere, the copy
should be done with strlcpy() not memcpy().

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-10-06 15:42:18 -04:00
Trond Myklebust
bcd2ea17da NFS: Fix port initialisation in nfs_remount()
The recent changeset 53a0b9c4c9 (NFS: Replace
nfs_parse_ip_address() with rpc_pton()) broke nfs_remount, since the call
to rpc_pton() will zero out the port number in data->nfs_server.address.

This is actually due to a bug in nfs_remount: it should be looking at the
port number in nfs_server.port instead...

This fixes bug
   http://bugzilla.kernel.org/show_bug.cgi?id=14276

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-10-06 15:41:22 -04:00
Trond Myklebust
f5855fecda NFS: Fix port and mountport display in /proc/self/mountinfo
Currently, the port and mount port will both display as 65535 if you do not
specify a port number. That would be wrong...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-10-06 15:40:37 -04:00
Trond Myklebust
c5811dbdd2 NFS: Fix a default mount regression...
With the recent spate of changes, the nfs protocol version will now default
to 2 instead of 3, while the mount protocol version defaults to 3.

The following patch should ensure the defaults are consistent with the
previous defaults of vers=3,proto=tcp,mountvers=3,mountproto=tcp.

This fixes the bug
   http://bugzilla.kernel.org/show_bug.cgi?id=14259

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-10-06 15:40:15 -04:00
Sage Weil
e324b8f991 ceph: document shared files in README
Document files shared between kernel and user code trees.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-06 12:21:17 -07:00
Steve French
8347a5cdd1 [CIFS] Fixing to avoid invalid kfree() in cifs_get_tcp_session()
trivial bug in fs/cifs/connect.c .
The bug is caused by fail of extract_hostname()
when mounting cifs file system.

This is the situation when I noticed this bug.

% sudo mount -t cifs //192.168.10.208 mountpoint -o options...

Then my kernel says,

[ 1461.807776] ------------[ cut here ]------------
[ 1461.807781] kernel BUG at mm/slab.c:521!
[ 1461.807784] invalid opcode: 0000 [#2] PREEMPT SMP
[ 1461.807790] last sysfs file:
/sys/devices/pci0000:00/0000:00:1e.0/0000:09:02.0/resource
[ 1461.807793] CPU 0
[ 1461.807796] Modules linked in: nls_iso8859_1 usbhid sbp2 uhci_hcd
ehci_hcd i2c_i801 ohci1394 ieee1394 psmouse serio_raw pcspkr sky2 usbcore
evdev
[ 1461.807816] Pid: 3446, comm: mount Tainted: G      D 2.6.32-rc2-vanilla
[ 1461.807820] RIP: 0010:[<ffffffff810b888e>]  [<ffffffff810b888e>]
kfree+0x63/0x156
[ 1461.807829] RSP: 0018:ffff8800b4f7fbb8  EFLAGS: 00010046
[ 1461.807832] RAX: ffffea00033fff98 RBX: ffff8800afbae7e2 RCX:
0000000000000000
[ 1461.807836] RDX: ffffea0000000000 RSI: 000000000000005c RDI:
ffffffffffffffea
[ 1461.807839] RBP: ffff8800b4f7fbf8 R08: 0000000000000001 R09:
0000000000000000
[ 1461.807842] R10: 0000000000000000 R11: ffff8800b4f7fbf8 R12:
00000000ffffffea
[ 1461.807845] R13: ffff8800afb23000 R14: ffff8800b4f87bc0 R15:
ffffffffffffffea
[ 1461.807849] FS:  00007f52b6f187c0(0000) GS:ffff880007600000(0000)
knlGS:0000000000000000
[ 1461.807852] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1461.807855] CR2: 0000000000613000 CR3: 00000000af8f9000 CR4:
00000000000006f0
[ 1461.807858] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 1461.807861] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[ 1461.807865] Process mount (pid: 3446, threadinfo ffff8800b4f7e000, task
ffff8800950e4380)
[ 1461.807867] Stack:
[ 1461.807869]  0000000000000202 0000000000000282 ffff8800b4f7fbf8
ffff8800afbae7e2
[ 1461.807876] <0> 00000000ffffffea ffff8800afb23000 ffff8800b4f87bc0
ffff8800b4f7fc28
[ 1461.807884] <0> ffff8800b4f7fcd8 ffffffff81159f6d ffffffff81147bc2
ffffffff816bfb48
[ 1461.807892] Call Trace:
[ 1461.807899]  [<ffffffff81159f6d>] cifs_get_tcp_session+0x440/0x44b
[ 1461.807904]  [<ffffffff81147bc2>] ? find_nls+0x1c/0xe9
[ 1461.807909]  [<ffffffff8115b889>] cifs_mount+0x16bc/0x2167
[ 1461.807917]  [<ffffffff814455bd>] ? _spin_unlock+0x30/0x4b
[ 1461.807923]  [<ffffffff81150da9>] cifs_get_sb+0xa5/0x1a8
[ 1461.807928]  [<ffffffff810c1b94>] vfs_kern_mount+0x56/0xc9
[ 1461.807933]  [<ffffffff810c1c64>] do_kern_mount+0x47/0xe7
[ 1461.807938]  [<ffffffff810d8632>] do_mount+0x712/0x775
[ 1461.807943]  [<ffffffff810d671f>] ? copy_mount_options+0xcf/0x132
[ 1461.807948]  [<ffffffff810d8714>] sys_mount+0x7f/0xbf
[ 1461.807953]  [<ffffffff8144509a>] ? lockdep_sys_exit_thunk+0x35/0x67
[ 1461.807960]  [<ffffffff81011cc2>] system_call_fastpath+0x16/0x1b
[ 1461.807963] Code: 00 00 00 00 ea ff ff 48 c1 e8 0c 48 6b c0 68 48 01 d0
66 83 38 00 79 04 48 8b 40 10 66 83 38 00 79 04 48 8b 40 10 80 38 00 78 04
<0f> 0b eb fe 4c 8b 70 58 4c 89 ff 41 8b 76 4c e8 b8 49 fb ff e8
[ 1461.808022] RIP  [<ffffffff810b888e>] kfree+0x63/0x156
[ 1461.808027]  RSP <ffff8800b4f7fbb8>
[ 1461.808031] ---[ end trace ffe26fcdc72c0ce4 ]---

The reason of this bug is that the error handling code of
cifs_get_tcp_session()
calls kfree() when corresponding kmalloc() failed.
(The kmalloc() is called by extract_hostname().)

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
CC: Stable <stable@kernel.org>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2009-10-06 18:31:29 +00:00
Sage Weil
9030aaf9bf ceph: Kconfig, Makefile
Kconfig options and Makefile.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-06 11:31:15 -07:00
Sage Weil
76aa844d5b ceph: debugfs
Basic state information is available via /sys/kernel/debug/ceph,
including instances of the client, fsids, current monitor, mds and osd
maps, outstanding server requests, and hooks to adjust debug levels.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-06 11:31:14 -07:00
Sage Weil
8f4e91dee2 ceph: ioctls
A few Ceph ioctls for getting and setting file layout (striping)
parameters, and learning the identity and network address of the OSD a
given region of a file is stored on.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-06 11:31:14 -07:00
Sage Weil
a8e63b7d51 ceph: nfs re-export support
Basic NFS re-export support is included.  This mostly works.  However,
Ceph's MDS design precludes the ability to generate a (small)
filehandle that will be valid forever, so this is of limited utility.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-06 11:31:13 -07:00
Sage Weil
8fc91fd859 ceph: message pools
The msgpool is a basic mempool_t-like structure to preallocate
messages we expect to receive over the wire.  This ensures we have the
necessary memory preallocated to process replies to requests, or to
process unsolicited messages from various servers.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-06 11:31:13 -07:00
Sage Weil
31b8006e1d ceph: messenger library
A generic message passing library is used to communicate with all
other components in the Ceph file system.  The messenger library
provides ordered, reliable delivery of messages between two nodes in
the system.

This implementation is based on TCP.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-06 11:31:13 -07:00
Sage Weil
963b61eb04 ceph: snapshot management
Ceph snapshots rely on client cooperation in determining which
operations apply to which snapshots, and appropriately flushing
snapshotted data and metadata back to the OSD and MDS clusters.
Because snapshots apply to subtrees of the file hierarchy and can be
created at any time, there is a fair bit of bookkeeping required to
make this work.

Portions of the hierarchy that belong to the same set of snapshots
are described by a single 'snap realm.'  A 'snap context' describes
the set of snapshots that exist for a given file or directory.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-06 11:31:12 -07:00
Sage Weil
a8599bd821 ceph: capability management
The Ceph metadata servers control client access to inode metadata and
file data by issuing capabilities, granting clients permission to read
and/or write both inode field and file data to OSDs (storage nodes).
Each capability consists of a set of bits indicating which operations
are allowed.

If the client holds a *_SHARED cap, the client has a coherent value
that can be safely read from the cached inode.

In the case of a *_EXCL (exclusive) or FILE_WR capabilities, the client
is allowed to change inode attributes (e.g., file size, mtime), note
its dirty state in the ceph_cap, and asynchronously flush that
metadata change to the MDS.

In the event of a conflicting operation (perhaps by another client),
the MDS will revoke the conflicting client capabilities.

In order for a client to cache an inode, it must hold a capability
with at least one MDS server.  When inodes are released, release
notifications are batched and periodically sent en masse to the MDS
cluster to release server state.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-06 11:31:12 -07:00
Sage Weil
ba75bb98cf ceph: monitor client
The monitor cluster is responsible for managing cluster membership
and state.  The monitor client handles what minimal interaction
the Ceph client has with it: checking for updated versions of the
MDS and OSD maps, getting statfs() information, and unmounting.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-06 11:31:11 -07:00
Sage Weil
5ecc0a0f81 ceph: CRUSH mapping algorithm
CRUSH is a pseudorandom data distribution function designed to map
inputs onto a dynamic hierarchy of devices, while minimizing the
extent to which inputs are remapped when the devices are added or
removed.  It includes some features that are specifically useful for
storage, most notably the ability to map each input onto a set of N
devices that are separated across administrator-defined failure
domains.  CRUSH is used to distribute data across the cluster of Ceph
storage nodes.

More information about CRUSH can be found in this paper:

    http://www.ssrc.ucsc.edu/Papers/weil-sc06.pdf

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-06 11:31:11 -07:00
Sage Weil
f24e9980eb ceph: OSD client
The OSD client is responsible for reading and writing data from/to the
object storage pool.  This includes determining where objects are
stored in the cluster, and ensuring that requests are retried or
redirected in the event of a node failure or data migration.

If an OSD does not respond before a timeout expires, keepalive
messages are sent across the lossless, ordered communications channel
to ensure that any break in the TCP is discovered.  If the session
does reset, a reconnection is attempted and affected requests are
resent (by the message transport layer).

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-06 11:31:10 -07:00
Sage Weil
2f2dc05340 ceph: MDS client
The MDS (metadata server) client is responsible for submitting
requests to the MDS cluster and parsing the response.  We decide which
MDS to submit each request to based on cached information about the
current partition of the directory hierarchy across the cluster.  A
stateful session is opened with each MDS before we submit requests to
it, and a mutex is used to control the ordering of messages within
each session.

An MDS request may generate two responses.  The first indicates the
operation was a success and returns any result.  A second reply is
sent when the operation commits to disk.  Note that locking on the MDS
ensures that the results of updates are visible only to the updating
client before the operation commits.  Requests are linked to the
containing directory so that an fsync will wait for them to commit.

If an MDS fails and/or recovers, we resubmit requests as needed.  We
also reconnect existing capabilities to a recovering MDS to
reestablish that shared session state.  Old dentry leases are
invalidated.

Signed-off-by: Sage Weil <sage@newdream.net>
2009-10-06 11:31:09 -07:00