In case of non-modular kernels the root filesystem is mounted by trying
several filesystems. If ocfs2 was tried before the actual filesystem
type, the mount would fail because ocfs2_sb_probe() returns -EAGAIN
instead of -EINVAL. ocfs2 will now return -EINVAL properly.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Reported-by: Laszlo Attila Toth <panther@balabit.hu>
We were incorrectly calculationing of object offset. If we have multiple
stripe units per object, we need to shift to the start of the current
su in addition to the offset within the su.
Also rename bno to ono (object number) to avoid some variable naming
confusion.
Signed-off-by: Sage Weil <sage@newdream.net>
The object extent offset is the file offset _modulo_ the stripe unit.
The code was correct, the comment was wrong.
Reported-by: Noah Watkins <jayhawk@soe.ucsc.edu>
Signed-off-by: Sage Weil <sage@newdream.net>
Using stripe unit size calculated and saved on the stack to avoid
a redundant call to le32_to_cpu.
Signed-off-by: Noah Watkins <noah@noahdesu.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Usage of non-list.h list_entry function for container_of
functionality replaced with direct use of container_of.
Signed-off-by: Noah Watkins <noah@noahdesu.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Hi,
Some workloads issue batches of small I/O, and the performance is poor
due to the call to blk_run_address_space for every single iocb. Nathan
Roberts pointed this out, and suggested that by deferring this call
until all I/Os in the iocb array are submitted to the block layer, we
can realize some impressive performance gains (up to 30% for sequential
4k reads in batches of 16).
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Hi,
The WRITE_ODIRECT flag is only used in one place, and that code path
happens to also call blk_run_address_space. The introduction of this
flag, then, could result in the device being unplugged twice for every
I/O.
Further, with the batching changes in the next patch, we don't want an
O_DIRECT write to imply a queue unplug.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We have been doing some extensive testing of Linux support for ACLs on
NFDS v4. We have noticed that the server rejects ACLs where the groups
are out of order, for example, the following ACL is rejected:
A::OWNER@:rwaxtTcCy
A::user101@domain:rwaxtcy
A::GROUP@:rwaxtcy
A:g:group102@domain:rwaxtcy
A:g:group101@domain:rwaxtcy
A::EVERYONE@:rwaxtcy
Examining the server code, I found that after converting an NFS v4 ACL
to POSIX, sort_pacl is called to sort the user ACEs and group ACEs.
Unfortunately, a minor bug causes the group sort to be skipped.
Signed-off-by: Frank Filz <ffilzlnx@us.ibm.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
We do the same calculation in a couple places; use a helper function,
and add a little documentation, in the hopes of preventing bugs like
that fixed in the last patch.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Unbalanced calculations on creation and destruction of sessions could
cause our estimate of cache memory used to become negative, sometimes
resulting in spurious SERVERFAULT returns to client CREATE_SESSION
requests.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
We're adding enough nfs documentation that it may as well have its own
subdirectory.
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
This simplifies much of the error handling during mount. It also means
that we have the mount args before client creation, and we can initialize
based on those options.
Signed-off-by: Sage Weil <sage@newdream.net>
Since we've increased the max mon count, we shouldn't put the addr array
on the parse_mount_args stack. Put it on the heap instead.
Signed-off-by: Sage Weil <sage@newdream.net>
The hugetlb dependencies presently depend on SUPERH && MMU while the
hugetlb page size definitions depend on CPU_SH4 or CPU_SH5. This
unfortunately allows SH-3 + MMU configurations to enable hugetlbfs
without a corresponding HPAGE_SHIFT definition, resulting in the build
blowing up.
As SH-3 doesn't support variable page sizes, we tighten up the
dependenies a bit to prevent hugetlbfs from being enabled. These days
we also have a shiny new SYS_SUPPORTS_HUGETLBFS, so switch to using
that rather than adding to the list of corner cases in fs/Kconfig.
Reported-by: Kristoffer Ericson <kristoffer.ericson@gmail.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
commit 0762b8bde9
(from 14 months ago) introduced a use-after-free bug which has just
recently started manifesting in my md testing.
I tried git bisect to find out what caused the bug to start
manifesting, and it could have been the recent change to
blk_unregister_queue (48c0d4d4c0) but the results were inconclusive.
This patch certainly fixes my symptoms and looks correct as the two
calls are now in the same order as elsewhere in that function.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Based on discussions on LKML and LSM, where there are consecutive
security_ and ima_ calls in the vfs layer, move the ima_ calls to
the existing security_ hooks.
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
RFC 3530 states that when we recieve the error NFS4ERR_RESOURCE, we are not
supposed to bump the sequence number on OPEN, LOCK, LOCKU, CLOSE, etc
operations. The problem is that we map that error into EREMOTEIO in the XDR
layer, and so the NFSv4 middle-layer routines like seqid_mutating_err(),
and nfs_increment_seqid() don't recognise it.
The fix is to defer the mapping until after the middle layers have
processed the error.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Actually pass the NFS_FILE_SYNC option to the server to avoid a
Panic in nfs_direct_write_complete() when a commit fails.
At the end of an nfs write, if the nfs commit fails, all the writes
will be rescheduled. They are supposed to be rescheduled as NFS_FILE_SYNC
writes, but the rpc_task structure is not completely intialized and so
the option is not passed. When the rescheduled writes complete, the
return indicates that they are NFS_UNSTABLE and we try to do another
commit. This leads to a Panic because the commit data structure pointer
was set to null in the initial (failed) commit attempt.
Signed-off-by: Terry Loftin <terry.loftin@hp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Get rid of separate max mon limit; use the system limit instead. This
allows mounts when there are lots of mon addrs provided by mount.ceph (as
with a host with lots of A/AAAA records).
Signed-off-by: Sage Weil <sage@newdream.net>
* 'for-linus' of git://git.infradead.org/users/eparis/notify:
dnotify: ignore FS_EVENT_ON_CHILD
inotify: fix coalesce duplicate events into a single event in special case
inotify: deprecate the inotify kernel interface
fsnotify: do not set group for a mark before it is on the i_list
This patch fixes a null pointer exception in pipe_rdwr_open() which
generates the stack trace:
> Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP:
> [<ffffffff802899a5>] pipe_rdwr_open+0x35/0x70
> [<ffffffff8028125c>] __dentry_open+0x13c/0x230
> [<ffffffff8028143d>] do_filp_open+0x2d/0x40
> [<ffffffff802814aa>] do_sys_open+0x5a/0x100
> [<ffffffff8021faf3>] sysenter_do_call+0x1b/0x67
The failure mode is triggered by an attempt to open an anonymous
pipe via /proc/pid/fd/* as exemplified by this script:
=============================================================
while : ; do
{ echo y ; sleep 1 ; } | { while read ; do echo z$REPLY; done ; } &
PID=$!
OUT=$(ps -efl | grep 'sleep 1' | grep -v grep |
{ read PID REST ; echo $PID; } )
OUT="${OUT%% *}"
DELAY=$((RANDOM * 1000 / 32768))
usleep $((DELAY * 1000 + RANDOM % 1000 ))
echo n > /proc/$OUT/fd/1 # Trigger defect
done
=============================================================
Note that the failure window is quite small and I could only
reliably reproduce the defect by inserting a small delay
in pipe_rdwr_open(). For example:
static int
pipe_rdwr_open(struct inode *inode, struct file *filp)
{
msleep(100);
mutex_lock(&inode->i_mutex);
Although the defect was observed in pipe_rdwr_open(), I think it
makes sense to replicate the change through all the pipe_*_open()
functions.
The core of the change is to verify that inode->i_pipe has not
been released before attempting to manipulate it. If inode->i_pipe
is no longer present, return ENOENT to indicate so.
The comment about potentially using atomic_t for i_pipe->readers
and i_pipe->writers has also been removed because it is no longer
relevant in this context. The inode->i_mutex lock must be used so
that inode->i_pipe can be dealt with correctly.
Signed-off-by: Earl Chew <earl_chew@agilent.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We can't fill i_size with rbytes at the fill_file_size stage without
adding additional checks for directories. Notably, we want st_blocks
to remain 0 on directories so that 'du' still works.
Fill in i_blocks, i_size specially in ceph_getattr instead.
Signed-off-by: Sage Weil <sage@newdream.net>
Mix the preferred osd (if any) into the placement seed that is fed into
the CRUSH object placement calculation. This prevents all the placement
pgs from peering with the same osds.
Rev the osd client protocol with this change.
Signed-off-by: Sage Weil <sage@newdream.net>
In order to have better cache layouts of struct sock (separate zones
for rx/tx paths), we need this preliminary patch.
Goal is to transfert fields used at lookup time in the first
read-mostly cache line (inside struct sock_common) and move sk_refcnt
to a separate cache line (only written by rx path)
This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
sport and id fields. This allows a future patch to define these
fields as macros, like sk_refcnt, without name clashes.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If we do rename a dir entry, like this:
rename("/tmp/ino7UrgoJ.rename1", "/tmp/ino7UrgoJ.rename2")
rename("/tmp/ino7UrgoJ.rename2", "/tmp/ino7UrgoJ")
The duplicate events should be coalesced into a single event. But those two
events do not be coalesced into a single event, due to some bad check in
event_compare(). It can not match the two NULL inodes as the same event.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
fsnotify_add_mark is supposed to add a mark to the g_list and i_list and to
set the group and inode for the mark. fsnotify_destroy_mark_by_entry uses
the fact that ->group != NULL to know if this group should be destroyed or
if it's already been done.
But fsnotify_add_mark sets the group and inode before it actually adds the
mark to the i_list and g_list. This can result in a race in inotify, it
requires 3 threads.
sys_inotify_add_watch("file") sys_inotify_add_watch("file") sys_inotify_rm_watch([a])
inotify_update_watch()
inotify_new_watch()
inotify_add_to_idr()
^--- returns wd = [a]
inotfiy_update_watch()
inotify_new_watch()
inotify_add_to_idr()
fsnotify_add_mark()
^--- returns wd = [b]
returns to userspace;
inotify_idr_find([a])
^--- gives us the pointer from task 1
fsnotify_add_mark()
^--- this is going to set the mark->group and mark->inode fields, but will
return -EEXIST because of the race with [b].
fsnotify_destroy_mark()
^--- since ->group != NULL we call back
into inotify_freeing_mark() which calls
inotify_remove_from_idr([a])
since fsnotify_add_mark() failed we call:
inotify_remove_from_idr([a]) <------WHOOPS it's not in the idr, this could
have been any entry added later!
The fix is to make sure we don't set mark->group until we are sure the mark is
on the inode and fsnotify_add_mark will return success.
Signed-off-by: Eric Paris <eparis@redhat.com>
Pass the front_len we need when pulling a message off a msgpool,
and WARN if it is greater than the pool's size. Then try to
allocate a new message (to continue without failing).
Signed-off-by: Sage Weil <sage@newdream.net>
Previously we were flushing dirty caps by passing an extra flag
when traversing the delayed caps list. Besides being a bit ugly,
that can also miss caps that are dirty but didn't result in a
cap requeue: notably, mark_caps_dirty().
Separate the flushing into a separate helper, and traverse the
cap_dirty list.
This also brings i_dirty_item in line with i_dirty_caps: we are
on the list IFF caps != 0. We carry an inode ref IFF
dirty_caps|flushing_caps != 0.
Lose the unused return value from __ceph_mark_caps_dirty().
Signed-off-by: Sage Weil <sage@newdream.net>
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: always pin metadata in discard mode
Btrfs: enable discard support
Btrfs: add -o discard option
Btrfs: properly wait log writers during log sync
Btrfs: fix possible ENOSPC problems with truncate
Btrfs: fix btrfs acl #ifdef checks
Btrfs: streamline tree-log btree block writeout
Btrfs: avoid tree log commit when there are no changes
Btrfs: only write one super copy during fsync
sysfs_notify_dirent is a simple atomic operation that can be used to
alert user-space that new data can be read from a sysfs attribute.
Unfortunately it cannot currently be called from non-process context
because of its use of spin_lock which is sometimes taken with
interrupts enabled.
So change all lockers of sysfs_open_dirent_lock to disable interrupts,
thus making sysfs_notify_dirent safe to be called from non-process
context (as drivers/md does in md_safemode_timeout).
sysfs_get_open_dirent is (documented as being) only called from
process context, so it uses spin_lock_irq. Other places
use spin_lock_irqsave.
The usage for sysfs_notify_dirent in md_safemode_timeout was
introduced in 2.6.28, so this patch is suitable for that and more
recent kernels.
Reported-by: Joel Andres Granados <jgranado@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
As device_move() and kobject_move() both handle a NULL destination,
sysfs_move_dir() should do this as well (again) and fall back to
sysfs_root in that case.
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: Phil Carmody <ext-phil.2.carmody@nokia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We had a watchdog in _get_block_create_0() that jumped to a fixup retry
path in case the bkl got relaxed while calling kmap().
This is not necessary anymore since we now have a reiserfs lock that is
not implicitly relaxed while sleeping.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Laurent Riffard <laurent.riffard@free.fr>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reiserfs uses the ioctl callback for its file operations, which means
that its ioctl path is still locked by the bkl, this was synchronizing
with the rest of the filsystem operations. We have changed that by
locking it with the new reiserfs lock but we do that only from the
compat_ioctl callback.
Fix that by locking reiserfs_ioctl() everytime.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Laurent Riffard <laurent.riffard@free.fr>
Cc: Thomas Gleixner <tglx@linutronix.de>