Commit Graph

49973 Commits

Author SHA1 Message Date
Jeff Mahoney
fef394f75b btrfs: drop unused extent_op arg from btrfs_add_delayed_data_ref
btrfs_add_delayed_data_ref is always called with a NULL extent_op,
so let's drop the argument.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:50 +01:00
Colin Ian King
694a0dee9c btrfs: remove redundant inode null check
The check for a null inode is redundant since the function
is a callback for exportfs, which will itself crash if
dentry->d_inode or parent->d_inode is NULL.  Removing the
null check makes this consistent with other file systems.

Also remove the redundant null dir check too.

Found with static analysis by CoverityScan, CID 1389472

Kudos to Jeff Mahoney for reviewing and explaining the error in
my original patch (most of this explanation went into the above
commit message) and David Sterba for pointing out that the dir
check is also redundant.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:50 +01:00
Seraphime Kirkovski
20c7bcec6f Btrfs: ACCESS_ONCE cleanup
This replaces ACCESS_ONCE macro with the corresponding
READ|WRITE macros

Signed-off-by: Seraphime Kirkovski <kirkseraph@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:50 +01:00
Seraphime Kirkovski
50d0446e68 Btrfs: code cleanup min/max -> min_t/max_t
This cleans up the cases where the min/max macros were used with a cast
rather than using directly min_t/max_t.

Signed-off-by: Seraphime Kirkovski <kirkseraph@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:50 +01:00
Geliang Tang
6b4df8b6c5 btrfs: use rb_entry() instead of container_of
To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:49 +01:00
Anand Jain
f74670f713 btrfs: use BTRFS_COMPRESS_NONE to specify no compression
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:49 +01:00
Michal Hocko
1aceabf362 btrfs: drop gfp mask tweaking in try_release_extent_state
try_release_extent_state reduces the gfp mask to GFP_NOFS if it is
compatible. This is true for GFP_KERNEL as well. There is no real
reason to do that though. There is no new lock taken down the
the only consumer of the gfp mask which is
try_release_extent_state
  clear_extent_bit
    __clear_extent_bit
      alloc_extent_state

So this seems just unnecessary and confusing.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:49 +01:00
Michal Hocko
3ba7ab220e btrfs: fix up misleading GFP_NOFS usage in btrfs_releasepage
b335b0034e ("Btrfs: Avoid using __GFP_HIGHMEM with slab allocator")
has reduced the allocation mask in btrfs_releasepage to GFP_NOFS just
to prevent from giving an unappropriate gfp mask to the slab allocator
deeper down the callchain (in alloc_extent_state). This is wrong for
two reasons a) GFP_NOFS might be just too restrictive for the calling
context b) it is better to tweak the gfp mask down when it needs that.

So just remove the mask tweaking from btrfs_releasepage and move it
down to alloc_extent_state where it is needed.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:49 +01:00
Qu Wenruo
18dc22c19b btrfs: Add WARN_ON for qgroup reserved underflow
Goldwyn Rodrigues has exposed and fixed a bug which underflows btrfs
qgroup reserved space, and leads to non-writable fs.

This reminds us that we don't have enough underflow check for qgroup
reserved space.

For underflow case, we should not really underflow the numbers but warn
and keeps qgroup still work.

So add more check on qgroup reserved space and add WARN_ON() and
btrfs_warn() for any underflow case.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-14 15:50:49 +01:00
Vivek Goyal
fea6d2a610 vfs: Use upper filesystem inode in bprm_fill_uid()
Right now bprm_fill_uid() uses inode fetched from file_inode(bprm->file).
This in turn returns inode of lower filesystem (in a stacked filesystem
setup).

I was playing with modified patches of shiftfs posted by james bottomley
and realized that through shiftfs setuid bit does not take effect. And
reason being that we fetch uid/gid from inode of lower fs (and not from
shiftfs inode). And that results in following checks failing.

/* We ignore suid/sgid if there are no mappings for them in the ns */
if (!kuid_has_mapping(bprm->cred->user_ns, uid) ||
    !kgid_has_mapping(bprm->cred->user_ns, gid))
	return;

uid/gid fetched from lower fs inode might not be mapped inside the user
namespace of container. So we need to look at uid/gid fetched from
upper filesystem (shiftfs in this particular case) and these should be
mapped and setuid bit can take affect.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-02-14 20:51:12 +13:00
David S. Miller
c3d8103bc0 Merge tag 'rxrpc-rewrite-20170210' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:

====================
afs: Use system UUID generation

There is now a general function for generating a UUID and AFS should make
use of it.  It's also been recommended to me that I switch to using random
rather than time plus MAC address-based UUIDs which this function does.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-13 22:24:16 -05:00
Kees Cook
46418413ed pstore: Check for prz allocation in walker
Instead of needing additional checks in callers for unallocated przs,
perform the check in the walker, which gives us a more universal way to
handle the situation.

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-02-13 10:25:52 -08:00
Kees Cook
76d5692a58 pstore: Correctly initialize spinlock and flags
The ram backend wasn't always initializing its spinlock correctly. Since
it was coming from kzalloc memory, though, it was harmless on
architectures that initialize unlocked spinlocks to 0 (at least x86 and
ARM). This also fixes a possibly ignored flag setting too.

When running under CONFIG_DEBUG_SPINLOCK, the following Oops was visible:

[    0.760836] persistent_ram: found existing buffer, size 29988, start 29988
[    0.765112] persistent_ram: found existing buffer, size 30105, start 30105
[    0.769435] persistent_ram: found existing buffer, size 118542, start 118542
[    0.785960] persistent_ram: found existing buffer, size 0, start 0
[    0.786098] persistent_ram: found existing buffer, size 0, start 0
[    0.786131] pstore: using zlib compression
[    0.790716] BUG: spinlock bad magic on CPU#0, swapper/0/1
[    0.790729]  lock: 0xffffffc0d1ca9bb0, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
[    0.790742] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.10.0-rc2+ #913
[    0.790747] Hardware name: Google Kevin (DT)
[    0.790750] Call trace:
[    0.790768] [<ffffff900808ae88>] dump_backtrace+0x0/0x2bc
[    0.790780] [<ffffff900808b164>] show_stack+0x20/0x28
[    0.790794] [<ffffff9008460ee0>] dump_stack+0xa4/0xcc
[    0.790809] [<ffffff9008113cfc>] spin_dump+0xe0/0xf0
[    0.790821] [<ffffff9008113d3c>] spin_bug+0x30/0x3c
[    0.790834] [<ffffff9008113e28>] do_raw_spin_lock+0x50/0x1b8
[    0.790846] [<ffffff9008a2d2ec>] _raw_spin_lock_irqsave+0x54/0x6c
[    0.790862] [<ffffff90083ac3b4>] buffer_size_add+0x48/0xcc
[    0.790875] [<ffffff90083acb34>] persistent_ram_write+0x60/0x11c
[    0.790888] [<ffffff90083aab1c>] ramoops_pstore_write_buf+0xd4/0x2a4
[    0.790900] [<ffffff90083a9d3c>] pstore_console_write+0xf0/0x134
[    0.790912] [<ffffff900811c304>] console_unlock+0x48c/0x5e8
[    0.790923] [<ffffff900811da18>] register_console+0x3b0/0x4d4
[    0.790935] [<ffffff90083aa7d0>] pstore_register+0x1a8/0x234
[    0.790947] [<ffffff90083ac250>] ramoops_probe+0x6b8/0x7d4
[    0.790961] [<ffffff90085ca548>] platform_drv_probe+0x7c/0xd0
[    0.790972] [<ffffff90085c76ac>] driver_probe_device+0x1b4/0x3bc
[    0.790982] [<ffffff90085c7ac8>] __device_attach_driver+0xc8/0xf4
[    0.790996] [<ffffff90085c4bfc>] bus_for_each_drv+0xb4/0xe4
[    0.791006] [<ffffff90085c7414>] __device_attach+0xd0/0x158
[    0.791016] [<ffffff90085c7b18>] device_initial_probe+0x24/0x30
[    0.791026] [<ffffff90085c648c>] bus_probe_device+0x50/0xe4
[    0.791038] [<ffffff90085c35b8>] device_add+0x3a4/0x76c
[    0.791051] [<ffffff90087d0e84>] of_device_add+0x74/0x84
[    0.791062] [<ffffff90087d19b8>] of_platform_device_create_pdata+0xc0/0x100
[    0.791073] [<ffffff90087d1a2c>] of_platform_device_create+0x34/0x40
[    0.791086] [<ffffff900903c910>] of_platform_default_populate_init+0x58/0x78
[    0.791097] [<ffffff90080831fc>] do_one_initcall+0x88/0x160
[    0.791109] [<ffffff90090010ac>] kernel_init_freeable+0x264/0x31c
[    0.791123] [<ffffff9008a25bd0>] kernel_init+0x18/0x11c
[    0.791133] [<ffffff9008082ec0>] ret_from_fork+0x10/0x50
[    0.793717] console [pstore-1] enabled
[    0.797845] pstore: Registered ramoops as persistent store backend
[    0.804647] ramoops: attached 0x100000@0xf7edc000, ecc: 0/0

Fixes: 663deb4788 ("pstore: Allow prz to control need for locking")
Fixes: 109704492e ("pstore: Make spinlock per zone instead of global")
Reported-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-02-13 10:25:52 -08:00
Konstantin Khlebnikov
d6cffbbe9a proc/sysctl: prune stale dentries during unregistering
Currently unregistering sysctl table does not prune its dentries.
Stale dentries could slowdown sysctl operations significantly.

For example, command:

 # for i in {1..100000} ; do unshare -n -- sysctl -a &> /dev/null ; done
 creates a millions of stale denties around sysctls of loopback interface:

 # sysctl fs.dentry-state
 fs.dentry-state = 25812579  24724135        45      0       0       0

 All of them have matching names thus lookup have to scan though whole
 hash chain and call d_compare (proc_sys_compare) which checks them
 under system-wide spinlock (sysctl_lock).

 # time sysctl -a > /dev/null
 real    1m12.806s
 user    0m0.016s
 sys     1m12.400s

Currently only memory reclaimer could remove this garbage.
But without significant memory pressure this never happens.

This patch collects sysctl inodes into list on sysctl table header and
prunes all their dentries once that table unregisters.

Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes:
> On 10.02.2017 10:47, Al Viro wrote:
>> how about >> the matching stats *after* that patch?
>
> dcache size doesn't grow endlessly, so stats are fine
>
> # sysctl fs.dentry-state
> fs.dentry-state = 92712	58376	45	0	0	0
>
> # time sysctl -a &>/dev/null
>
> real	0m0.013s
> user	0m0.004s
> sys	0m0.008s

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-02-13 17:00:06 +13:00
Linus Torvalds
2b95550a43 Merge branch 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This has two last minute fixes. The highest priority here is a
  regression fix for the decompression code, but we also fixed up a
  problem with the 32-bit compat ioctls.

  The decompression bug could hand back the wrong data on big reads when
  zlib was used. I have a larger cleanup to make the math here less
  error prone, but at this stage in the release Omar's patch is the best
  choice"

* 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: fix btrfs_decompress_buf2page()
  btrfs: fix btrfs_compat_ioctl failures on non-compat ioctls
2017-02-11 09:15:58 -08:00
David S. Miller
35eeacf182 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-02-11 02:31:11 -05:00
Omar Sandoval
6e78b3f7a1 Btrfs: fix btrfs_decompress_buf2page()
If btrfs_decompress_buf2page() is handed a bio with its page in the
middle of the working buffer, then we adjust the offset into the working
buffer. After we copy into the bio, we advance the iterator by the
number of bytes we copied. Then, we have some logic to handle the case
of discontiguous pages and adjust the offset into the working buffer
again. However, if we didn't advance the bio to a new page, we may enter
this case in error, essentially repeating the adjustment that we already
made when we entered the function. The end result is bogus data in the
bio.

Previously, we only checked for this case when we advanced to a new
page, but the conversion to bio iterators changed that. This restores
the old, correct behavior.

A case I saw when testing with zlib was:

    buf_start = 42769
    total_out = 46865
    working_bytes = total_out - buf_start = 4096
    start_byte = 45056

The condition (total_out > start_byte && buf_start < start_byte) is
true, so we adjust the offset:

    buf_offset = start_byte - buf_start = 2287
    working_bytes -= buf_offset = 1809
    current_buf_start = buf_start = 42769

Then, we copy

    bytes = min(bvec.bv_len, PAGE_SIZE - buf_offset, working_bytes) = 1809
    buf_offset += bytes = 4096
    working_bytes -= bytes = 0
    current_buf_start += bytes = 44578

After bio_advance(), we are still in the same page, so start_byte is the
same. Then, we check (total_out > start_byte && current_buf_start < start_byte),
which is true! So, we adjust the values again:

    buf_offset = start_byte - buf_start = 2287
    working_bytes = total_out - start_byte = 1809
    current_buf_start = buf_start + buf_offset = 45056

But note that working_bytes was already zero before this, so we should
have stopped copying.

Fixes: 974b1adc3b ("btrfs: use bio iterators for the decompression handlers")
Reported-by: Pat Erley <pat-lkml@erley.org>
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Liu Bo <bo.li.liu@oracle.com>
2017-02-10 19:11:03 -08:00
Linus Torvalds
7fe654dca2 Merge tag 'nfsd-4.10-3' of git://linux-nfs.org/~bfields/linux
Pull nfsd revert from Bruce Fields:
 "This patch turned out to have a couple problems. The problems are
  fixable, but at least one of the fixes is a little ugly. The original
  bug has always been there, so we can wait another week or two to get
  this right"

* tag 'nfsd-4.10-3' of git://linux-nfs.org/~bfields/linux:
  nfsd: Revert "nfsd: special case truncates some more"
2017-02-10 14:23:45 -08:00
Arnd Bergmann
b4db2b35fc afs: Use core kernel UUID generation
AFS uses a time based UUID to identify the host itself.  This requires
getting a timestamp which is currently done through the getnstimeofday()
interface that we want to eventually get rid of.

Instead of replacing it with a ktime-based interface, simply remove the
entire function and use generate_random_uuid() instead, which has a v4
("completely random") UUID instead of the time-based one.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
2017-02-10 16:34:17 +00:00
David Howells
ff54877310 afs: Move UUID struct to linux/uuid.h
Move the afs_uuid struct to linux/uuid.h, rename it to uuid_v1 and change
the u16/u32 fields to __be16/__be32 instead so that the structure can be
cast to a 16-octet network-order buffer.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de
2017-02-10 16:34:17 +00:00
Konstantin Khlebnikov
17627157cd kernfs: handle null pointers while printing node name and path
Null kernfs nodes could be found at cgroups during construction.
It seems safer to handle these null pointers right in kernfs in
the same way as printf prints "(null)" for null pointer string.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-10 16:02:26 +01:00
Thomas Gleixner
1e38da300e timerfd: Protect the might cancel mechanism proper
The handling of the might_cancel queueing is not properly protected, so
parallel operations on the file descriptor can race with each other and
lead to list corruptions or use after free.

Protect the context for these operations with a seperate lock.

The wait queue lock cannot be reused for this because that would create a
lock inversion scenario vs. the cancel lock. Replacing might_cancel with an
atomic (atomic_t or atomic bit) does not help either because it still can
race vs. the actual list operation.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "linux-fsdevel@vger.kernel.org"
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701311521430.3457@nanos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-10 11:15:09 +01:00
Jan Kara
5469d7c308 ext4: do not use stripe_width if it is not set
Avoid using stripe_width for sbi->s_stripe value if it is not actually
set. It prevents using the stride for sbi->s_stripe.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-02-10 00:56:09 -05:00
Jan Kara
d9b22cf9f5 ext4: fix stripe-unaligned allocations
When a filesystem is created using:

	mkfs.ext4 -b 4096 -E stride=512 <dev>

and we try to allocate 64MB extent, we will end up directly in
ext4_mb_complex_scan_group(). This is because the request is detected
as power-of-two allocation (so we start in ext4_mb_regular_allocator()
with ac_criteria == 0) however the check before
ext4_mb_simple_scan_group() refuses the direct buddy scan because the
allocation request is too large. Since cr == 0, the check whether we
should use ext4_mb_scan_aligned() fails as well and we fall back to
ext4_mb_complex_scan_group().

Fix the problem by checking for upper limit on power-of-two requests
directly when detecting them.

Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-02-10 00:50:56 -05:00
J. Bruce Fields
0839ffb83e nfsd: Revert "nfsd: special case truncates some more"
This patch incorrectly attempted nested mnt_want_write, and incorrectly
disabled nfsd's owner override for truncate.  We'll fix those problems
and make another attempt soon, for the moment I think the safest is to
revert.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-02-09 20:49:20 -05:00
Brian Norris
8672aed7bd pstore: don't OOPS when there are no ftrace zones
We'll OOPS in ramoops_get_next_prz() if the platform didn't ask for any
ftrace zones (i.e., cxt->fprzs will be NULL). Let's just skip this
entire FTRACE section if there's no 'fprzs'.

Regression seen on a coreboot/depthcharge-based Chromebook.

Fixes: 2fbea82bbb ("pstore: Merge per-CPU ftrace records into one")
Cc: Joel Fernandes <joelaf@google.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Brian Norris <briannorris@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
2017-02-09 11:49:49 -08:00
Mike Marshall
eb68d0324d orangefs: fix buffer size mis-match between kernel space and user space.
The deamon through which the kernel module communicates with the userspace
part of Orangefs, the "client-core", sends initialization data to the
kernel module with ioctl. The initialization data was built by the
client-core in a 2k buffer and copy_from_user'd into a 1k buffer
in the kernel module. When more than 1k of initialization data needed
to be sent, some was lost, reducing the usability of the control by which
debug levels are set. This patch sets the kernel side buffer to 2K to
match the userspace side...

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-02-09 14:41:06 -05:00
Mike Marshall
05973c2efb orangefs: Dan Carpenter influenced cleanups...
This patch is simlar to one Dan Carpenter sent me, cleans
up some return codes and whitespace errors. There was one
place where he thought inserting an error message into
the ring buffer might be too chatty, I hope I convinced him
othewise. As a consolation <g> I changed a truly chatty
error message in another location into a debug message,
system-admins had already yelled at me about that one...

Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2017-02-09 14:38:50 -05:00
Christoph Hellwig
4560e78f40 xfs: don't block the log commit handler for discards
Instead we submit the discard requests and use another workqueue to
release the extents from the extent busy list.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-09 11:36:40 -08:00
Christoph Hellwig
4669412981 xfs: improve busy extent sorting
Sort busy extents by the full block number instead of just the AGNO so
that we can issue consecutive discard requests that the block layer could
merge (although we'll need additional block layer fixes for fast devices).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-09 11:36:39 -08:00
Trond Myklebust
26ae102f2c NFSv4: Set the connection timeout to match the lease period
Set the timeout for TCP connections to be 1 lease period to ensure
that we don't lose our lease due to a faulty TCP connection.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-09 14:15:16 -05:00
Christoph Hellwig
ebf5587261 xfs: improve handling of busy extents in the low-level allocator
Currently we force the log and simply try again if we hit a busy extent,
but especially with online discard enabled it might take a while after
the log force for the busy extents to disappear, and we might have
already completed our second pass.

So instead we add a new waitqueue and a generation counter to the pag
structure so that we can do wakeups once we've removed busy extents,
and we replace the single retry with an unconditional one - after
all we hold the AGF buffer lock, so no other allocations or frees
can be racing with us in this AG.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-09 10:50:25 -08:00
Christoph Hellwig
5e30c23d13 xfs: don't fail xfs_extent_busy allocation
We don't just need the structure to track busy extents which can be
avoided with a synchronous transaction, but also to keep track of
pending discard.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-09 10:50:25 -08:00
Bill O'Donnell
b20fe4730e xfs: correct null checks and error processing in xfs_initialize_perag
If pag cannot be allocated, the current error exit path will trip
a null pointer deference error when calling xfs_buf_hash_destroy
with a null pag.  Fix this by adding a new error exit labels and
jumping to those accordingly, avoiding the hash destroy and
unnecessary kmem_free on pag.

Up to three things need to be properly unwound:

1) pag memory allocation
2) xfs_buf_hash_init
3) radix_tree_insert

For any given iteration through the loop, any of the above which
succeed must be unwound for /this/ pag, and then all prior
initialized pags must be unwound.

Addresses-Coverity-Id: 1397628 ("Dereference after null check")

Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Bill O'Donnell <billodo@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-09 10:50:25 -08:00
Christoph Hellwig
c5ecb42342 xfs: update ctime and mtime on clone destinatation inodes
We're changing both metadata and data, so we need to update the
timestamps for clone operations.  Dedupe on the other hand does
not change file data, and only changes invisible metadata so the
timestamps should not be updated.

This follows existing btrfs behavior.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
[darrick: remove redundant is_dedupe test]
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-09 10:50:24 -08:00
Kinglong Mee
6c71100db5 fanotify: simplify the code of fanotify_merge
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-02-09 14:09:22 +01:00
Trond Myklebust
a974deee47 NFSv4: Fix memory and state leak in _nfs4_open_and_get_state
If we exit because the file access check failed, we currently
leak the struct nfs4_state. We need to attach it to the
open context before returning.

Fixes: 3efb972247 ("NFSv4: Refactor _nfs4_open_and_get_state..")
Cc: stable@vger.kernel.org # 3.10+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08 17:02:47 -05:00
Kinglong Mee
863d7d9c2e sunrpc/nfs: cleanup procfs/pipefs entry in cache_detail
Record flush/channel/content entries is useless, remove them.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08 17:02:45 -05:00
Nicholas Piggin
600424e3d9 nfs: no PG_private waiters remain, remove waker
Since commit 4f52b6bb ("NFS: Don't call COMMIT in ->releasepage()"),
no tasks wait on PagePrivate, so the wake introduced in commit 95905446
("NFS: avoid deadlocks with loop-back mounted NFS filesystems.") can
be removed.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08 17:02:44 -05:00
Benjamin Coddington
920b4530fb NFS: nfs_rename() handle -ERESTARTSYS dentry left behind
An interrupted rename will leave the old dentry behind if the rename
succeeds.  Fix this by moving the final local work of the rename to
rpc_call_done so that the results of the RENAME can always be handled,
even if the original process has already returned with -ERESTARTSYS.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-02-08 17:02:44 -05:00
Christoph Hellwig
168316db35 dax: assert that i_rwsem is held exclusive for writes
Make sure all callers follow the same locking protocol, given that DAX
transparantly replaced the normal buffered I/O path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-02-08 14:43:13 -05:00
Christoph Hellwig
ff5462e39c ext4: fix DAX write locking
Unlike O_DIRECT DAX is not an optional opt-in feature selected by the
application, so we'll have to provide the traditional synchronіzation
of overlapping writes as we do for buffered writes.

This was broken historically for DAX, but got fixed for ext2 and XFS
as part of the iomap conversion.  Fix up ext4 as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2017-02-08 14:39:27 -05:00
Jeff Mahoney
2a36224918 btrfs: fix btrfs_compat_ioctl failures on non-compat ioctls
Commit 4c63c2454e incorrectly assumed that returning -ENOIOCTLCMD would
cause the native ioctl to be called.  The ->compat_ioctl callback is
expected to handle all ioctls, not just compat variants.  As a result,
when using 32-bit userspace on 64-bit kernels, everything except those
three ioctls would return -ENOTTY.

Fixes: 4c63c2454e ("btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctl")
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-08 17:47:30 +01:00
Eric Biggers
6f69f0ed61 fscrypt: constify struct fscrypt_operations
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Richard Weinberger <richard@nod.at>
2017-02-08 10:59:57 -05:00
David S. Miller
3efa70d78f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The conflict was an interaction between a bug fix in the
netvsc driver in 'net' and an optimization of the RX path
in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-07 16:29:30 -05:00
Hugh Dickins
b6789123bc mm: fix KPF_SWAPCACHE in /proc/kpageflags
Commit 6326fec112 ("mm: Use owner_priv bit for PageSwapCache, valid
when PageSwapBacked") aliased PG_swapcache to PG_owner_priv_1 (and
depending on PageSwapBacked being true).

As a result, the KPF_SWAPCACHE bit in '/proc/kpageflags' should now be
synthesized, instead of being shown on unrelated pages which just happen
to have PG_owner_priv_1 set.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-07 12:08:32 -08:00
Konstantin Khlebnikov
51f8f3c4e2 ovl: drop CAP_SYS_RESOURCE from saved mounter's credentials
If overlay was mounted by root then quota set for upper layer does not work
because overlay now always use mounter's credentials for operations.
Also overlay might deplete reserved space and inodes in ext4.

This patch drops capability SYS_RESOURCE from saved credentials.
This affects creation new files, whiteouts, and copy-up operations.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: 1175b6b8d9 ("ovl: do operations on underlying file system in mounter's context")
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-02-07 15:47:14 +01:00
Amir Goldstein
e593b2bf51 ovl: properly implement sync_filesystem()
overlayfs syncs all inode pages on sync_filesystem(), but it also
needs to call s_op->sync_fs() of upper fs for metadata sync.

This fixes correctness of syncfs(2) as demonstrated by following
xfs specific test:

xfs_sync_stats()
{
	echo $1
	echo -n "xfs_log_force = "
	grep log /proc/fs/xfs/stat  | awk '{ print $5 }'
}

xfs_sync_stats "before touch"
touch x
xfs_sync_stats "after touch"
xfs_io -c syncfs .
xfs_sync_stats "after syncfs"
xfs_io -c fsync x
xfs_sync_stats "after fsync"
xfs_io -c fsync x
xfs_sync_stats "after fsync #2"

When this test is run in overlay mount over xfs, log force
count does not increase with syncfs command.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-02-07 15:47:14 +01:00
Amir Goldstein
01ad3eb8a0 ovl: concurrent copy up of regular files
Now that copy up of regular file is done using O_TMPFILE,
we don't need to hold rename_lock throughout copy up.

Use the copy up waitqueue to synchronize concurrent copy up
of the same file. Different regular files can be copied up
concurrently.

The upper dir inode_lock is taken instead of rename_lock,
because it is needed for lookup and later for linking the
temp file, but it is released while copying up data.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-02-07 15:47:14 +01:00
Amir Goldstein
39d3d60a54 ovl: introduce copy up waitqueue
The overlay sb 'copyup_wq' and overlay inode 'copying' condition
variable are about to replace the upper sb rename_lock, as finer
grained synchronization objects for concurrent copy up.

Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-02-07 15:47:14 +01:00