Commit Graph

49973 Commits

Author SHA1 Message Date
Amir Goldstein
4c7d0c9cb7 ovl: fix possible use after free on redirect dir lookup
ovl_lookup_layer() iterates on path elements of d->name.name
but also frees and allocates a new pointer for d->name.name.

For the case of lookup in upper layer, the initial d->name.name
pointer is stable (dentry->d_name), but for lower layers, the
initial d->name.name can be d->redirect, which can be freed during
iteration.

[SzM]
Keep the count of remaining characters in the redirect path and calculate
the current position from that.  This works becuase only the prefix is
modified, the ending always stays the same.

Fixes: 02b69b284c ("ovl: lookup redirects")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-01-18 15:19:54 +01:00
David S. Miller
580bdf5650 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-01-17 15:19:37 -05:00
Eric Sandeen
657bdfb7f5 xfs: don't wrap ID in xfs_dq_get_next_id
The GETNEXTQOTA ioctl takes whatever ID is sent in,
and looks for the next active quota for an user
equal or higher to that ID.

But if we are at the maximum ID and then ask for the "next"
one, we may wrap back to zero.  In this case, userspace
may loop forever, because it will start querying again
at zero.

We'll fix this in userspace as well, but for the kernel,
return -ENOENT if we ask for the next quota ID
past UINT_MAX so the caller knows to stop.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-17 11:43:38 -08:00
Amir Goldstein
a324cbf10a xfs: sanity check inode di_mode
Check for invalid file type in xfs_dinode_verify()
and fail to load the inode structure from disk.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-17 11:42:22 -08:00
Amir Goldstein
fab8eef86c xfs: sanity check inode mode when creating new dentry
The helper xfs_dentry_to_name() is used by 2 different
classes of callers: Callers that pass zero mode and don't care
about the returned name.type field and Callers that pass
non zero mode and do care about the name.type field.

Change xfs_dentry_to_name() to not take the mode argument and
change the call sites of the first class to not pass the mode
argument.

Create a new helper xfs_dentry_mode_to_name() which does pass
the mode argument and returns -EFSCORRUPTED if mode is invalid.
Callers that translate non zero mode to on-disk file type now
check the return value and will export the error to user instead
of staging an invalid file type to be written to directory entry.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-17 11:42:22 -08:00
Amir Goldstein
1fc4d33fed xfs: replace xfs_mode_to_ftype table with switch statement
The size of the xfs_mode_to_ftype[] conversion table
was too small to handle an invalid value of mode=S_IFMT.

Instead of fixing the table size, replace the conversion table
with a conversion helper that uses a switch statement.

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-17 11:41:43 -08:00
Amir Goldstein
b597dd5373 xfs: add missing include dependencies to xfs_dir2.h
xfs_dir2.h dereferences some data types in inline functions
and fails to include those type definitions, e.g.:
xfs_dir2_data_aoff_t, struct xfs_da_geometry.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-17 11:41:42 -08:00
Amir Goldstein
3c6f46eacd xfs: sanity check directory inode di_size
This changes fixes an assertion hit when fuzzing on-disk
i_mode values.

The easy case to fix is when changing an empty file
i_mode to S_IFDIR. In this case, xfs_dinode_verify()
detects an illegal zero size for directory and fails
to load the inode structure from disk.

For the case of non empty file whose i_mode is changed
to S_IFDIR, the ASSERT() statement in xfs_dir2_isblock()
is replaced with return -EFSCORRUPTED, to avoid interacting
with corrupted jusk also when XFS_DEBUG is disabled.

Suggested-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-17 11:41:41 -08:00
Amir Goldstein
bf46ecc3d8 xfs: make the ASSERT() condition likely
The ASSERT() condition is the normal case, not the exception,
so testing the condition should be likely(), not unlikely().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-17 11:41:41 -08:00
Richard Weinberger
1cb51a15b5 ubifs: Fix journal replay wrt. xattr nodes
When replaying the journal it can happen that a journal entry points to
a garbage collected node.
This is the case when a power-cut occurred between a garbage collect run
and a commit. In such a case nodes have to be read using the failable
read functions to detect whether the found node matches what we expect.

One corner case was forgotten, when the journal contains an entry to
remove an inode all xattrs have to be removed too. UBIFS models xattr
like directory entries, so the TNC code iterates over
all xattrs of the inode and removes them too. This code re-uses the
functions for walking directories and calls ubifs_tnc_next_ent().
ubifs_tnc_next_ent() expects to be used only after the journal and
aborts when a node does not match the expected result. This behavior can
render an UBIFS volume unmountable after a power-cut when xattrs are
used.

Fix this issue by using failable read functions in ubifs_tnc_next_ent()
too when replaying the journal.
Cc: stable@vger.kernel.org
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Reported-by: Rock Lee <rockdotlee@gmail.com>
Reviewed-by: David Gstir <david@sigma-star.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-01-17 14:35:58 +01:00
Eric Biggers
3d4b2fcbc9 ubifs: remove redundant checks for encryption key
In several places, ubifs checked for an encryption key before creating a
file in an encrypted directory.  This was redundant with
fscrypt_setup_filename() or ubifs_new_inode(), and in the case of
ubifs_link() it broke linking to special files.  So remove the extra
checks.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-01-17 14:34:21 +01:00
Eric Biggers
a75467d910 ubifs: allow encryption ioctls in compat mode
The ubifs encryption ioctls did not work when called by a 32-bit program
on a 64-bit kernel.  Since 'struct fscrypt_policy' is not affected by
the word size, ubifs just needs to allow these ioctls through, like what
ext4 and f2fs do.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-01-17 14:33:41 +01:00
Arnd Bergmann
404e0b6331 ubifs: add CONFIG_BLOCK dependency for encryption
This came up during the v4.10 merge window:

warning: (UBIFS_FS_ENCRYPTION) selects FS_ENCRYPTION which has unmet direct dependencies (BLOCK)
fs/crypto/crypto.c: In function 'fscrypt_zeroout_range':
fs/crypto/crypto.c:355:9: error: implicit declaration of function 'bio_alloc';did you mean 'd_alloc'? [-Werror=implicit-function-declaration]
   bio = bio_alloc(GFP_NOWAIT, 1);

The easiest way out is to limit UBIFS_FS_ENCRYPTION to configurations
that also enable BLOCK.

Fixes: d475a50745 ("ubifs: Add skeleton for fscrypto")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-01-17 14:32:47 +01:00
Peter Rosin
507502adf0 ubifs: fix unencrypted journal write
Without this, I get the following on reboot:

UBIFS error (ubi1:0 pid 703): ubifs_load_znode: bad target node (type 1) length (8240)
UBIFS error (ubi1:0 pid 703): ubifs_load_znode: have to be in range of 48-4144
UBIFS error (ubi1:0 pid 703): ubifs_load_znode: bad indexing node at LEB 13:11080, error 5
 magic          0x6101831
 crc            0xb1cb246f
 node_type      9 (indexing node)
 group_type     0 (no node group)
 sqnum          546
 len            128
 child_cnt      5
 level          0
 Branches:
 0: LEB 14:72088 len 161 key (133, inode)
 1: LEB 14:81120 len 160 key (134, inode)
 2: LEB 20:26624 len 8240 key (134, data, 0)
 3: LEB 14:81280 len 160 key (135, inode)
 4: LEB 20:34864 len 8240 key (135, data, 0)
UBIFS warning (ubi1:0 pid 703): ubifs_ro_mode.part.0: switched to read-only mode, error -22
CPU: 0 PID: 703 Comm: mount Not tainted 4.9.0-next-20161213+ #1197
Hardware name: Atmel SAMA5
[<c010d2ac>] (unwind_backtrace) from [<c010b250>] (show_stack+0x10/0x14)
[<c010b250>] (show_stack) from [<c024df94>] (ubifs_jnl_update+0x2e8/0x614)
[<c024df94>] (ubifs_jnl_update) from [<c0254bf8>] (ubifs_mkdir+0x160/0x204)
[<c0254bf8>] (ubifs_mkdir) from [<c01a6030>] (vfs_mkdir+0xb0/0x104)
[<c01a6030>] (vfs_mkdir) from [<c0286070>] (ovl_create_real+0x118/0x248)
[<c0286070>] (ovl_create_real) from [<c0283ed4>] (ovl_fill_super+0x994/0xaf4)
[<c0283ed4>] (ovl_fill_super) from [<c019c394>] (mount_nodev+0x44/0x9c)
[<c019c394>] (mount_nodev) from [<c019c4ac>] (mount_fs+0x14/0xa4)
[<c019c4ac>] (mount_fs) from [<c01b5338>] (vfs_kern_mount+0x4c/0xd4)
[<c01b5338>] (vfs_kern_mount) from [<c01b6b80>] (do_mount+0x154/0xac8)
[<c01b6b80>] (do_mount) from [<c01b782c>] (SyS_mount+0x74/0x9c)
[<c01b782c>] (SyS_mount) from [<c0107f80>] (ret_fast_syscall+0x0/0x3c)
UBIFS error (ubi1:0 pid 703): ubifs_mkdir: cannot create directory, error -22
overlayfs: failed to create directory /mnt/ovl/work/work (errno: 22); mounting read-only

Fixes: 7799953b34 ("ubifs: Implement encrypt/decrypt for all IO")
Signed-off-by: Peter Rosin <peda@axentia.se>
Tested-by: Kevin Hilman <khilman@baylibre.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-01-17 14:05:23 +01:00
Colin Ian King
e8f19746e4 ubifs: ensure zero err is returned on successful return
err is no longer being set on a successful return path, causing
a garbage value being returned. Fix this by setting err to zero
for the successful return path.

Found with static analysis by CoverityScan, CID 1389473

Fixes: 7799953b34 ("ubifs: Implement encrypt/decrypt for all IO")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
2017-01-17 13:57:32 +01:00
Linus Torvalds
5cf7a0f344 Merge tag 'nfs-for-4.10-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:

 - fix invalid fget()/fput() calls when doing file locking

 - fix multiple directory cache invalidation issues due to the client
   failing to recognise that the directory wasn't changed

 - fix client recovery when server reboots multiple times

* tag 'nfs-for-4.10-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4: Fix client recovery when server reboots multiple times
  NFSv4: update_changeattr should update the attribute timestamp
  NFSv4: Don't call update_changeattr() unless the unlink is successful
  NFSv4: Don't apply change_info4 twice on rename within a directory
  NFSv4: Call update_changeattr() from _nfs4_proc_open only if a file was created
  nfs: Don't take a reference on fl->fl_file for LOCK operation
2017-01-16 12:15:59 -08:00
Arnd Bergmann
51c89e6a3b afs: Conditionalise a new unused variable
The bulk readpages support introduced a harmless warning:

fs/afs/file.c: In function 'afs_readpages_page_done':
fs/afs/file.c:270:20: error: unused variable 'vnode' [-Werror=unused-variable]

This adds an #ifdef to match the user of that variable.  The user of the
variable has to be conditional because it accesses a member of a struct
that is also conditional.

Fixes: 91b467e0a3 ("afs: Make afs_readpages() fetch data in bulk")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-16 13:30:52 -05:00
Linus Torvalds
2eabb8b8d6 Merge tag 'nfsd-4.10-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd fixes from Bruce Fields:
 "Miscellaneous nfsd bugfixes, one for a 4.10 regression, three for
  older bugs"

* tag 'nfsd-4.10-1' of git://linux-nfs.org/~bfields/linux:
  svcrdma: avoid duplicate dma unmapping during error recovery
  sunrpc: don't call sleeping functions from the notifier block callbacks
  svcrpc: don't leak contexts on PROC_DESTROY
  nfsd: fix supported attributes for acl & labels
2017-01-16 09:34:37 -08:00
Linus Torvalds
99421c1cb2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace fixes from Eric Biederman:
 "This tree contains 4 fixes.

  The first is a fix for a race that can causes oopses under the right
  circumstances, and that someone just recently encountered.

  Past that are several small trivial correct fixes. A real issue that
  was blocking development of an out of tree driver, but does not appear
  to have caused any actual problems for in-tree code. A potential
  deadlock that was reported by lockdep. And a deadlock people have
  experienced and took the time to track down caused by a cleanup that
  removed the code to drop a reference count"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  sysctl: Drop reference added by grab_header in proc_sys_readdir
  pid: fix lockdep deadlock warning due to ucount_lock
  libfs: Modify mount_pseudo_xattr to be clear it is not a userspace mount
  mnt: Protect the mountpoint hashtable with mount_lock
2017-01-15 16:09:50 -08:00
Linus Torvalds
f4d3935e4f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro.

The most notable fix here is probably the fix for a splice regression
("fix a fencepost error in pipe_advance()") noticed by Alan Wylie.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix a fencepost error in pipe_advance()
  coredump: Ensure proper size of sparse core files
  aio: fix lock dep warning
  tmpfs: clear S_ISGID when setting posix ACLs
2017-01-14 17:13:28 -08:00
Linus Torvalds
34241af77b Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:

 - the virtio_blk stack DMA corruption fix from Christoph, fixing and
   issue with VMAP stacks.

 - O_DIRECT blkbits calculation fix from Chandan.

 - discard regression fix from Christoph.

 - queue init error handling fixes for nbd and virtio_blk, from Omar and
   Jeff.

 - two small nvme fixes, from Christoph and Guilherme.

 - rename of blk_queue_zone_size and bdev_zone_size to _sectors instead,
   to more closely follow what we do in other places in the block layer.
   This interface is new for this series, so let's get the naming right
   before releasing a kernel with this feature. From Damien.

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: don't try to discard from __blkdev_issue_zeroout
  sd: remove __data_len hack for WRITE SAME
  nvme: use blk_rq_payload_bytes
  scsi: use blk_rq_payload_bytes
  block: add blk_rq_payload_bytes
  block: Rename blk_queue_zone_size and bdev_zone_size
  nvme: apply DELAY_BEFORE_CHK_RDY quirk at probe time too
  nvme-rdma: fix nvme_rdma_queue_is_ready
  virtio_blk: fix panic in initialization error path
  nbd: blk_mq_init_queue returns an error code on failure, not NULL
  virtio_blk: avoid DMA to stack for the sense buffer
  do_direct_IO: Use inode->i_blkbits to compute block count to be cleaned
2017-01-14 17:07:04 -08:00
Dave Kleikamp
4d22c75d4c coredump: Ensure proper size of sparse core files
If the last section of a core file ends with an unmapped or zero page,
the size of the file does not correspond with the last dump_skip() call.
gdb complains that the file is truncated and can be confusing to users.

After all of the vma sections are written, make sure that the file size
is no smaller than the current file position.

This problem can be demonstrated with gdb's bigcore testcase on the
sparc architecture.

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-01-14 19:32:40 -05:00
Shaohua Li
a12f1ae61c aio: fix lock dep warning
lockdep reports a warnning. file_start_write/file_end_write only
acquire/release the lock for regular files. So checking the files in aio
side too.

[  453.532141] ------------[ cut here ]------------
[  453.533011] WARNING: CPU: 1 PID: 1298 at ../kernel/locking/lockdep.c:3514 lock_release+0x434/0x670
[  453.533011] DEBUG_LOCKS_WARN_ON(depth <= 0)
[  453.533011] Modules linked in:
[  453.533011] CPU: 1 PID: 1298 Comm: fio Not tainted 4.9.0+ #964
[  453.533011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014
[  453.533011]  ffff8803a24b7a70 ffffffff8196cffb ffff8803a24b7ae8 0000000000000000
[  453.533011]  ffff8803a24b7ab8 ffffffff81091ee1 ffff8803a5dba700 00000dba00000008
[  453.533011]  ffffed0074496f59 ffff8803a5dbaf54 ffff8803ae0f8488 fffffffffffffdef
[  453.533011] Call Trace:
[  453.533011]  [<ffffffff8196cffb>] dump_stack+0x67/0x9c
[  453.533011]  [<ffffffff81091ee1>] __warn+0x111/0x130
[  453.533011]  [<ffffffff81091f97>] warn_slowpath_fmt+0x97/0xb0
[  453.533011]  [<ffffffff81091f00>] ? __warn+0x130/0x130
[  453.533011]  [<ffffffff8191b789>] ? blk_finish_plug+0x29/0x60
[  453.533011]  [<ffffffff811205d4>] lock_release+0x434/0x670
[  453.533011]  [<ffffffff8198af94>] ? import_single_range+0xd4/0x110
[  453.533011]  [<ffffffff81322195>] ? rw_verify_area+0x65/0x140
[  453.533011]  [<ffffffff813aa696>] ? aio_write+0x1f6/0x280
[  453.533011]  [<ffffffff813aa6c9>] aio_write+0x229/0x280
[  453.533011]  [<ffffffff813aa4a0>] ? aio_complete+0x640/0x640
[  453.533011]  [<ffffffff8111df20>] ? debug_check_no_locks_freed+0x1a0/0x1a0
[  453.533011]  [<ffffffff8114793a>] ? debug_lockdep_rcu_enabled.part.2+0x1a/0x30
[  453.533011]  [<ffffffff81147985>] ? debug_lockdep_rcu_enabled+0x35/0x40
[  453.533011]  [<ffffffff812a92be>] ? __might_fault+0x7e/0xf0
[  453.533011]  [<ffffffff813ac9bc>] do_io_submit+0x94c/0xb10
[  453.533011]  [<ffffffff813ac2ae>] ? do_io_submit+0x23e/0xb10
[  453.533011]  [<ffffffff813ac070>] ? SyS_io_destroy+0x270/0x270
[  453.533011]  [<ffffffff8111d7b3>] ? mark_held_locks+0x23/0xc0
[  453.533011]  [<ffffffff8100201a>] ? trace_hardirqs_on_thunk+0x1a/0x1c
[  453.533011]  [<ffffffff813acb90>] SyS_io_submit+0x10/0x20
[  453.533011]  [<ffffffff824f96aa>] entry_SYSCALL_64_fastpath+0x18/0xad
[  453.533011]  [<ffffffff81119190>] ? trace_hardirqs_off_caller+0xc0/0x110
[  453.533011] ---[ end trace b2fbe664d1cc0082 ]---

Cc: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-01-14 19:31:40 -05:00
Rabin Vincent
81ddd8c0c5 cifs: initialize file_info_lock
Reviewed-by: Jeff Layton <jlayton@redhat.com>
CC: Stable <stable@vger.kernel.org>

file_info_lock is not initalized in initiate_cifs_search(), leading to the
following splat after a simple "mount.cifs ... dir && ls dir/":

 BUG: spinlock bad magic on CPU#0, ls/486
  lock: 0xffff880009301110, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
 CPU: 0 PID: 486 Comm: ls Not tainted 4.9.0 #27
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  ffffc900042f3db0 ffffffff81327533 0000000000000000 ffff880009301110
  ffffc900042f3dd0 ffffffff810baf75 ffff880009301110 ffffffff817ae077
  ffffc900042f3df0 ffffffff810baff6 ffff880009301110 ffff880008d69900
 Call Trace:
  [<ffffffff81327533>] dump_stack+0x65/0x92
  [<ffffffff810baf75>] spin_dump+0x85/0xe0
  [<ffffffff810baff6>] spin_bug+0x26/0x30
  [<ffffffff810bb159>] do_raw_spin_lock+0xe9/0x130
  [<ffffffff8159ad2f>] _raw_spin_lock+0x1f/0x30
  [<ffffffff8127e50d>] cifs_closedir+0x4d/0x100
  [<ffffffff81181cfd>] __fput+0x5d/0x160
  [<ffffffff81181e3e>] ____fput+0xe/0x10
  [<ffffffff8109410e>] task_work_run+0x7e/0xa0
  [<ffffffff81002512>] exit_to_usermode_loop+0x92/0xa0
  [<ffffffff810026f9>] syscall_return_slowpath+0x49/0x50
  [<ffffffff8159b484>] entry_SYSCALL_64_fastpath+0xa7/0xa9

Fixes: 3afca265b5 ("Clarify locking of cifs file and tcon structures and make more granular")
Signed-off-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-01-14 14:58:29 -06:00
Peter Zijlstra
2c935bc572 locking/atomic, kref: Add kref_read()
Since we need to change the implementation, stop exposing internals.

Provide kref_read() to read the current reference count; typically
used for debug messages.

Kills two anti-patterns:

	atomic_read(&kref->refcount)
	kref->refcount.counter

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:37:18 +01:00
Peter Zijlstra
1e24edca05 locking/atomic, kref: Add KREF_INIT()
Since we need to change the implementation, stop exposing internals.

Provide KREF_INIT() to allow static initialization of struct kref.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:37:18 +01:00
Tejun Heo
6fa7aa50b2 fs/jbd2, locking/mutex, sched/wait: Use mutex_lock_io() for journal->j_checkpoint_mutex
When an ext4 fs is bogged down by a lot of metadata IOs (in the
reported case, it was deletion of millions of files, but any massive
amount of journal writes would do), after the journal is filled up,
tasks which try to access the filesystem and aren't currently
performing the journal writes end up waiting in
__jbd2_log_wait_for_space() for journal->j_checkpoint_mutex.

Because those mutex sleeps aren't marked as iowait, this condition can
lead to misleadingly low iowait and /proc/stat:procs_blocked.  While
iowait propagation is far from strict, this condition can be triggered
fairly easily and annotating these sleeps correctly helps initial
diagnosis quite a bit.

Use the new mutex_lock_io() for journal->j_checkpoint_mutex so that
these sleeps are properly marked as iowait.

Reported-by: Mingbo Wan <mingbo@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Link: http://lkml.kernel.org/r/1477673892-28940-5-git-send-email-tj@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-14 11:30:06 +01:00
Linus Torvalds
e96f8f18c8 Merge branch 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "These are all over the place.

  The tracepoint part of the pull fixes a crash and adds a little more
  information to two tracepoints, while the rest are good old fashioned
  fixes"

* 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: make tracepoint format strings more compact
  Btrfs: add truncated_len for ordered extent tracepoints
  Btrfs: add 'inode' for extent map tracepoint
  btrfs: fix crash when tracepoint arguments are freed by wq callbacks
  Btrfs: adjust outstanding_extents counter properly when dio write is split
  Btrfs: fix lockdep warning about log_mutex
  Btrfs: use down_read_nested to make lockdep silent
  btrfs: fix locking when we put back a delayed ref that's too new
  btrfs: fix error handling when run_delayed_extent_op fails
  btrfs: return the actual error value from  from btrfs_uuid_tree_iterate
2017-01-13 17:40:22 -08:00
Linus Torvalds
04e396277b Merge tag 'ceph-for-4.10-rc4' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
 "Two small fixups for the filesystem changes that went into this merge
  window"

* tag 'ceph-for-4.10-rc4' of git://github.com/ceph/ceph-client:
  ceph: fix get_oldest_context()
  ceph: fix mds cluster availability check
2017-01-13 17:38:05 -08:00
Trond Myklebust
c6180a6237 NFSv4: Fix client recovery when server reboots multiple times
If the server reboots multiple times, the client should rely on the
server to tell it that it cannot reclaim state as per section 9.6.3.4
in RFC7530 and section 8.4.2.1 in RFC5661.
Currently, the client is being to conservative, and is assuming that
if the server reboots while state recovery is in progress, then it must
ignore state that was not recovered before the reboot.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-01-13 13:31:32 -05:00
David Sheets
210675270c fuse: fix time_to_jiffies nsec sanity check
Commit bcb6f6d2b9 ("fuse: use timespec64") introduced clamped nsec values
in time_to_jiffies but used the max of nsec and NSEC_PER_SEC - 1 instead of
the min. Because of this, dentries would stay in the cache longer than
requested and go stale in scenarios that relied on their timely eviction.

Fixes: bcb6f6d2b9 ("fuse: use timespec64")
Signed-off-by: David Sheets <dsheets@docker.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # 4.9
2017-01-13 17:20:47 +01:00
Tahsin Erdogan
a8a86d78d6 fuse: clear FR_PENDING flag when moving requests out of pending queue
fuse_abort_conn() moves requests from pending list to a temporary list
before canceling them. This operation races with request_wait_answer()
which also tries to remove the request after it gets a fatal signal. It
checks FR_PENDING flag to determine whether the request is still in the
pending list.

Make fuse_abort_conn() clear FR_PENDING flag so that request_wait_answer()
does not remove the request from temporary list.

This bug causes an Oops when trying to delete an already deleted list entry
in end_requests().

Fixes: ee314a870e ("fuse: abort: no fc->lock needed for request ending")
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # 4.2+
2017-01-13 12:03:47 +01:00
J. Bruce Fields
dcd2086977 nfsd: fix supported attributes for acl & labels
Oops--in 916d2d844a I moved some constants into an array for
convenience, but here I'm accidentally writing to that array.

The effect is that if you ever encounter a filesystem lacking support
for ACLs or security labels, then all queries of supported attributes
will report that attribute as unsupported from then on.

Fixes: 916d2d844a "nfsd: clean up supported attribute handling"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-01-12 15:55:51 -05:00
Trond Myklebust
d3129ef672 NFSv4: update_changeattr should update the attribute timestamp
Otherwise, the attribute cache remains marked as being expired.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-01-12 15:51:19 -05:00
Trond Myklebust
c40d52fe1c NFSv4: Don't call update_changeattr() unless the unlink is successful
If the unlink wasn't successful, then the directory has presumably not
changed.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-01-12 15:51:18 -05:00
Trond Myklebust
c733c49c32 NFSv4: Don't apply change_info4 twice on rename within a directory
If a file is renamed, but stays in the same directory, we will still receive
2 change_info4 structures describing the change to that directory, but we
only want to apply it once.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-01-12 15:51:18 -05:00
Trond Myklebust
2dfc617364 NFSv4: Call update_changeattr() from _nfs4_proc_open only if a file was created
We don't want to invalidate the directory attribute and data cache unless we
know that a file was created, or the change attribute differs from the one
in our cache.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-01-12 15:51:17 -05:00
Linus Torvalds
e28ac1fc31 Merge tag 'xfs-for-linus-4.10-rc4-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
 "As promised last week, here's some stability fixes from Christoph and
  Jan Kara:

   - fix free space request handling when low on disk space

   - remove redundant log failure error messages

   - free truncated dirty pages instead of letting them build up
     forever"

* tag 'xfs-for-linus-4.10-rc4-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: Timely free truncated dirty pages
  xfs: don't print warnings when xfs_log_force fails
  xfs: don't rely on ->total in xfs_alloc_space_available
  xfs: adjust allocation length in xfs_alloc_space_available
  xfs: fix bogus minleft manipulations
  xfs: bump up reserved blocks in xfs_alloc_set_aside
2017-01-12 11:06:26 -08:00
Geng, Jichao
84fcc2d2bd ceph: fix get_oldest_context()
For no snapshot case, we should use ci->truncate_{seq,size}.

Fixes: 5f743e4566 ("ceph: record truncate size/seq for snap data writeback")
Signed-off-by: Geng, Jichao <geng.jichao@h3c.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2017-01-12 19:31:01 +01:00
Yan, Zheng
cc8e834293 ceph: fix mds cluster availability check
We should apply the check after getting the initial mdsmap.

Fixes: e9e427f0a1 ("ceph: check availability of mds cluster on mount")
Link: http://tracker.ceph.com/issues/18161
Signed-off-by: Yan, Zheng <zyan@redhat.com>
2017-01-12 19:31:01 +01:00
Benjamin Coddington
4b09ec4b14 nfs: Don't take a reference on fl->fl_file for LOCK operation
I have reports of a crash that look like __fput() was called twice for
a NFSv4.0 file.  It seems possible that the state manager could try to
reclaim a lock and take a reference on the fl->fl_file at the same time the
file is being released if, during the close(), a signal interrupts the wait
for outstanding IO while removing locks which then skips the removal
of that lock.

Since 83bfff23e9 ("nfs4: have do_vfs_lock take an inode pointer") has
removed the need to traverse fl->fl_file->f_inode in nfs4_lock_done(),
taking that reference is no longer necessary.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-01-12 12:51:29 -05:00
Damien Le Moal
f99e86485c block: Rename blk_queue_zone_size and bdev_zone_size
All block device data fields and functions returning a number of 512B
sectors are by convention named xxx_sectors while names in the form
xxx_size are generally used for a number of bytes. The blk_queue_zone_size
and bdev_zone_size functions were not following this convention so rename
them.

No functional change is introduced by this patch.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>

Collapsed the two patches, they were nonsensically split and broke
bisection.

Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-12 07:58:32 -07:00
Al Viro
7880b43bdf 9p: constify ->d_name handling
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-01-12 04:01:17 -05:00
Theodore Ts'o
b907f2d519 ext4: avoid calling ext4_mark_inode_dirty() under unneeded semaphores
There is no need to call ext4_mark_inode_dirty while holding xattr_sem
or i_data_sem, so where it's easy to avoid it, move it out from the
critical region.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-01-11 22:14:49 -05:00
Theodore Ts'o
c755e25135 ext4: fix deadlock between inline_data and ext4_expand_extra_isize_ea()
The xattr_sem deadlock problems fixed in commit 2e81a4eeed: "ext4:
avoid deadlock when expanding inode size" didn't include the use of
xattr_sem in fs/ext4/inline.c.  With the addition of project quota
which added a new extra inode field, this exposed deadlocks in the
inline_data code similar to the ones fixed by 2e81a4eeed.

The deadlock can be reproduced via:

   dmesg -n 7
   mke2fs -t ext4 -O inline_data -Fq -I 256 /dev/vdc 32768
   mount -t ext4 -o debug_want_extra_isize=24 /dev/vdc /vdc
   mkdir /vdc/a
   umount /vdc
   mount -t ext4 /dev/vdc /vdc
   echo foo > /vdc/a/foo

and looks like this:

[   11.158815] 
[   11.160276] =============================================
[   11.161960] [ INFO: possible recursive locking detected ]
[   11.161960] 4.10.0-rc3-00015-g011b30a8a3cf #160 Tainted: G        W      
[   11.161960] ---------------------------------------------
[   11.161960] bash/2519 is trying to acquire lock:
[   11.161960]  (&ei->xattr_sem){++++..}, at: [<c1225a4b>] ext4_expand_extra_isize_ea+0x3d/0x4cd
[   11.161960] 
[   11.161960] but task is already holding lock:
[   11.161960]  (&ei->xattr_sem){++++..}, at: [<c1227941>] ext4_try_add_inline_entry+0x3a/0x152
[   11.161960] 
[   11.161960] other info that might help us debug this:
[   11.161960]  Possible unsafe locking scenario:
[   11.161960] 
[   11.161960]        CPU0
[   11.161960]        ----
[   11.161960]   lock(&ei->xattr_sem);
[   11.161960]   lock(&ei->xattr_sem);
[   11.161960] 
[   11.161960]  *** DEADLOCK ***
[   11.161960] 
[   11.161960]  May be due to missing lock nesting notation
[   11.161960] 
[   11.161960] 4 locks held by bash/2519:
[   11.161960]  #0:  (sb_writers#3){.+.+.+}, at: [<c11a2414>] mnt_want_write+0x1e/0x3e
[   11.161960]  #1:  (&type->i_mutex_dir_key){++++++}, at: [<c119508b>] path_openat+0x338/0x67a
[   11.161960]  #2:  (jbd2_handle){++++..}, at: [<c123314a>] start_this_handle+0x582/0x622
[   11.161960]  #3:  (&ei->xattr_sem){++++..}, at: [<c1227941>] ext4_try_add_inline_entry+0x3a/0x152
[   11.161960] 
[   11.161960] stack backtrace:
[   11.161960] CPU: 0 PID: 2519 Comm: bash Tainted: G        W       4.10.0-rc3-00015-g011b30a8a3cf #160
[   11.161960] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014
[   11.161960] Call Trace:
[   11.161960]  dump_stack+0x72/0xa3
[   11.161960]  __lock_acquire+0xb7c/0xcb9
[   11.161960]  ? kvm_clock_read+0x1f/0x29
[   11.161960]  ? __lock_is_held+0x36/0x66
[   11.161960]  ? __lock_is_held+0x36/0x66
[   11.161960]  lock_acquire+0x106/0x18a
[   11.161960]  ? ext4_expand_extra_isize_ea+0x3d/0x4cd
[   11.161960]  down_write+0x39/0x72
[   11.161960]  ? ext4_expand_extra_isize_ea+0x3d/0x4cd
[   11.161960]  ext4_expand_extra_isize_ea+0x3d/0x4cd
[   11.161960]  ? _raw_read_unlock+0x22/0x2c
[   11.161960]  ? jbd2_journal_extend+0x1e2/0x262
[   11.161960]  ? __ext4_journal_get_write_access+0x3d/0x60
[   11.161960]  ext4_mark_inode_dirty+0x17d/0x26d
[   11.161960]  ? ext4_add_dirent_to_inline.isra.12+0xa5/0xb2
[   11.161960]  ext4_add_dirent_to_inline.isra.12+0xa5/0xb2
[   11.161960]  ext4_try_add_inline_entry+0x69/0x152
[   11.161960]  ext4_add_entry+0xa3/0x848
[   11.161960]  ? __brelse+0x14/0x2f
[   11.161960]  ? _raw_spin_unlock_irqrestore+0x44/0x4f
[   11.161960]  ext4_add_nondir+0x17/0x5b
[   11.161960]  ext4_create+0xcf/0x133
[   11.161960]  ? ext4_mknod+0x12f/0x12f
[   11.161960]  lookup_open+0x39e/0x3fb
[   11.161960]  ? __wake_up+0x1a/0x40
[   11.161960]  ? lock_acquire+0x11e/0x18a
[   11.161960]  path_openat+0x35c/0x67a
[   11.161960]  ? sched_clock_cpu+0xd7/0xf2
[   11.161960]  do_filp_open+0x36/0x7c
[   11.161960]  ? _raw_spin_unlock+0x22/0x2c
[   11.161960]  ? __alloc_fd+0x169/0x173
[   11.161960]  do_sys_open+0x59/0xcc
[   11.161960]  SyS_open+0x1d/0x1f
[   11.161960]  do_int80_syscall_32+0x4f/0x61
[   11.161960]  entry_INT80_32+0x2f/0x2f
[   11.161960] EIP: 0xb76ad469
[   11.161960] EFLAGS: 00000286 CPU: 0
[   11.161960] EAX: ffffffda EBX: 08168ac8 ECX: 00008241 EDX: 000001b6
[   11.161960] ESI: b75e46bc EDI: b7755000 EBP: bfbdb108 ESP: bfbdafc0
[   11.161960]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b

Cc: stable@vger.kernel.org # 3.10 (requires 2e81a4eeed as a prereq)
Reported-by: George Spelvin <linux@sciencehorizons.net>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-01-11 21:50:46 -05:00
Theodore Ts'o
670e9875eb ext4: add debug_want_extra_isize mount option
In order to test the inode extra isize expansion code, it is useful to
be able to easily create file systems that have inodes with extra
isize values smaller than the current desired value.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-01-11 15:32:22 -05:00
David S. Miller
02ac5d1487 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two AF_* families adding entries to the lockdep tables
at the same time.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-11 14:43:39 -05:00
Jan Kara
0a417b8dc1 xfs: Timely free truncated dirty pages
Commit 99579ccec4 "xfs: skip dirty pages in ->releasepage()" started
to skip dirty pages in xfs_vm_releasepage() which also has the effect
that if a dirty page is truncated, it does not get freed by
block_invalidatepage() and is lingering in LRU list waiting for reclaim.
So a simple loop like:

while true; do
	dd if=/dev/zero of=file bs=1M count=100
	rm file
done

will keep using more and more memory until we hit low watermarks and
start pagecache reclaim which will eventually reclaim also the truncate
pages. Keeping these truncated (and thus never usable) pages in memory
is just a waste of memory, is unnecessarily stressing page cache
reclaim, and reportedly also leads to anonymous mmap(2) returning ENOMEM
prematurely.

So instead of just skipping dirty pages in xfs_vm_releasepage(), return
to old behavior of skipping them only if they have delalloc or unwritten
buffers and fix the spurious warnings by warning only if the page is
clean.

CC: stable@vger.kernel.org
CC: Brian Foster <bfoster@redhat.com>
CC: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Petr Tůma <petr.tuma@d3s.mff.cuni.cz>
Fixes: 99579ccec4
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-01-11 10:20:04 -08:00
Chris Mason
0bf70aebf1 Merge branch 'tracepoint-updates-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.10 2017-01-11 06:26:12 -08:00
Eric Ren
e7ee2c089e ocfs2: fix crash caused by stale lvb with fsdlm plugin
The crash happens rather often when we reset some cluster nodes while
nodes contend fiercely to do truncate and append.

The crash backtrace is below:

   dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources
   dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms
   ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18)
   ocfs2: End replay journal (node 318952601, slot 2) on device (253,18)
   ocfs2: Beginning quota recovery on device (253,18) for slot 2
   ocfs2: Finishing quota recovery on device (253,18) for slot 2
   (truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
   (truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1
   ------------[ cut here ]------------
   kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470!
   invalid opcode: 0000 [#1] SMP
   Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod    iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport      joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix               drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd       usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
   Supported: No, Unsupported modules are loaded
   CPU: 1 PID: 30154 Comm: truncate Tainted: G           OE   N  4.4.21-69-default #1
   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014
   task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000
   RIP: 0010:[<ffffffffa05c8c30>]  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
   RSP: 0018:ffff880074e6bd50  EFLAGS: 00010282
   RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000
   RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
   RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414
   R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448
   R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020
   FS:  00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
   CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0
   Call Trace:
     ocfs2_setattr+0x698/0xa90 [ocfs2]
     notify_change+0x1ae/0x380
     do_truncate+0x5e/0x90
     do_sys_ftruncate.constprop.11+0x108/0x160
     entry_SYSCALL_64_fastpath+0x12/0x6d
   Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff
   RIP  [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]

It's because ocfs2_inode_lock() get us stale LVB in which the i_size is
not equal to the disk i_size.  We mistakenly trust the LVB because the
underlaying fsdlm dlm_lock() doesn't set lkb_sbflags with
DLM_SBF_VALNOTVALID properly for us.  But, why?

The current code tries to downconvert lock without DLM_LKF_VALBLK flag
to tell o2cb don't update RSB's LVB if it's a PR->NULL conversion, even
if the lock resource type needs LVB.  This is not the right way for
fsdlm.

The fsdlm plugin behaves different on DLM_LKF_VALBLK, it depends on
DLM_LKF_VALBLK to decide if we care about the LVB in the LKB.  If
DLM_LKF_VALBLK is not set, fsdlm will skip recovering RSB's LVB from
this lkb and set the right DLM_SBF_VALNOTVALID appropriately when node
failure happens.

The following diagram briefly illustrates how this crash happens:

RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB;

The 1st round:

             Node1                                    Node2
RSB1: PR
                                                  RSB1(master): NULL->EX
ocfs2_downconvert_lock(PR->NULL, set_lvb==0)
  ocfs2_dlm_lock(no DLM_LKF_VALBLK)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

dlm_lock(no DLM_LKF_VALBLK)
  convert_lock(overwrite lkb->lkb_exflags
               with no DLM_LKF_VALBLK)

RSB1: NULL                                        RSB1: EX
                                                  reset Node2
dlm_recover_rsbs()
  recover_lvb()

/* The LVB is not trustable if the node with EX fails and
 * no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
 */

 if(!(kb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance to
           return;                   * to invalid the LVB here.
                                     */

The 2nd round:

         Node 1                                Node2
RSB1(become master from recovery)

ocfs2_setattr()
  ocfs2_inode_lock(NULL->EX)
    /* dlm_lock() return the stale lvb without setting DLM_SBF_VALNOTVALID */
    ocfs2_meta_lvb_is_trustable() return 1 /* so we don't refresh inode from disk */
  ocfs2_truncate_file()
      mlog_bug_on_msg(disk isize != i_size_read(inode))  /* crash! */

The fix is quite straightforward.  We keep to set DLM_LKF_VALBLK flag
for dlm_lock() if the lock resource type needs LVB and the fsdlm plugin
is uesed.

Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com
Signed-off-by: Eric Ren <zren@suse.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-01-10 18:31:54 -08:00