linux

Author	SHA1	Message	Date
Darrick J. Wong	65092ca140	xfs: simplify returns in xchk_bmap Remove the pointless goto and return code in xchk_bmap, since it only serves to obscure what's going on in the function. Instead, return whichever error code is appropriate there. For nonexistent forks, this should have been ENOENT. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:13 -07:00
Darrick J. Wong	369c001b7a	xfs: rewrite xchk_inode_is_allocated to work properly Back in the mists of time[1], I proposed this function to assist the inode btree scrubbers in checking the inode btree contents against the allocation state of the inode records. The original version performed a direct lookup in the inode cache and returned the allocation status if the cached inode hadn't been reused and wasn't in an intermediate state. Brian thought it would be better to use the usual iget/irele mechanisms, so that was changed for the final version. Unfortunately, this hasn't aged well -- the IGET_INCORE flag only has one user and clutters up the regular iget path, which makes it hard to reason about how it actually works. Worse yet, the inode inactivation series silently broke it because iget won't return inodes that are anywhere in the inactivation machinery, even though the caller is already required to prevent inode allocation and freeing. Inodes in the inactivation machinery are still allocated, but the current code's interactions with the iget code prevent us from being able to say that. Now that I understand the inode lifecycle better than I did in early 2017, I now realize that as long as the cached inode hasn't been reused and isn't actively being reclaimed, it's safe to access the i_mode field (with the AGI, rcu, and i_flags locks held), and we don't need to worry about the inode being freed out from under us. Therefore, port the original version to modern code structure, which fixes the brokennes w.r.t. inactivation. In the next patch we'll remove IGET_INCORE since it's no longer necessary. [1] https://lore.kernel.org/linux-xfs/149643868294.23065.8094890990886436794.stgit@birch.djwong.org/ Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:12 -07:00
Darrick J. Wong	0d29663453	xfs: hide xfs_inode_is_allocated in scrub common code This function is only used by online fsck, so let's move it there. In the next patch, we'll fix it to work properly and to require that the caller hold the AGI buffer locked. No major changes aside from adjusting the signature a bit. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:12 -07:00
Darrick J. Wong	a634c0a60b	xfs: fix agf_fllast when repairing an empty AGFL xfs/139 with parent pointers enabled occasionally pops up a corruption message when online fsck force-rebuild repairs an AGFL: XFS (sde): Metadata corruption detected at xfs_agf_verify+0x11e/0x220 [xfs], xfs_agf block 0x9e0001 XFS (sde): Unmount and run xfs_repair XFS (sde): First 128 bytes of corrupted metadata buffer: 00000000: 58 41 47 46 00 00 00 01 00 00 00 4f 00 00 40 00 XAGF.......O..@. 00000010: 00 00 00 01 00 00 00 02 00 00 00 05 00 00 00 01 ................ 00000020: 00 00 00 01 00 00 00 01 00 00 00 00 ff ff ff ff ................ 00000030: 00 00 00 00 00 00 00 05 00 00 00 05 00 00 00 00 ................ 00000040: 91 2e 6f b1 ed 61 4b 4d 8c 9b 6e 87 08 bb f6 36 ..o..aKM..n....6 00000050: 00 00 00 01 00 00 00 01 00 00 00 06 00 00 00 01 ................ 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ The root cause of this failure is that prior to the repair, there were zero blocks in the AGFL. This scenario is set up by the test case, since it formats with 64MB AGs and tries to ENOSPC the whole filesystem. In this case of flcount==0, we reset fllast to -1U, which then trips the write verifier's check that fllast is less than xfs_agfl_size(). Correct this code to set fllast to the last possible slot in the AGFL when flcount is zero, which mirrors the behavior of xfs_repair phase5 when it has to create a totally empty AGFL. Fixes: `0e93d3f43e` ("xfs: repair the AGFL") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:11 -07:00
Darrick J. Wong	9ce7f9b225	xfs: clear pagf_agflreset when repairing the AGFL Clear the pagf_agflreset flag when we're repairing the AGFL because we fix all the same padding problems that xfs_agfl_reset does. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:11 -07:00
Darrick J. Wong	5c83df2e54	xfs: allow userspace to rebuild metadata structures Add a new (superuser-only) flag to the online metadata repair ioctl to force it to rebuild structures, even if they're not broken. We will use this to move metadata structures out of the way during a free space defragmentation operation. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:11 -07:00
Darrick J. Wong	8336a64eb7	xfs: don't complain about unfixed metadata when repairs were injected While debugging other parts of online repair, I noticed that if someone injects FORCE_SCRUB_REPAIR, starts an IFLAG_REPAIR scrub on a piece of metadata, and the metadata repair fails, we'll log a message about uncorrected errors in the filesystem. This isn't strictly true if the scrub function didn't set OFLAG_CORRUPT and we're only doing the repair because the error injection knob is set. Repair functions are allowed to abort the entire operation at any point before committing new metadata, in which case the piece of metadata is in the same state as it was before. Therefore, the log message should be gated on the results of the scrub. Refactor the predicate and rearrange the code flow to make this happen. Note: If the repair function errors out after it commits the new metadata, the transaction cancellation will shut down the filesystem, which is an obvious sign of corrupt metadata. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:10 -07:00
Darrick J. Wong	d728f4e3b2	xfs: allow the user to cancel repairs before we start writing All online repair functions have the same structure: walk filesystem metadata structures gathering enough data to rebuild the structure, stage a new copy, and then commit the new copy. The gathering steps do not write anything to disk, so they are peppered with xchk_should_terminate calls to avoid softlockup warnings and to provide an opportunity to abort the repair (by killing xfs_scrub). However, it's not clear in the code base when is the last chance to abort cleanly without having to undo a bunch of structure. Therefore, add one more call to xchk_should_terminate (along with a comment) providing the sysadmin with the ability to abort before it's too late and to make it clear in the source code when it's no longer convenient or safe to abort a repair. As there are only four repair functions right now, this patch exists more to establish a precedent for subsequent additions than to deliver practical functionality. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:10 -07:00
Darrick J. Wong	d65eb8a633	xfs: always rescan allegedly healthy per-ag metadata after repair After an online repair function runs for a per-AG metadata structure, sc->sick_mask is supposed to reflect the per-AG metadata that the repair function fixed. Our next move is to re-check the metadata to assess the completeness of our repair, so we don't want the rebuilt structure to be excluded from the rescan just because the health system previously logged a problem with the data structure. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:09 -07:00
Darrick J. Wong	526aab5f57	xfs: implement online scrubbing of rtsummary info Finish the realtime summary scrubber by adding the functions we need to compute a fresh copy of the rtsummary info and comparing it to the copy on disk. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:09 -07:00
Darrick J. Wong	b7d47a77b9	xfs: move the realtime summary file scrubber to a separate source file Move the realtime summary file checking code to a separate file in preparation to actually implement it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:09 -07:00
Darrick J. Wong	294012fb07	xfs: wrap ilock/iunlock operations on sc->ip Scrub tracks the resources that it's holding onto in the xfs_scrub structure. This includes the inode being checked (if applicable) and the inode lock state of that inode. Replace the open-coded structure manipulation with a trivial helper to eliminate sources of error. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:08 -07:00
Darrick J. Wong	1730853950	xfs: get our own reference to inodes that we want to scrub When we want to scrub a file, get our own reference to the inode unconditionally. This will make disposal rules simpler in the long run. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:08 -07:00
Darrick J. Wong	d7a74cad8f	xfs: track usage statistics of online fsck Track the usage, outcomes, and run times of the online fsck code, and report these values via debugfs. The columns in the file are: * scrubber name * number of scrub invocations * clean objects found * corruptions found * optimizations found * cross referencing failures * inconsistencies found during cross referencing * incomplete scrubs * warnings * number of time scrub had to retry * cumulative amount of time spent scrubbing (microseconds) * number of repair inovcations * successfully repaired objects * cumuluative amount of time spent repairing (microseconds) Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:07 -07:00
Darrick J. Wong	a76dba3b24	xfs: create scaffolding for creating debugfs entries Set up debugfs directories for xfs as a whole, and a subdirectory for each mounted filesystem. This will enable the creation of debugfs files in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:07 -07:00
Darrick J. Wong	764018caa9	xfs: improve xfarray quicksort pivot Now that we have the means to do insertion sorts of small in-memory subsets of an xfarray, use it to improve the quicksort pivot algorithm by reading 7 records into memory and finding the median of that. This should prevent bad partitioning when a[lo] and a[hi] end up next to each other in the final sort, which can happen when sorting for cntbt repair when the free space is extremely fragmented (e.g. generic/176). This doesn't speed up the average quicksort run by much, but it will (hopefully) avoid the quadratic time collapse for which quicksort is famous. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:07 -07:00
Darrick J. Wong	cf36f4f64c	xfs: cache pages used for xfarray quicksort convergence After quicksort picks a pivot item for a particular subsort, it walks the records in that subset from the outside in, rearranging them so that every record less than the pivot comes before it, and every record greater than the pivot comes after it. This scan has a lot of locality, so we can speed it up quite a bit by grabbing the xfile backing page and holding onto it as long as we possibly can. Doing so reduces the runtime by another 5% on the author's computer. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:06 -07:00
Darrick J. Wong	e5b46c7589	xfs: speed up xfarray sort by sorting xfile page contents directly If all the records in an xfarray subset live within the same memory page, we can short-circuit even more quicksort recursion by mapping that page into the local CPU and using the kernel's heapsort function to sort the subset. On the author's computer, this reduces the runtime by another 15% on a 500,000 element array. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:06 -07:00
Darrick J. Wong	137db333b2	xfs: teach xfile to pass back direct-map pages to caller Certain xfile array operations (such as sorting) can be sped up quite a bit by allowing xfile users to grab a page to bulk-read the records contained within it. Create helper methods to facilitate this. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:05 -07:00
Darrick J. Wong	c390c64503	xfs: convert xfarray insertion sort to heapsort using scratchpad memory In the previous patch, we created a very basic quicksort implementation for xfile arrays. While the use of an alternate sorting algorithm to avoid quicksort recursion on very small subsets reduces the runtime modestly, we could do better than a load and store-heavy insertion sort, particularly since each load and store requires a page mapping lookup in the xfile. For a small increase in kernel memory requirements, we could instead bulk load the xfarray records into memory, use the kernel's existing heapsort implementation to sort the records, and bulk store the memory buffer back into the xfile. On the author's computer, this reduces the runtime by about 5% on a 500,000 element array. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:05 -07:00
Darrick J. Wong	232ea05277	xfs: enable sorting of xfile-backed arrays The btree bulk loading code requires that records be provided in the correct record sort order for the given btree type. In general, repair code cannot be required to collect records in order, and it is not feasible to insert new records in the middle of an array to maintain sort order. Implement a sorting algorithm so that we can sort the records just prior to bulk loading. In principle, an xfarray could consume many gigabytes of memory and its backing pages can be sent out to disk at any time. This means that we cannot map the entire array into memory at once, so we must find a way to divide the work into smaller portions (e.g. a page) that /can/ be mapped into memory. Quicksort seems like a reasonable fit for this purpose, since it uses a divide and conquer strategy to keep its average runtime logarithmic. The solution presented here is a port of the glibc implementation, which itself is derived from the median-of-three and tail call recursion strategies outlined by Sedgwick. Subsequent patches will optimize the implementation further by utilizing the kernel's heapsort on directly-mapped memory whenever possible, and improving the quicksort pivot selection algorithm to try to avoid O(n^2) collapses. Note: The sorting functionality gets its own patch because the basic big array mechanisms were plenty for a single code patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:05 -07:00
Darrick J. Wong	3934e8ebb7	xfs: create a big array data structure Create a simple 'big array' data structure for storage of fixed-size metadata records that will be used to reconstruct a btree index. For repair operations, the most important operations are append, iterate, and sort. Earlier implementations of the big array used linked lists and suffered from severe problems -- pinning all records in kernel memory was not a good idea and frequently lead to OOM situations; random access was very inefficient; and record overhead for the lists was unacceptably high at 40-60%. Therefore, the big memory array relies on the 'xfile' abstraction, which creates a memfd file and stores the records in page cache pages. Since the memfd is created in tmpfs, the memory pages can be pushed out to disk if necessary and we have a built-in usage limit of 50% of physical memory. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:04 -07:00
Darrick J. Wong	014ad53732	xfs: use per-AG bitmaps to reap unused AG metadata blocks during repair The AGFL repair code uses a series of bitmaps to figure out where there are OWN_AG blocks that are not claimed by the free space and rmap btrees. These blocks become the new AGFL, and any overflow is reaped. The bitmaps current track xfs_fsblock_t even though we already know the AG number. In the last patch, we introduced a new bitmap "type" for tracking xfs_agblock_t extents. Port the reaping code and the AGFL repair to use this new type, which makes it very obvious what we're tracking. This also eliminates a bunch of unnecessary agblock <-> fsblock conversions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:04 -07:00
Darrick J. Wong	1c7ce115e5	xfs: reap large AG metadata extents when possible When we're freeing extents that have been set in a bitmap, break the bitmap extent into multiple sub-extents organized by fate, and reap the extents. This enables us to dispose of old resources more efficiently than doing them block by block. While we're at it, rename the reaping functions to make it clear that they're reaping per-AG extents. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:04 -07:00
Darrick J. Wong	9ed851f695	xfs: allow scanning ranges of the buffer cache for live buffers After an online repair, we need to invalidate buffers representing the blocks from the old metadata that we're replacing. It's possible that parts of a tree that were previously cached in memory are no longer accessible due to media failure or other corruption on interior nodes, so repair figures out the old blocks from the reverse mapping data and scans the buffer cache directly. In other words, online fsck needs to find all the live (i.e. non-stale) buffers for a range of fsblocks so that it can invalidate them. Unfortunately, the current buffer cache code triggers asserts if the rhashtable lookup finds a non-stale buffer of a different length than the key we searched for. For regular operation this is desirable, but for this repair procedure, we don't care since we're going to forcibly stale the buffer anyway. Add an internal lookup flag to avoid the assert. Skip buffers that are already XBF_STALE. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:03 -07:00
Darrick J. Wong	77a1396f9f	xfs: rearrange xrep_reap_block to make future code flow easier Rearrange the logic inside xrep_reap_block to make it more obvious that crosslinked metadata blocks are handled differently. Add a couple of tracepoints so that we can tell what's going on at the end of a btree rebuild operation. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:03 -07:00
Darrick J. Wong	5fee784ed0	xfs: use deferred frees to reap old btree blocks Use deferred frees (EFIs) to reap the blocks of a btree that we just replaced. This helps us to shrink the window in which those old blocks could be lost due to a system crash, though we try to flush the EFIs every few hundred blocks so that we don't also overflow the transaction reservations during and after we commit the new btree. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:02 -07:00
Darrick J. Wong	a55e073088	xfs: only allow reaping of per-AG blocks in xrep_reap_extents Now that we've refactored btree cursors to require the caller to pass in a perag structure, there are numerous problems in xrep_reap_extents if it's being called to reap extents for an inode metadata repair. We don't have any repair functions that can do that, so drop the support for now. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:02 -07:00
Darrick J. Wong	8e54e06b5c	xfs: only invalidate blocks if we're going to free them When we're discarding old btree blocks after a repair, only invalidate the buffers for the ones that we're freeing -- if the metadata was crosslinked with another data structure, we don't want to touch it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:02 -07:00
Darrick J. Wong	e06ef14b9f	xfs: move the post-repair block reaping code to a separate file Reaping blocks after a repair is a complicated affair involving a lot of rmap btree lookups and figuring out if we're going to unmap or free old metadata blocks that might be crosslinked. Eventually, we will need to be able to reap per-AG metadata blocks, bmbt blocks from inode forks, garbage CoW staging extents, and (even later) blocks from btrees rooted in inodes. This results in a lot of reaping code, so we might as well split that off while it's easy. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:01 -07:00
Darrick J. Wong	86a464179c	xfs: cull repair code that will never get used These two functions date from the era when I thought that we could rebuild btrees by creating an alternate root and adding records one by one. In other words, they predate the btree bulk loader. They're not necessary now, so remove them. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-08-10 07:48:01 -07:00
Linus Torvalds	0108963f14	Merge tag 'v6.5-rc5.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix a wrong check for O_TMPFILE during RESOLVE_CACHED lookup - Clean up directory iterators and clarify file_needs_f_pos_lock() * tag 'v6.5-rc5.vfs.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: rely on ->iterate_shared to determine f_pos locking vfs: get rid of old '->iterate' directory operation proc: fix missing conversion to 'iterate_shared' open: make RESOLVE_CACHED correctly test for O_TMPFILE	2023-08-06 10:43:52 -07:00
Christian Brauner	7d84d1b9af	fs: rely on ->iterate_shared to determine f_pos locking Now that we removed ->iterate we don't need to check for either ->iterate or ->iterate_shared in file_needs_f_pos_lock(). Simply check for ->iterate_shared instead. This will tell us whether we need to unconditionally take the lock. Not just does it allow us to avoid checking f_inode's mode it also actually clearly shows that we're locking because of readdir. Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-08-06 15:08:36 +02:00
Linus Torvalds	3e32715496	vfs: get rid of old '->iterate' directory operation All users now just use '->iterate_shared()', which only takes the directory inode lock for reading. Filesystems that never got convered to shared mode now instead use a wrapper that drops the lock, re-takes it in write mode, calls the old function, and then downgrades the lock back to read mode. This way the VFS layer and other callers no longer need to care about filesystems that never got converted to the modern era. The filesystems that use the new wrapper are ceph, coda, exfat, jfs, ntfs, ocfs2, overlayfs, and vboxsf. Honestly, several of them look like they really could just iterate their directories in shared mode and skip the wrapper entirely, but the point of this change is to not change semantics or fix filesystems that haven't been fixed in the last 7+ years, but to finally get rid of the dual iterators. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-08-06 15:08:35 +02:00
Linus Torvalds	0a2c2baafa	proc: fix missing conversion to 'iterate_shared' I'm looking at the directory handling due to the discussion about f_pos locking (see commit `797964253d`: "file: reinstate f_pos locking optimization for regular files"), and wanting to clean that up. And one source of ugliness is how we were supposed to move filesystems over to the '->iterate_shared()' function that only takes the inode lock for reading many many years ago, but several filesystems still use the bad old '->iterate()' that takes the inode lock for exclusive access. See commit `6192269444` ("introduce a parallel variant of ->iterate()") that also added some documentation stating Old method is only used if the new one is absent; eventually it will be removed. Switch while you still can; the old one won't stay. and that was back in April 2016. Here we are, many years later, and the old version is still clearly sadly alive and well. Now, some of those old style iterators are probably just because the filesystem may end up having per-inode mutable data that it uses for iterating a directory, but at least one case is just a mistake. Al switched over most filesystems to use '->iterate_shared()' back when it was introduced. In particular, the /proc filesystem was converted as one of the first ones in commit `f50752eaa0` ("switch all procfs directories ->iterate_shared()"). But then later one new user of '->iterate()' was then re-introduced by commit `6d9c939dbe` ("procfs: add smack subdir to attrs"). And that's clearly not what we wanted, since that new case just uses the same 'proc_pident_readdir()' and 'proc_pident_lookup()' helper functions that other /proc pident directories use, and they are most definitely safe to use with the inode lock held shared. So just fix it. This still leaves a fair number of oddball filesystems using the old-style directory iterator (ceph, coda, exfat, jfs, ntfs, ocfs2, overlayfs, and vboxsf), but at least we don't have any remaining in the core filesystems. I'm going to add a wrapper function that just drops the read-lock and takes it as a write lock, so that we can clean up the core vfs layer and make all the ugly 'this filesystem needs exclusive inode locking' be just filesystem-internal warts. I just didn't want to make that conversion when we still had a core user left. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-08-06 15:08:35 +02:00
Aleksa Sarai	a0fc452a5d	open: make RESOLVE_CACHED correctly test for O_TMPFILE O_TMPFILE is actually __O_TMPFILE\|O_DIRECTORY. This means that the old fast-path check for RESOLVE_CACHED would reject all users passing O_DIRECTORY with -EAGAIN, when in fact the intended test was to check for __O_TMPFILE. Cc: stable@vger.kernel.org # v5.12+ Fixes: `99668f6180` ("fs: expose LOOKUP_CACHED through openat2() RESOLVE_CACHED") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Message-Id: <20230806-resolve_cached-o_tmpfile-v1-1-7ba16308465e@cyphar.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-08-06 15:08:35 +02:00
Linus Torvalds	f6a6916859	Merge tag '6.5-rc4-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6 Pull smb client fix from Steve French: - Fix DFS interlink problem (different namespace) * tag '6.5-rc4-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6: smb: client: fix dfs link mount against w2k8	2023-08-05 13:44:06 -07:00
Linus Torvalds	4593f3c2c6	Merge tag 'ceph-for-6.5-rc5' of https://github.com/ceph/ceph-client Pull ceph fixes from Ilya Dryomov: "Two patches to improve RBD exclusive lock interaction with osd_request_timeout option and another fix to reduce the potential for erroneous blocklisting -- this time in CephFS. All going to stable" * tag 'ceph-for-6.5-rc5' of https://github.com/ceph/ceph-client: libceph: fix potential hang in ceph_osdc_notify() rbd: prevent busy loop when requesting exclusive lock ceph: defer stopping mdsc delayed_work	2023-08-04 11:29:38 -07:00
Linus Torvalds	797964253d	file: reinstate f_pos locking optimization for regular files In commit `20ea1e7d13` ("file: always lock position for FMODE_ATOMIC_POS") we ended up always taking the file pos lock, because pidfd_getfd() could get a reference to the file even when it didn't have an elevated file count due to threading of other sharing cases. But Mateusz Guzik reports that the extra locking is actually measurable, so let's re-introduce the optimization, and only force the locking for directory traversal. Directories need the lock for correctness reasons, while regular files only need it for "POSIX semantics". Since pidfd_getfd() is about debuggers etc special things that are _way_ outside of POSIX, we can relax the rules for that case. Reported-by: Mateusz Guzik <mjguzik@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/linux-fsdevel/20230803095311.ijpvhx3fyrbkasul@f/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2023-08-04 11:22:14 -07:00
Linus Torvalds	7bafbd4027	Merge tag 'nfsd-6.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fix from Chuck Lever: - Fix tmpfs splice read support * tag 'nfsd-6.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: Fix reading via splice	2023-08-03 09:26:34 -07:00
Linus Torvalds	556c9424e2	Merge tag 'erofs-for-6.5-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: - Fix data corruption caused by insufficient decompression on deduplicated compressed extents - Drop a useless s_magic checking in erofs_kill_sb() * tag 'erofs-for-6.5-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: drop unnecessary WARN_ON() in erofs_kill_sb() erofs: fix wrong primary bvec selection on deduplicated extents	2023-08-03 09:20:50 -07:00
Linus Torvalds	4b954598a4	Merge tag 'exfat-for-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat fixes from Namjae Jeon: - Fix page allocation failure from allocation bitmap by using kvmalloc_array/kvfree - Add the check to validate if filename entries exceeds max filename length - Fix potential deadlock condition from dir_emit() tag 'exfat-for-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: release s_lock before calling dir_emit() exfat: check if filename entries exceeds max filename length exfat: use kvmalloc_array/kvfree instead of kmalloc_array/kfree	2023-08-02 11:43:06 -07:00
Paulo Alcantara	11260c3d60	smb: client: fix dfs link mount against w2k8 Customer reported that they couldn't mount their DFS link that was seen by the client as a DFS interlink -- special form of DFS link where its single target may point to a different DFS namespace -- and it turned out that it was just a regular DFS link where its referral header flags missed the StorageServers bit thus making the client think it couldn't tree connect to target directly without requiring further referrals. When the DFS link referral header flags misses the StoraServers bit and its target doesn't respond to any referrals, then tree connect to it. Fixes: `a1c0d00572` ("cifs: share dfs connections and supers") Cc: stable@vger.kernel.org Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-08-02 13:36:12 -05:00
Xiubo Li	e7e607bd00	ceph: defer stopping mdsc delayed_work Flushing the dirty buffer may take a long time if the cluster is overloaded or if there is network issue. So we should ping the MDSs periodically to keep alive, else the MDS will blocklist the kclient. Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/61843 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2023-08-02 00:13:02 +02:00
Gao Xiang	4da3c7183e	erofs: drop unnecessary WARN_ON() in erofs_kill_sb() Previously, .kill_sb() will be called only after fill_super fails. It will be changed [1]. Besides, checking for s_magic in erofs_kill_sb() is unnecessary from any point of view. Let's get rid of it now. [1] https://lore.kernel.org/r/20230731-flugbereit-wohnlage-78acdf95ab7e@brauner Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230801014737.28614-1-hsiangkao@linux.alibaba.com	2023-08-01 16:12:24 +08:00
Gao Xiang	94c43de735	erofs: fix wrong primary bvec selection on deduplicated extents When handling deduplicated compressed data, there can be multiple decompressed extents pointing to the same compressed data in one shot. In such cases, the bvecs which belong to the longest extent will be selected as the primary bvecs for real decompressors to decode and the other duplicated bvecs will be directly copied from the primary bvecs. Previously, only relative offsets of the longest extent were checked to decompress the primary bvecs. On rare occasions, it can be incorrect if there are several extents with the same start relative offset. As a result, some short bvecs could be selected for decompression and then cause data corruption. For example, as Shijie Sun reported off-list, considering the following extents of a file: 117: 903345.. 915250 \| 11905 : 385024.. 389120 \| 4096 ... 119: 919729.. 930323 \| 10594 : 385024.. 389120 \| 4096 ... 124: 968881.. 980786 \| 11905 : 385024.. 389120 \| 4096 The start relative offset is the same: 2225, but extent 119 (919729.. 930323) is shorter than the others. Let's restrict the bvec length in addition to the start offset if bvecs are not full. Reported-by: Shijie Sun <sunshijie@xiaomi.com> Fixes: `5c2a64252c` ("erofs: introduce partial-referenced pclusters") Tested-by Shijie Sun <sunshijie@xiaomi.com> Reviewed-by: Yue Hu <huyue2@coolpad.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230719065459.60083-1-hsiangkao@linux.alibaba.com	2023-08-01 16:12:17 +08:00
David Howells	101df45e7e	nfsd: Fix reading via splice nfsd_splice_actor() has a clause in its loop that chops up a compound page into individual pages such that if the same page is seen twice in a row, it is discarded the second time. This is a problem with the advent of shmem_splice_read() as that inserts zero_pages into the pipe in lieu of pages that aren't present in the pagecache. Fix this by assuming that the last page is being extended only if the currently stored length + starting offset is not currently on a page boundary. This can be tested by NFS-exporting a tmpfs filesystem on the test machine and truncating it to more than a page in size (eg. truncate -s 8192) and then reading it by NFS. The first page will be all zeros, but thereafter garbage will be read. Note: I wonder if we can ever get a situation now where we get a splice that gives us contiguous parts of a page in separate actor calls. As NFSD can only be splicing from a file (I think), there are only three sources of the page: copy_splice_read(), shmem_splice_read() and file_splice_read(). The first allocates pages for the data it reads, so the problem cannot occur; the second should never see a partial page; and the third waits for each page to become available before we're allowed to read from it. Fixes: `bd194b1871` ("shmem: Implement splice-read") Reported-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neilb@suse.de> cc: Hugh Dickins <hughd@google.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: linux-nfs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-07-30 18:07:12 -04:00
Linus Torvalds	d31e379291	Merge tag '6.5-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull smb client fixes from Steve French: "Four small SMB3 client fixes: - two reconnect fixes (to address the case where non-default iocharset gets incorrectly overridden at reconnect with the default charset) - fix for NTLMSSP_AUTH request setting a flag incorrectly) - Add missing check for invalid tlink (tree connection) in ioctl" * tag '6.5-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: add missing return value check for cifs_sb_tlink smb3: do not set NTLMSSP_VERSION flag for negotiate not auth request cifs: fix charset issue in reconnection fs/nls: make load_nls() take a const parameter	2023-07-29 20:49:13 -07:00
Sven Joachim	1f2190d6b7	arch//configs/defconfig: Replace AUTOFS4_FS by AUTOFS_FS Commit `a2225d931f` ("autofs: remove left-over autofs4 stubs") promised the removal of the fs/autofs/Kconfig fragment for AUTOFS4_FS within a couple of releases, but five years later this still has not happened yet, and AUTOFS4_FS is still enabled in 63 defconfigs. Get rid of it mechanically: git grep -l CONFIG_AUTOFS4_FS -- '*defconfig' \| xargs sed -i 's/AUTOFS4_FS/AUTOFS_FS/' Also just remove the AUTOFS4_FS config option stub. Anybody who hasn't regenerated their config file in the last five years will need to just get the new name right when they do. Signed-off-by: Sven Joachim <svenjoac@gmx.de> Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2023-07-29 14:08:22 -07:00
Linus Torvalds	122e7943b2	Merge tag 'mm-hotfixes-stable-2023-07-28-15-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "11 hotfixes. Five are cc:stable and the remainder address post-6.4 issues or aren't considered serious enough to justify backporting" * tag 'mm-hotfixes-stable-2023-07-28-15-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/memory-failure: fix hardware poison check in unpoison_memory() proc/vmcore: fix signedness bug in read_from_oldmem() mailmap: update remaining active codeaurora.org email addresses mm: lock VMA in dup_anon_vma() before setting ->anon_vma mm: fix memory ordering for mm_lock_seq and vm_lock_seq scripts/spelling.txt: remove 'thead' as a typo mm/pagewalk: fix EFI_PGT_DUMP of espfix area shmem: minor fixes to splice-read implementation tmpfs: fix Documentation of noswap and huge mount options Revert "um: Use swap() to make code cleaner" mm/damon/core-test: initialise context before test in damon_test_set_attrs()	2023-07-28 17:19:52 -07:00

1 2 3 4 5 ...

82873 Commits