Yonatan Nachum
809c9c3bd6
RDMA/efa: Limit EQs to available MSI-X vectors
...
When creating EQs we take into consideration the max number of EQs the
device reported it can support and the number of available CPUs. There
are situations where the number of EQs the device reported it can
support and the PCI configuration of MSI-X is different, take it in
account as well when creating EQs.
Also request at least 1 MSI-X vector for the management queue and allow
the kernel to return a number of vectors in a range between 1 and the
max supported MSI-X vectors according to the PCI config.
Reviewed-by: Michael Margolin <mrgolin@amazon.com >
Reviewed-by: Yonatan Goldhirsh <ygold@amazon.com >
Signed-off-by: Yonatan Nachum <ynachum@amazon.com >
Link: https://lore.kernel.org/r/20240131093403.18624-1-ynachum@amazon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org >
2024-01-31 14:32:26 +02:00
Alexey Dobriyan
a400073ce3
RDMA/mlx5: Delete unused mlx5_ib_copy_pas prototype
...
mlx5_ib_copy_pas() doesn't exist anymore.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com >
Link: https://lore.kernel.org/r/a2cb861e-d11e-4567-8a73-73763d1dc199@p183
Signed-off-by: Leon Romanovsky <leon@kernel.org >
2024-01-25 12:06:57 +02:00
Alexey Dobriyan
9bd5653f76
RDMA/cxgb4: Delete unused c4iw_ep_redirect prototype
...
c4iw_ep_redirect() doesn't exist.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com >
Link: https://lore.kernel.org/r/a67881e7-469b-43d9-9973-ad7579eb2c4b@p183
Signed-off-by: Leon Romanovsky <leon@kernel.org >
2024-01-25 12:05:30 +02:00
Konstantin Taranov
2a31c5a7e0
RDMA/mana_ib: Introduce mana_ib_install_cq_cb helper function
...
Use a helper function to install callbacks to CQs.
This patch removes code repetition.
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com >
Link: https://lore.kernel.org/r/1705965781-3235-4-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org >
2024-01-25 12:03:14 +02:00
Konstantin Taranov
3b73eb3a4a
RDMA/mana_ib: Introduce mana_ib_get_netdev helper function
...
Use a helper function to access netdevs using a port number.
This patch removes code repetitions as well as removes the need
to explicitly use gdma_dev, which was error-prone.
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com >
Link: https://lore.kernel.org/r/1705965781-3235-3-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org >
2024-01-25 12:03:07 +02:00
Konstantin Taranov
71c8cbfcdc
RDMA/mana_ib: Introduce mdev_to_gc helper function
...
Use a helper function to access gdma_context from mana_ib_dev.
This patch removes code repetitions as well as removes the need
to explicitly use gdma_dev, which was error-prone.
Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com >
Link: https://lore.kernel.org/r/1705965781-3235-2-git-send-email-kotaranov@linux.microsoft.com
Signed-off-by: Leon Romanovsky <leon@kernel.org >
2024-01-25 12:03:00 +02:00
Yunsheng Lin
c00743cbf2
RDMA/hns: Simplify 'struct hns_roce_hem' allocation
...
'struct hns_roce_hem' is used to refer to the last level of
dma buffer managed by the hw, pointed by a single BA(base
address) in the previous level of BT(base table), so the dma
buffer in 'struct hns_roce_hem' must be contiguous.
Right now the size of dma buffer in 'struct hns_roce_hem' is
decided by mhop->buf_chunk_size in get_hem_table_config(),
which ensure the mhop->buf_chunk_size is power of two of
PAGE_SIZE, so there will be only one contiguous dma buffer
allocated in hns_roce_alloc_hem(), which means hem->chunk_list
and chunk->mem for linking multi dma buffers is unnecessary.
This patch removes the hem->chunk_list and chunk->mem and other
related macro and function accordingly.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com >
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com >
Link: https://lore.kernel.org/r/20240113085935.2838701-7-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org >
2024-01-25 11:54:38 +02:00
Chengchang Tang
2eb999b3d4
RDMA/hns: Support adaptive PBL hopnum
...
In the current implementation, a fixed addressing level is used for
PBL. But in fact, the necessary addressing level is related to page
size and the size of MR.
This patch calculates the addressing level according to page size
and the size of MR, and uses the addressing level to configure the
PBL.
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com >
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com >
Link: https://lore.kernel.org/r/20240113085935.2838701-6-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org >
2024-01-25 11:54:38 +02:00
Chengchang Tang
0ff6c9779a
RDMA/hns: Support flexible umem page size
...
In the current implementation, a fixed page size is used to
configure the umem PBL, which is not flexible enough and is
not conducive to the performance of the HW. Find a best page
size to get better performance.
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com >
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com >
Link: https://lore.kernel.org/r/20240113085935.2838701-5-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org >
2024-01-25 11:54:38 +02:00
Chengchang Tang
6afc859518
RDMA/hns: Alloc MTR memory before alloc_mtt()
...
MTR memory allocation do not depend on allocation of mtt.
This patch moves the allocation of mtr before mtt in preparation for
the following optimization.
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com >
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com >
Link: https://lore.kernel.org/r/20240113085935.2838701-4-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org >
2024-01-25 11:54:38 +02:00
Chengchang Tang
4f5731b1fb
RDMA/hns: Refactor mtr_init_buf_cfg()
...
page_shift and page_cnt is only used in mtr_map_bufs(). And these
parameter could be calculated indepedently.
Strip the computation of page_shift and page_cnt from mtr_init_buf_cfg(),
reducing the number of parameters of it. This helps reducing coupling
between mtr_init_buf_cfg() and mtr_map_bufs().
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com >
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com >
Link: https://lore.kernel.org/r/20240113085935.2838701-3-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org >
2024-01-25 11:54:38 +02:00
Chengchang Tang
a4ca341080
RDMA/hns: Refactor mtr find
...
hns_roce_mtr_find() is a collection of multiple functions, and the
return value is also difficult to understand, which is not conducive
to modification and maintenance.
Separate the function of obtaining MTR root BA from this function.
And some adjustments has been made to improve readability.
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com >
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com >
Link: https://lore.kernel.org/r/20240113085935.2838701-2-huangjunxian6@hisilicon.com
Signed-off-by: Leon Romanovsky <leon@kernel.org >
2024-01-25 11:54:38 +02:00
Christian Heusel
64854534ff
RDMA/ipoib: Print symbolic error name instead of error code
...
Utilize the %pe print specifier to get the symbolic error name as a
string (i.e "-ENOMEM") in the log message instead of the error code to
increase its readability.
This change was suggested in
https://lore.kernel.org/all/92972476-0b1f-4d0a-9951-af3fc8bc6e65@suswa.mountain/
Signed-off-by: Christian Heusel <christian@heusel.eu >
Link: https://lore.kernel.org/r/20240111141311.987098-1-christian@heusel.eu
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org >
Signed-off-by: Leon Romanovsky <leon@kernel.org >
2024-01-25 11:50:58 +02:00
Li Zhijian
190a7affee
RDMA/rxe: Remove rxe_info from rxe_set_mtu
...
commit 9ac01f434a ("RDMA/rxe: Extend dbg log messages to err and info")
newly added this info. But it did only show null device when
the rdma_rxe is being loaded because dev_name(rxe->ib_dev->dev)
has not yet been assigned at the moment:
"(null): rxe_set_mtu: Set mtu to 1024"
Remove it to silent this message, check the mtu from it backend link
instead if needed.
CC: Bob Pearson <rpearsonhpe@gmail.com >
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com >
Link: https://lore.kernel.org/r/20240109083253.3629967-2-lizhijian@fujitsu.com
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev >
Signed-off-by: Leon Romanovsky <leon@kernel.org >
2024-01-25 11:49:50 +02:00
Li Zhijian
6482718086
RDMA/rxe: Improve newline in printing messages
...
Previously rxe_{dbg,info,err}() macros are appended built-in newline,
but some users will add redundant newline sometimes. So remove the
built-in newline for these macros.
In terms of rxe_{dbg,info,err}_xxx() macros, because they don't have
built-in newline, append newline when using them.
CC: Daisuke Matsuda <matsuda-daisuke@fujitsu.com >
Reviewed-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com >
Acked-by: Zhu Yanjun <zyjzyj2000@gmail.com >
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com >
Link: https://lore.kernel.org/r/20240109083253.3629967-1-lizhijian@fujitsu.com
Signed-off-by: Leon Romanovsky <leon@kernel.org >
2024-01-25 11:49:50 +02:00
Randy Dunlap
4c09f7405e
IB/hfi1: fix spellos and kernel-doc
...
Fix spelling mistakes as reported by codespell.
Fix kernel-doc warnings.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org >
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com >
Cc: Jason Gunthorpe <jgg@nvidia.com >
Cc: Leon Romanovsky <leonro@nvidia.com >
Cc: linux-rdma@vger.kernel.org
Link: https://lore.kernel.org/r/20240111042754.17530-1-rdunlap@infradead.org
Signed-off-by: Leon Romanovsky <leon@kernel.org >
2024-01-25 11:48:50 +02:00
Linus Torvalds
6613476e22
Linux 6.8-rc1
v6.8-rc1
2024-01-21 14:11:32 -08:00
Linus Torvalds
35a4474b5c
Merge tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs
...
Pull more bcachefs updates from Kent Overstreet:
"Some fixes, Some refactoring, some minor features:
- Assorted prep work for disk space accounting rewrite
- BTREE_TRIGGER_ATOMIC: after combining our trigger callbacks, this
makes our trigger context more explicit
- A few fixes to avoid excessive transaction restarts on
multithreaded workloads: fstests (in addition to ktest tests) are
now checking slowpath counters, and that's shaking out a few bugs
- Assorted tracepoint improvements
- Starting to break up bcachefs_format.h and move on disk types so
they're with the code they belong to; this will make room to start
documenting the on disk format better.
- A few minor fixes"
* tag 'bcachefs-2024-01-21' of https://evilpiepirate.org/git/bcachefs: (46 commits)
bcachefs: Improve inode_to_text()
bcachefs: logged_ops_format.h
bcachefs: reflink_format.h
bcachefs; extents_format.h
bcachefs: ec_format.h
bcachefs: subvolume_format.h
bcachefs: snapshot_format.h
bcachefs: alloc_background_format.h
bcachefs: xattr_format.h
bcachefs: dirent_format.h
bcachefs: inode_format.h
bcachefs; quota_format.h
bcachefs: sb-counters_format.h
bcachefs: counters.c -> sb-counters.c
bcachefs: comment bch_subvolume
bcachefs: bch_snapshot::btime
bcachefs: add missing __GFP_NOWARN
bcachefs: opts->compression can now also be applied in the background
bcachefs: Prep work for variable size btree node buffers
bcachefs: grab s_umount only if snapshotting
...
2024-01-21 14:01:12 -08:00
Linus Torvalds
4fbbed7872
Merge tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
...
Pull timer updates from Thomas Gleixner:
"Updates for time and clocksources:
- A fix for the idle and iowait time accounting vs CPU hotplug.
The time is reset on CPU hotplug which makes the accumulated
systemwide time jump backwards.
- Assorted fixes and improvements for clocksource/event drivers"
* tag 'timers-core-2024-01-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick-sched: Fix idle and iowait sleeptime accounting vs CPU hotplug
clocksource/drivers/ep93xx: Fix error handling during probe
clocksource/drivers/cadence-ttc: Fix some kernel-doc warnings
clocksource/drivers/timer-ti-dm: Fix make W=n kerneldoc warnings
clocksource/timer-riscv: Add riscv_clock_shutdown callback
dt-bindings: timer: Add StarFive JH8100 clint
dt-bindings: timer: thead,c900-aclint-mtimer: separate mtime and mtimecmp regs
2024-01-21 11:14:40 -08:00
Linus Torvalds
7b297a5cc9
Merge tag 'powerpc-6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
...
Pull powerpc fixes from Aneesh Kumar:
- Increase default stack size to 32KB for Book3S
Thanks to Michael Ellerman.
* tag 'powerpc-6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: Increase default stack size to 32KB
2024-01-21 11:04:29 -08:00
Kent Overstreet
249f441f83
bcachefs: Improve inode_to_text()
...
Add line breaks - inode_to_text() is now much easier to read.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:11 -05:00
Kent Overstreet
d826cc57c5
bcachefs: logged_ops_format.h
...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:11 -05:00
Kent Overstreet
8d52ba60c4
bcachefs: reflink_format.h
...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:11 -05:00
Kent Overstreet
b2fa1b633b
bcachefs; extents_format.h
...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:11 -05:00
Kent Overstreet
0560eb9abf
bcachefs: ec_format.h
...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:11 -05:00
Kent Overstreet
c6c4ff6507
bcachefs: subvolume_format.h
...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:11 -05:00
Kent Overstreet
8fed323b14
bcachefs: snapshot_format.h
...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:10 -05:00
Kent Overstreet
d455179fce
bcachefs: alloc_background_format.h
...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:10 -05:00
Kent Overstreet
72e0801049
bcachefs: xattr_format.h
...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:10 -05:00
Kent Overstreet
7ffc4daa5f
bcachefs: dirent_format.h
...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:10 -05:00
Kent Overstreet
b36425da71
bcachefs: inode_format.h
...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:10 -05:00
Kent Overstreet
82de6207fb
bcachefs; quota_format.h
...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:10 -05:00
Kent Overstreet
43314801a4
bcachefs: sb-counters_format.h
...
bcachefs_format.h has gotten too big; let's do some organizing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:10 -05:00
Kent Overstreet
3a58dfbc46
bcachefs: counters.c -> sb-counters.c
...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:10 -05:00
Kent Overstreet
12207f49ef
bcachefs: comment bch_subvolume
...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:10 -05:00
Kent Overstreet
d32088f2f2
bcachefs: bch_snapshot::btime
...
Add a field to bch_snapshot for creation time; this will be important
when we start exposing the snapshot tree to userspace.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:10 -05:00
Kent Overstreet
7be0208fc9
bcachefs: add missing __GFP_NOWARN
...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:10 -05:00
Kent Overstreet
d7e77f53e9
bcachefs: opts->compression can now also be applied in the background
...
The "apply this compression method in the background" paths now use the
compression option if background_compression is not set; this means that
setting or changing the compression option will cause existing data to
be compressed accordingly in the background.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:10 -05:00
Kent Overstreet
ec4edd7b9d
bcachefs: Prep work for variable size btree node buffers
...
bcachefs btree nodes are big - typically 256k - and btree roots are
pinned in memory. As we're now up to 18 btrees, we now have significant
memory overhead in mostly empty btree roots.
And in the future we're going to start enforcing that certain btree node
boundaries exist, to solve lock contention issues - analagous to XFS's
AGIs.
Thus, we need to start allocating smaller btree node buffers when we
can. This patch changes code that refers to the filesystem constant
c->opts.btree_node_size to refer to the btree node buffer size -
btree_buf_bytes() - where appropriate.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:10 -05:00
Su Yue
2acc59dd88
bcachefs: grab s_umount only if snapshotting
...
When I was testing mongodb over bcachefs with compression,
there is a lockdep warning when snapshotting mongodb data volume.
$ cat test.sh
prog=bcachefs
$prog subvolume create /mnt/data
$prog subvolume create /mnt/data/snapshots
while true;do
$prog subvolume snapshot /mnt/data /mnt/data/snapshots/$(date +%s)
sleep 1s
done
$ cat /etc/mongodb.conf
systemLog:
destination: file
logAppend: true
path: /mnt/data/mongod.log
storage:
dbPath: /mnt/data/
lockdep reports:
[ 3437.452330] ======================================================
[ 3437.452750] WARNING: possible circular locking dependency detected
[ 3437.453168] 6.7.0-rc7-custom+ #85 Tainted: G E
[ 3437.453562] ------------------------------------------------------
[ 3437.453981] bcachefs/35533 is trying to acquire lock:
[ 3437.454325] ffffa0a02b2b1418 (sb_writers#10){.+.+}-{0:0}, at: filename_create+0x62/0x190
[ 3437.454875]
but task is already holding lock:
[ 3437.455268] ffffa0a02b2b10e0 (&type->s_umount_key#48){.+.+}-{3:3}, at: bch2_fs_file_ioctl+0x232/0xc90 [bcachefs]
[ 3437.456009]
which lock already depends on the new lock.
[ 3437.456553]
the existing dependency chain (in reverse order) is:
[ 3437.457054]
-> #3 (&type->s_umount_key#48){.+.+}-{3:3}:
[ 3437.457507] down_read+0x3e/0x170
[ 3437.457772] bch2_fs_file_ioctl+0x232/0xc90 [bcachefs]
[ 3437.458206] __x64_sys_ioctl+0x93/0xd0
[ 3437.458498] do_syscall_64+0x42/0xf0
[ 3437.458779] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 3437.459155]
-> #2 (&c->snapshot_create_lock){++++}-{3:3}:
[ 3437.459615] down_read+0x3e/0x170
[ 3437.459878] bch2_truncate+0x82/0x110 [bcachefs]
[ 3437.460276] bchfs_truncate+0x254/0x3c0 [bcachefs]
[ 3437.460686] notify_change+0x1f1/0x4a0
[ 3437.461283] do_truncate+0x7f/0xd0
[ 3437.461555] path_openat+0xa57/0xce0
[ 3437.461836] do_filp_open+0xb4/0x160
[ 3437.462116] do_sys_openat2+0x91/0xc0
[ 3437.462402] __x64_sys_openat+0x53/0xa0
[ 3437.462701] do_syscall_64+0x42/0xf0
[ 3437.462982] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 3437.463359]
-> #1 (&sb->s_type->i_mutex_key#15){+.+.}-{3:3}:
[ 3437.463843] down_write+0x3b/0xc0
[ 3437.464223] bch2_write_iter+0x5b/0xcc0 [bcachefs]
[ 3437.464493] vfs_write+0x21b/0x4c0
[ 3437.464653] ksys_write+0x69/0xf0
[ 3437.464839] do_syscall_64+0x42/0xf0
[ 3437.465009] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 3437.465231]
-> #0 (sb_writers#10){.+.+}-{0:0}:
[ 3437.465471] __lock_acquire+0x1455/0x21b0
[ 3437.465656] lock_acquire+0xc6/0x2b0
[ 3437.465822] mnt_want_write+0x46/0x1a0
[ 3437.465996] filename_create+0x62/0x190
[ 3437.466175] user_path_create+0x2d/0x50
[ 3437.466352] bch2_fs_file_ioctl+0x2ec/0xc90 [bcachefs]
[ 3437.466617] __x64_sys_ioctl+0x93/0xd0
[ 3437.466791] do_syscall_64+0x42/0xf0
[ 3437.466957] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 3437.467180]
other info that might help us debug this:
[ 3437.469670] 2 locks held by bcachefs/35533:
other info that might help us debug this:
[ 3437.467507] Chain exists of:
sb_writers#10 --> &c->snapshot_create_lock --> &type->s_umount_key#48
[ 3437.467979] Possible unsafe locking scenario:
[ 3437.468223] CPU0 CPU1
[ 3437.468405] ---- ----
[ 3437.468585] rlock(&type->s_umount_key#48);
[ 3437.468758] lock(&c->snapshot_create_lock);
[ 3437.469030] lock(&type->s_umount_key#48);
[ 3437.469291] rlock(sb_writers#10);
[ 3437.469434]
*** DEADLOCK ***
[ 3437.469670] 2 locks held by bcachefs/35533:
[ 3437.469838] #0 : ffffa0a02ce00a88 (&c->snapshot_create_lock){++++}-{3:3}, at: bch2_fs_file_ioctl+0x1e3/0xc90 [bcachefs]
[ 3437.470294] #1 : ffffa0a02b2b10e0 (&type->s_umount_key#48){.+.+}-{3:3}, at: bch2_fs_file_ioctl+0x232/0xc90 [bcachefs]
[ 3437.470744]
stack backtrace:
[ 3437.470922] CPU: 7 PID: 35533 Comm: bcachefs Kdump: loaded Tainted: G E 6.7.0-rc7-custom+ #85
[ 3437.471313] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 3437.471694] Call Trace:
[ 3437.471795] <TASK>
[ 3437.471884] dump_stack_lvl+0x57/0x90
[ 3437.472035] check_noncircular+0x132/0x150
[ 3437.472202] __lock_acquire+0x1455/0x21b0
[ 3437.472369] lock_acquire+0xc6/0x2b0
[ 3437.472518] ? filename_create+0x62/0x190
[ 3437.472683] ? lock_is_held_type+0x97/0x110
[ 3437.472856] mnt_want_write+0x46/0x1a0
[ 3437.473025] ? filename_create+0x62/0x190
[ 3437.473204] filename_create+0x62/0x190
[ 3437.473380] user_path_create+0x2d/0x50
[ 3437.473555] bch2_fs_file_ioctl+0x2ec/0xc90 [bcachefs]
[ 3437.473819] ? lock_acquire+0xc6/0x2b0
[ 3437.474002] ? __fget_files+0x2a/0x190
[ 3437.474195] ? __fget_files+0xbc/0x190
[ 3437.474380] ? lock_release+0xc5/0x270
[ 3437.474567] ? __x64_sys_ioctl+0x93/0xd0
[ 3437.474764] ? __pfx_bch2_fs_file_ioctl+0x10/0x10 [bcachefs]
[ 3437.475090] __x64_sys_ioctl+0x93/0xd0
[ 3437.475277] do_syscall_64+0x42/0xf0
[ 3437.475454] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 3437.475691] RIP: 0033:0x7f2743c313af
======================================================
In __bch2_ioctl_subvolume_create(), we grab s_umount unconditionally
and unlock it at the end of the function. There is a comment
"why do we need this lock?" about the lock coming from
commit 42d237320e ("bcachefs: Snapshot creation, deletion")
The reason is that __bch2_ioctl_subvolume_create() calls
sync_inodes_sb() which enforce locked s_umount to writeback all dirty
nodes before doing snapshot works.
Fix it by read locking s_umount for snapshotting only and unlocking
s_umount after sync_inodes_sb().
Signed-off-by: Su Yue <glass.su@suse.com >
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:10 -05:00
Su Yue
369acf97d6
bcachefs: kvfree bch_fs::snapshots in bch2_fs_snapshots_exit
...
bch_fs::snapshots is allocated by kvzalloc in __snapshot_t_mut.
It should be freed by kvfree not kfree.
Or umount will triger:
[ 406.829178 ] BUG: unable to handle page fault for address: ffffe7b487148008
[ 406.830676 ] #PF: supervisor read access in kernel mode
[ 406.831643 ] #PF: error_code(0x0000) - not-present page
[ 406.832487 ] PGD 0 P4D 0
[ 406.832898 ] Oops: 0000 [#1 ] PREEMPT SMP PTI
[ 406.833512 ] CPU: 2 PID: 1754 Comm: umount Kdump: loaded Tainted: G OE 6.7.0-rc7-custom+ #90
[ 406.834746 ] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 406.835796 ] RIP: 0010:kfree+0x62/0x140
[ 406.836197 ] Code: 80 48 01 d8 0f 82 e9 00 00 00 48 c7 c2 00 00 00 80 48 2b 15 78 9f 1f 01 48 01 d0 48 c1 e8 0c 48 c1 e0 06 48 03 05 56 9f 1f 01 <48> 8b 50 08 48 89 c7 f6 c2 01 0f 85 b0 00 00 00 66 90 48 8b 07 f6
[ 406.837810 ] RSP: 0018:ffffb9d641607e48 EFLAGS: 00010286
[ 406.838213 ] RAX: ffffe7b487148000 RBX: ffffb9d645200000 RCX: ffffb9d641607dc4
[ 406.838738 ] RDX: 000065bb00000000 RSI: ffffffffc0d88b84 RDI: ffffb9d645200000
[ 406.839217 ] RBP: ffff9a4625d00068 R08: 0000000000000001 R09: 0000000000000001
[ 406.839650 ] R10: 0000000000000001 R11: 000000000000001f R12: ffff9a4625d4da80
[ 406.840055 ] R13: ffff9a4625d00000 R14: ffffffffc0e2eb20 R15: 0000000000000000
[ 406.840451 ] FS: 00007f0a264ffb80(0000) GS:ffff9a4e2d500000(0000) knlGS:0000000000000000
[ 406.840851 ] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 406.841125 ] CR2: ffffe7b487148008 CR3: 000000018c4d2000 CR4: 00000000000006f0
[ 406.841464 ] Call Trace:
[ 406.841583 ] <TASK>
[ 406.841682 ] ? __die+0x1f/0x70
[ 406.841828 ] ? page_fault_oops+0x159/0x470
[ 406.842014 ] ? fixup_exception+0x22/0x310
[ 406.842198 ] ? exc_page_fault+0x1ed/0x200
[ 406.842382 ] ? asm_exc_page_fault+0x22/0x30
[ 406.842574 ] ? bch2_fs_release+0x54/0x280 [bcachefs]
[ 406.842842 ] ? kfree+0x62/0x140
[ 406.842988 ] ? kfree+0x104/0x140
[ 406.843138 ] bch2_fs_release+0x54/0x280 [bcachefs]
[ 406.843390 ] kobject_put+0xb7/0x170
[ 406.843552 ] deactivate_locked_super+0x2f/0xa0
[ 406.843756 ] cleanup_mnt+0xba/0x150
[ 406.843917 ] task_work_run+0x59/0xa0
[ 406.844083 ] exit_to_user_mode_prepare+0x197/0x1a0
[ 406.844302 ] syscall_exit_to_user_mode+0x16/0x40
[ 406.844510 ] do_syscall_64+0x4e/0xf0
[ 406.844675 ] entry_SYSCALL_64_after_hwframe+0x6e/0x76
[ 406.844907 ] RIP: 0033:0x7f0a2664e4fb
Signed-off-by: Su Yue <glass.su@suse.com >
Reviewed-by: Brian Foster <bfoster@redhat.com >
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:10 -05:00
Kent Overstreet
00fff4dd58
bcachefs: bios must be 512 byte algined
...
Fixes: 023f9ac9f7 bcachefs: Delete dio read alignment check
Reported-by: Brian Foster <bfoster@redhat.com >
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:10 -05:00
Colin Ian King
aead3428e8
bcachefs: remove redundant variable tmp
...
The variable tmp is being assigned a value but it isn't being
read afterwards. The assignment is redundant and so tmp can be
removed.
Cleans up clang scan build warning:
warning: Although the value stored to 'ret' is used in the enclosing
expression, the value is never actually read from 'ret'
[deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com >
Reviewed-by: Brian Foster <bfoster@redhat.com >
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:10 -05:00
Kent Overstreet
b97de45365
bcachefs: Improve trace_trans_restart_relock
...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:10 -05:00
Kent Overstreet
46bf2e9cc7
bcachefs: Fix excess transaction restarts in __bchfs_fallocate()
...
drop_locks_do() should not be used in a fastpath without first trying
the do in nonblocking mode - the unlock and relock will cause excessive
transaction restarts and potentially livelocking with other threads that
are contending for the same locks.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:10 -05:00
Kent Overstreet
1a5039041b
bcachefs: extents_to_bp_state
...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:10 -05:00
Kent Overstreet
ba96d36ca5
bcachefs: bkey_and_val_eq()
...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:09 -05:00
Kent Overstreet
e6a2566f7a
bcachefs: Better journal tracepoints
...
Factor out bch2_journal_bufs_to_text(), and use it in the
journal_entry_full() tracepoint; when we can't get a journal reservation
we need to know the outstanding journal entry sizes to know if the
problem is due to excessive flushing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:09 -05:00
Kent Overstreet
4ae016607b
bcachefs: Print size of superblock with space allocated
...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:09 -05:00
Kent Overstreet
a6548c8b5e
bcachefs: Avoid flushing the journal in the discard path
...
When issuing discards, we may need to flush the journal if there's too
many buckets that can't be discarded until a journal flush.
But the heuristic was bad; we should be comparing the number of buckets
that need to flushes against the number of free buckets, not the number
of buckets we saw.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev >
2024-01-21 13:27:09 -05:00