linux

Author	SHA1	Message	Date
Thomas Gleixner	9d1c58c800	genirq/msi: Optionally use dev->fwnode for device domain To support wire to MSI domains via the MSI infrastructure it is required to use the firmware node of the device which implements this for creating the MSI domain. Otherwise the existing firmware match mechanisms to find the correct irqdomain for a wired interrupt which is connected to a wire to MSI bridge would fail. This cannot be used for the general case because not all devices provide firmware nodes and all regular per device MSI domains are directly associated to the device and have not be searched for. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Anup Patel <apatel@ventanamicro.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240127161753.114685-11-apatel@ventanamicro.com	2024-02-15 17:55:41 +01:00
Thomas Gleixner	3095cc0d5b	genirq/msi: Split msi_domain_alloc_irq_at() In preparation for providing a special allocation function for wired interrupts which are connected to a wire to MSI bridge, split the inner workings of msi_domain_alloc_irq_at() out into a helper function so the code can be shared. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Anup Patel <apatel@ventanamicro.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240127161753.114685-9-apatel@ventanamicro.com	2024-02-15 17:55:40 +01:00
Thomas Gleixner	9c78c1a85c	genirq/msi: Provide optional translation op irq_create_fwspec_mapping() requires translation of the firmware spec to a hardware interrupt number and the trigger type information. Wired interrupts which are connected to a wire to MSI bridge, like MBIGEN are allocated that way. So far MBIGEN provides a regular irqdomain which then hooks backwards into the MSI infrastructure. That's an unholy mess and will be replaced with per device MSI domains which are regular MSI domains. Interrupts on MSI domains are not supported by irq_create_fwspec_mapping(), but for making the wire to MSI bridges sane it makes sense to provide a special allocation/free interface in the MSI infrastructure. That avoids the backdoors into the core MSI allocation code and just shares all the regular MSI infrastructure. Provide an optional translation callback in msi_domain_ops which can be utilized by these wire to MSI bridges. No other MSI domain should provide a translation callback. The default translation callback of the MSI irqdomains will warn when it is invoked on a non-prepared MSI domain. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Anup Patel <apatel@ventanamicro.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240127161753.114685-8-apatel@ventanamicro.com	2024-02-15 17:55:40 +01:00
Thomas Gleixner	de1ff306dc	genirq/irqdomain: Remove the param count restriction from select() Now that the GIC-v3 callback can handle invocation with a fwspec parameter count of 0 lift the restriction in the core code and invoke select() unconditionally when the domain provides it. Preparatory change for per device MSI domains. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Anup Patel <apatel@ventanamicro.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240127161753.114685-3-apatel@ventanamicro.com	2024-02-15 17:55:39 +01:00
Thorsten Blum	9b6326354c	tracing/synthetic: Fix trace_string() return value Fix trace_string() by assigning the string length to the return variable which got lost in commit `ddeea494a1` ("tracing/synthetic: Use union instead of casts") and caused trace_string() to always return 0. Link: https://lore.kernel.org/linux-trace-kernel/20240214220555.711598-1-thorsten.blum@toblux.com Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: `ddeea494a1` ("tracing/synthetic: Use union instead of casts") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-02-15 11:40:01 -05:00
Andrea Parri	cd9b29014d	membarrier: riscv: Provide core serializing command RISC-V uses xRET instructions on return from interrupt and to go back to user-space; the xRET instruction is not core serializing. Use FENCE.I for providing core serialization as follows: - by calling sync_core_before_usermode() on return from interrupt (cf. ipi_sync_core()), - via switch_mm() and sync_core_before_usermode() (respectively, for uthread->uthread and kthread->uthread transitions) before returning to user-space. On RISC-V, the serialization in switch_mm() is activated by resetting the icache_stale_mask of the mm at prepare_sync_core_cmd(). Suggested-by: Palmer Dabbelt <palmer@dabbelt.com> Signed-off-by: Andrea Parri <parri.andrea@gmail.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/r/20240131144936.29190-5-parri.andrea@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>	2024-02-15 08:04:14 -08:00
Andrea Parri	4ff4c745a1	locking: Introduce prepare_sync_core_cmd() Introduce an architecture function that architectures can use to set up ("prepare") SYNC_CORE commands. The function will be used by RISC-V to update its "deferred icache- flush" data structures (icache_stale_mask). Architectures defining prepare_sync_core_cmd() static inline need to select ARCH_HAS_PREPARE_SYNC_CORE_CMD. Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Andrea Parri <parri.andrea@gmail.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/r/20240131144936.29190-4-parri.andrea@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>	2024-02-15 08:04:13 -08:00
Andrea Parri	a14d11a0f5	membarrier: Create Documentation/scheduler/membarrier.rst To gather the architecture requirements of the "private/global expedited" membarrier commands. The file will be expanded to integrate further information about the membarrier syscall (as needed/desired in the future). While at it, amend some related inline comments in the membarrier codebase. Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Andrea Parri <parri.andrea@gmail.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/r/20240131144936.29190-3-parri.andrea@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>	2024-02-15 08:04:12 -08:00
Andrea Parri	d6cfd1770f	membarrier: riscv: Add full memory barrier in switch_mm() The membarrier system call requires a full memory barrier after storing to rq->curr, before going back to user-space. The barrier is only needed when switching between processes: the barrier is implied by mmdrop() when switching from kernel to userspace, and it's not needed when switching from userspace to kernel. Rely on the feature/mechanism ARCH_HAS_MEMBARRIER_CALLBACKS and on the primitive membarrier_arch_switch_mm(), already adopted by the PowerPC architecture, to insert the required barrier. Fixes: `fab957c11e` ("RISC-V: Atomic and Locking Code") Signed-off-by: Andrea Parri <parri.andrea@gmail.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/r/20240131144936.29190-2-parri.andrea@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>	2024-02-15 08:04:11 -08:00
Andrii Nakryiko	a4561f5afe	bpf: Use O(log(N)) binary search to find line info record Real-world BPF applications keep growing in size. Medium-sized production application can easily have 50K+ verified instructions, and its line info section in .BTF.ext has more than 3K entries. When verifier emits log with log_level>=1, it annotates assembly code with matched original C source code. Currently it uses linear search over line info records to find a match. As complexity of BPF applications grows, this O(K * N) approach scales poorly. So, let's instead of linear O(N) search for line info record use faster equivalent O(log(N)) binary search algorithm. It's not a plain binary search, as we don't look for exact match. It's an upper bound search variant, looking for rightmost line info record that starts at or before given insn_off. Some unscientific measurements were done before and after this change. They were done in VM and fluctuate a bit, but overall the speed up is undeniable. BASELINE ======== File Program Duration (us) Insns -------------------------------- ---------------- ------------- ------ katran.bpf.o balancer_ingress 2497130 343552 pyperf600.bpf.linked3.o on_event 12389611 627288 strobelight_pyperf_libbpf.o on_py_event 387399 52445 -------------------------------- ---------------- ------------- ------ BINARY SEARCH ============= File Program Duration (us) Insns -------------------------------- ---------------- ------------- ------ katran.bpf.o balancer_ingress 2339312 343552 pyperf600.bpf.linked3.o on_event 5602203 627288 strobelight_pyperf_libbpf.o on_py_event 294761 52445 -------------------------------- ---------------- ------------- ------ While Katran's speed up is pretty modest (about 105ms, or 6%), for production pyperf BPF program (on_py_event) it's much greater already, going from 387ms down to 295ms (23% improvement). Looking at BPF selftests's biggest pyperf example, we can see even more dramatic improvement, shaving more than 50% of time, going from 12.3s down to 5.6s. Different amount of improvement is the function of overall amount of BPF assembly instructions in .bpf.o files (which contributes to how much line info records there will be and thus, on average, how much time linear search will take), among other things: $ llvm-objdump -d katran.bpf.o \| wc -l 3863 $ llvm-objdump -d strobelight_pyperf_libbpf.o \| wc -l 6997 $ llvm-objdump -d pyperf600.bpf.linked3.o \| wc -l 87854 Granted, this only applies to debugging cases (e.g., using veristat, or failing verification in production), but seems worth doing to improve overall developer experience anyways. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240214002311.2197116-1-andrii@kernel.org	2024-02-14 23:53:42 +01:00
Tejun Heo	2f34d7337d	workqueue: Fix queue_work_on() with BH workqueues When queue_work_on() is used to queue a BH work item on a remote CPU, the work item is queued on that CPU but kick_pool() raises softirq on the local CPU. This leads to stalls as the work item won't be executed until something else on the remote CPU schedules a BH work item or tasklet locally. Fix it by bouncing raising softirq to the target CPU using per-cpu irq_work. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `4cb1ef6460` ("workqueue: Implement BH workqueues to eventually replace tasklets")	2024-02-14 08:33:55 -10:00
Steven Rostedt (Google)	2394ac4145	tracing: Inform kmemleak of saved_cmdlines allocation The allocation of the struct saved_cmdlines_buffer structure changed from: s = kmalloc(sizeof(s), GFP_KERNEL); s->saved_cmdlines = kmalloc_array(TASK_COMM_LEN, val, GFP_KERNEL); to: orig_size = sizeof(s) + val * TASK_COMM_LEN; order = get_order(orig_size); size = 1 << (order + PAGE_SHIFT); page = alloc_pages(GFP_KERNEL, order); if (!page) return NULL; s = page_address(page); memset(s, 0, sizeof(*s)); s->saved_cmdlines = kmalloc_array(TASK_COMM_LEN, val, GFP_KERNEL); Where that s->saved_cmdlines allocation looks to be a dangling allocation to kmemleak. That's because kmemleak only keeps track of kmalloc() allocations. For allocations that use page_alloc() directly, the kmemleak needs to be explicitly informed about it. Add kmemleak_alloc() and kmemleak_free() around the page allocation so that it doesn't give the following false positive: unreferenced object 0xffff8881010c8000 (size 32760): comm "swapper", pid 0, jiffies 4294667296 hex dump (first 32 bytes): ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ................ backtrace (crc ae6ec1b9): [<ffffffff86722405>] kmemleak_alloc+0x45/0x80 [<ffffffff8414028d>] __kmalloc_large_node+0x10d/0x190 [<ffffffff84146ab1>] __kmalloc+0x3b1/0x4c0 [<ffffffff83ed7103>] allocate_cmdlines_buffer+0x113/0x230 [<ffffffff88649c34>] tracer_alloc_buffers.isra.0+0x124/0x460 [<ffffffff8864a174>] early_trace_init+0x14/0xa0 [<ffffffff885dd5ae>] start_kernel+0x12e/0x3c0 [<ffffffff885f5758>] x86_64_start_reservations+0x18/0x30 [<ffffffff885f582b>] x86_64_start_kernel+0x7b/0x80 [<ffffffff83a001c3>] secondary_startup_64_no_verify+0x15e/0x16b Link: https://lore.kernel.org/linux-trace-kernel/87r0hfnr9r.fsf@kernel.org/ Link: https://lore.kernel.org/linux-trace-kernel/20240214112046.09a322d6@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Fixes: `44dc5c41b5` ("tracing: Fix wasted memory in saved_cmdlines logic") Reported-by: Kalle Valo <kvalo@kernel.org> Tested-by: Kalle Valo <kvalo@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-02-14 12:36:34 -05:00
Onkarnath	c90e3ecc91	rcu/sync: remove un-used rcu_sync_enter_start function With commit '6a010a49b63a ("cgroup: Make !percpu threadgroup_rwsem operations optional")' usage of rcu_sync_enter_start is removed. So this function can also be removed. In the words of Oleg Nesterov: __rcu_sync_enter(wait => false) is a better alternative if someone needs rcu_sync_enter_start() again. Link: https://lore.kernel.org/all/20220725121208.GB28662@redhat.com/ Signed-off-by: Onkarnath <onkarnath.1@samsung.com> Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 08:00:57 -08:00
Paul E. McKenney	fd2a749d3f	rcutorture: Suppress rtort_pipe_count warnings until after stalls Currently, if rcu_torture_writer() sees fewer than ten grace periods having elapsed during a call to stutter_wait() that actually waited, the rtort_pipe_count warning is emitted. This has worked well for a long time. Except that the rcutorture TREE07 scenario now does a short-term 14-second RCU CPU stall, which can most definitely case false-positive rtort_pipe_count warnings. This commit therefore changes rcu_torture_writer() to compute the full expected holdoff and stall duration, and to refuse to report any rtort_pipe_count warnings until after all stalls have completed. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 08:00:57 -08:00
Joel Fernandes (Google)	67050837ec	srcu: Improve comments about acceleration leak The comments added in commit 1ef990c4b36b ("srcu: No need to advance/accelerate if no callback enqueued") are a bit confusing. The comments are describing a scenario for code that was moved and is no longer the way it was (snapshot after advancing). Improve the code comments to reflect this and also document why acceleration can never fail. Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 08:00:57 -08:00
Qais Yousef	7f66f099de	rcu: Provide a boot time parameter to control lazy RCU To allow more flexible arrangements while still provide a single kernel for distros, provide a boot time parameter to enable/disable lazy RCU. Specify: rcutree.enable_rcu_lazy=[y\|1\|n\|0] Which also requires rcu_nocbs=all at boot time to enable/disable lazy RCU. To disable it by default at build time when CONFIG_RCU_LAZY=y, the new CONFIG_RCU_LAZY_DEFAULT_OFF can be used. Signed-off-by: Qais Yousef (Google) <qyousef@layalina.io> Tested-by: Andrea Righi <andrea.righi@canonical.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 08:00:57 -08:00
Frederic Weisbecker	499d7e7e83	rcu: Rename jiffies_till_flush to jiffies_lazy_flush The variable name jiffies_till_flush is too generic and therefore: * It may shadow a global variable * It doesn't tell on what it operates Make the name more precise, along with the related APIs. Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 08:00:57 -08:00
Paul E. McKenney	3b239b308e	context_tracking: Fix kerneldoc headers for __ct_user_{enter,exit}() Document the "state" parameter of both of these functions. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202312041922.YZCcEPYD-lkp@intel.com/ Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Cc: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:53:50 -08:00
Frederic Weisbecker	23da2ad64d	rcu/exp: Remove rcu_par_gp_wq TREE04 running on short iterations can produce writer stalls of the following kind: ??? Writer stall state RTWS_EXP_SYNC(4) g3968 f0x0 ->state 0x2 cpu 0 task:rcu_torture_wri state:D stack:14568 pid:83 ppid:2 flags:0x00004000 Call Trace: <TASK> __schedule+0x2de/0x850 ? trace_event_raw_event_rcu_exp_funnel_lock+0x6d/0xb0 schedule+0x4f/0x90 synchronize_rcu_expedited+0x430/0x670 ? __pfx_autoremove_wake_function+0x10/0x10 ? __pfx_synchronize_rcu_expedited+0x10/0x10 do_rtws_sync.constprop.0+0xde/0x230 rcu_torture_writer+0x4b4/0xcd0 ? __pfx_rcu_torture_writer+0x10/0x10 kthread+0xc7/0xf0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2f/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK> Waiting for an expedited grace period and polling for an expedited grace period both are operations that internally rely on the same workqueue performing necessary asynchronous work. However, a dependency chain is involved between those two operations, as depicted below: ====== CPU 0 ======= ====== CPU 1 ======= synchronize_rcu_expedited() exp_funnel_lock() mutex_lock(&rcu_state.exp_mutex); start_poll_synchronize_rcu_expedited queue_work(rcu_gp_wq, &rnp->exp_poll_wq); synchronize_rcu_expedited_queue_work() queue_work(rcu_gp_wq, &rew->rew_work); wait_event() // A, wait for &rew->rew_work completion mutex_unlock() // B //======> switch to kworker sync_rcu_do_polled_gp() { synchronize_rcu_expedited() exp_funnel_lock() mutex_lock(&rcu_state.exp_mutex); // C, wait B .... } // D Since workqueues are usually implemented on top of several kworkers handling the queue concurrently, the above situation wouldn't deadlock most of the time because A then doesn't depend on D. But in case of memory stress, a single kworker may end up handling alone all the works in a serialized way. In that case the above layout becomes a problem because A then waits for D, closing a circular dependency: A -> D -> C -> B -> A This however only happens when CONFIG_RCU_EXP_KTHREAD=n. Indeed synchronize_rcu_expedited() is otherwise implemented on top of a kthread worker while polling still relies on rcu_gp_wq workqueue, breaking the above circular dependency chain. Fix this with making expedited grace period to always rely on kthread worker. The workqueue based implementation is essentially a duplicate anyway now that the per-node initialization is performed by per-node kthread workers. Meanwhile the CONFIG_RCU_EXP_KTHREAD switch is still kept around to manage the scheduler policy of these kthread workers. Reported-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Reported-by: Thomas Gleixner <tglx@linutronix.de> Suggested-by: Joel Fernandes <joel@joelfernandes.org> Suggested-by: Paul E. McKenney <paulmck@kernel.org> Suggested-by: Neeraj upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:51:36 -08:00
Frederic Weisbecker	b67cffcbbf	rcu/exp: Handle parallel exp gp kworkers affinity Affine the parallel expedited gp kworkers to their respective RCU node in order to make them close to the cache their are playing with. This reuses the boost kthreads machinery that probe into CPU hotplug operations such that the kthreads become/stay affine to their respective node as soon/long as they contain online CPUs. Otherwise and if the current CPU going down was the last online on the leaf node, the related kthread is affine to the housekeeping CPUs. In the long run, this affinity VS CPU hotplug operation game should probably be implemented at the generic kthread level. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> [boqun: s/* rcu_boost_task/*rcu_boost_task as reported by checkpatch] Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:51:36 -08:00
Frederic Weisbecker	8e5e621566	rcu/exp: Make parallel exp gp kworker per rcu node When CONFIG_RCU_EXP_KTHREAD=n, the expedited grace period per node initialization is performed in parallel via workqueues (one work per node). However in CONFIG_RCU_EXP_KTHREAD=y, this per node initialization is performed by a single kworker serializing each node initialization (one work for all nodes). The second part is certainly less scalable and efficient beyond a single leaf node. To improve this, expand this single kworker into per-node kworkers. This new layout is eventually intended to remove the workqueues based implementation since it will essentially now become duplicate code. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:51:36 -08:00
Frederic Weisbecker	c19e5d3b49	rcu/exp: Move expedited kthread worker creation functions above rcutree_prepare_cpu() The expedited kthread worker performing the per node initialization is going to be split into per node kthreads. As such, the future per node kthread creation will need to be called from CPU hotplug callbacks instead of an initcall, right beside the per node boost kthread creation. To prepare for that, move the kthread worker creation above rcutree_prepare_cpu() as a first step to make the review smoother for the upcoming modifications. No intended functional change. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:51:36 -08:00
Frederic Weisbecker	7836b27060	rcu: s/boost_kthread_mutex/kthread_mutex This mutex is currently protecting per node boost kthreads creation and affinity setting across CPU hotplug operations. Since the expedited kworkers will soon be split per node as well, they will be subject to the same concurrency constraints against hotplug. Therefore their creation and affinity tuning operations will be grouped with those of boost kthreads and then rely on the same mutex. To prepare for that, generalize its name. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:51:36 -08:00
Frederic Weisbecker	e7539ffc9a	rcu/exp: Handle RCU expedited grace period kworker allocation failure Just like is done for the kworker performing nodes initialization, gracefully handle the possible allocation failure of the RCU expedited grace period main kworker. While at it perform a rename of the related checking functions to better reflect the expedited specifics. Reviewed-by: Kalesh Singh <kaleshsingh@google.com> Fixes: `9621fbee44` ("rcu: Move expedited grace period (GP) work to RT kthread_worker") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:51:36 -08:00
Frederic Weisbecker	a636c5e6f8	rcu/exp: Fix RCU expedited parallel grace period kworker allocation failure recovery Under CONFIG_RCU_EXP_KTHREAD=y, the nodes initialization for expedited grace periods is queued to a kworker. However if the allocation of that kworker failed, the nodes initialization is performed synchronously by the caller instead. Now the check for kworker initialization failure relies on the kworker pointer to be NULL while its value might actually encapsulate an allocation failure error. Make sure to handle this case. Reviewed-by: Kalesh Singh <kaleshsingh@google.com> Fixes: `9621fbee44` ("rcu: Move expedited grace period (GP) work to RT kthread_worker") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:51:36 -08:00
Frederic Weisbecker	a7e4074dcc	rcu/exp: Remove full barrier upon main thread wakeup When an expedited grace period is ending, care must be taken so that all the quiescent states propagated up to the root are correctly ordered against the wake up of the main expedited grace period workqueue. This ordering is already carried through the root rnp locking augmented by an smp_mb__after_unlock_lock() barrier. Therefore the explicit smp_mb() placed before the wake up is not needed and can be removed. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:51:35 -08:00
Zqiang	f3c4c00784	rcu/nocb: Check rdp_gp->nocb_timer in __call_rcu_nocb_wake() Currently, only rdp_gp->nocb_timer is used, for nocb_timer of no-rdp_gp structure, the timer_pending() is always return false, this commit therefore need to check rdp_gp->nocb_timer in __call_rcu_nocb_wake(). Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:50:45 -08:00
Zqiang	dda98810b5	rcu/nocb: Fix WARN_ON_ONCE() in the rcu_nocb_bypass_lock() For the kernels built with CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y and CONFIG_RCU_LAZY=y, the following scenarios will trigger WARN_ON_ONCE() in the rcu_nocb_bypass_lock() and rcu_nocb_wait_contended() functions: CPU2 CPU11 kthread rcu_nocb_cb_kthread ksys_write rcu_do_batch vfs_write rcu_torture_timer_cb proc_sys_write __kmem_cache_free proc_sys_call_handler kmemleak_free drop_caches_sysctl_handler delete_object_full drop_slab __delete_object shrink_slab put_object lazy_rcu_shrink_scan call_rcu rcu_nocb_flush_bypass __call_rcu_commn rcu_nocb_bypass_lock raw_spin_trylock(&rdp->nocb_bypass_lock) fail atomic_inc(&rdp->nocb_lock_contended); rcu_nocb_wait_contended WARN_ON_ONCE(smp_processor_id() != rdp->cpu); WARN_ON_ONCE(atomic_read(&rdp->nocb_lock_contended)) \| \|_ _ _ _ _ _ _ _ _ _same rdp and rdp->cpu != 11_ _ _ _ _ _ _ _ _ __\| Reproduce this bug with "echo 3 > /proc/sys/vm/drop_caches". This commit therefore uses rcu_nocb_try_flush_bypass() instead of rcu_nocb_flush_bypass() in lazy_rcu_shrink_scan(). If the nocb_bypass queue is being flushed, then rcu_nocb_try_flush_bypass will return directly. Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:50:45 -08:00
Frederic Weisbecker	afd4e69647	rcu/nocb: Re-arrange call_rcu() NOCB specific code Currently the call_rcu() function interleaves NOCB and !NOCB enqueue code in a complicated way such that: * The bypass enqueue code may or may not have enqueued and may or may not have locked the ->nocb_lock. Everything that follows is in a Schrödinger locking state for the unwary reviewer's eyes. * The was_alldone is always set but only used in NOCB related code. * The NOCB wake up is distantly related to the locking hopefully performed by the bypass enqueue code that did not enqueue on the bypass list. Unconfuse the whole and gather NOCB and !NOCB specific enqueue code to their own functions. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:50:45 -08:00
Frederic Weisbecker	b913c3fe68	rcu/nocb: Make IRQs disablement symmetric Currently IRQs are disabled on call_rcu() and then depending on the context: * If the CPU is in nocb mode: - If the callback is enqueued in the bypass list, IRQs are re-enabled implictly by rcu_nocb_try_bypass() - If the callback is enqueued in the normal list, IRQs are re-enabled implicitly by __call_rcu_nocb_wake() * If the CPU is NOT in nocb mode, IRQs are reenabled explicitly from call_rcu() This makes the code a bit hard to follow, especially as it interleaves with nocb locking. To make the IRQ flags coverage clearer and also in order to prepare for moving all the nocb enqueue code to its own function, always re-enable the IRQ flags explicitly from call_rcu(). Reviewed-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:50:45 -08:00
Frederic Weisbecker	1e8e6951a5	rcu/nocb: Remove needless full barrier after callback advancing A full barrier is issued from nocb_gp_wait() upon callbacks advancing to order grace period completion with callbacks execution. However these two events are already ordered by the smp_mb__after_unlock_lock() barrier within the call to raw_spin_lock_rcu_node() that is necessary for callbacks advancing to happen. The following litmus test shows the kind of guarantee that this barrier provides: C smp_mb__after_unlock_lock {} // rcu_gp_cleanup() P0(spinlock_t rnp_lock, int gpnum) { // Grace period cleanup increase gp sequence number spin_lock(rnp_lock); WRITE_ONCE(gpnum, 1); spin_unlock(rnp_lock); } // nocb_gp_wait() P1(spinlock_t rnp_lock, spinlock_t nocb_lock, int gpnum, int cb_ready) { int r1; // Call rcu_advance_cbs() from nocb_gp_wait() spin_lock(nocb_lock); spin_lock(rnp_lock); smp_mb__after_unlock_lock(); r1 = READ_ONCE(gpnum); WRITE_ONCE(cb_ready, 1); spin_unlock(rnp_lock); spin_unlock(nocb_lock); } // nocb_cb_wait() P2(spinlock_t nocb_lock, int cb_ready, int cb_executed) { int r2; // rcu_do_batch() -> rcu_segcblist_extract_done_cbs() spin_lock(nocb_lock); r2 = READ_ONCE(cb_ready); spin_unlock(nocb_lock); // Actual callback execution WRITE_ONCE(cb_executed, 1); } P3(int cb_executed, int gpnum) { int r3; WRITE_ONCE(cb_executed, 2); smp_mb(); r3 = READ_ONCE(gpnum); } exists (1:r1=1 /\ 2:r2=1 /\ cb_executed=2 /\ 3:r3=0) (* Bad outcome. *) Here the bad outcome only occurs if the smp_mb__after_unlock_lock() is removed. This barrier orders the grace period completion against callbacks advancing and even later callbacks invocation, thanks to the opportunistic propagation via the ->nocb_lock to nocb_cb_wait(). Therefore the smp_mb() placed after callbacks advancing can be safely removed. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:50:45 -08:00
Frederic Weisbecker	ca16265aaf	rcu/nocb: Remove needless LOAD-ACQUIRE The LOAD-ACQUIRE access performed on rdp->nocb_cb_sleep advertizes ordering callback execution against grace period completion. However this is contradicted by the following: * This LOAD-ACQUIRE doesn't pair with anything. The only counterpart barrier that can be found is the smp_mb() placed after callbacks advancing in nocb_gp_wait(). However the barrier is placed _after_ ->nocb_cb_sleep write. * Callbacks can be concurrently advanced between the LOAD-ACQUIRE on ->nocb_cb_sleep and the call to rcu_segcblist_extract_done_cbs() in rcu_do_batch(), making any ordering based on ->nocb_cb_sleep broken. * Both rcu_segcblist_extract_done_cbs() and rcu_advance_cbs() are called under the nocb_lock, the latter hereby providing already the desired ACQUIRE semantics. Therefore it is safe to access ->nocb_cb_sleep with a simple compiler barrier. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>	2024-02-14 07:50:44 -08:00
Ingo Molnar	4589f199eb	Merge branch 'x86/bugs' into x86/core, to pick up pending changes before dependent patches Merge in pending alternatives patching infrastructure changes, before applying more patches. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2024-02-14 10:49:37 +01:00
Andrii Nakryiko	7cc13adbd0	bpf: emit source code file name and line number in verifier log As BPF applications grow in size and complexity and are separated into multiple .bpf.c files that are statically linked together, it becomes harder and harder to match verifier's BPF assembly level output to original C code. While often annotated C source code is unique enough to be able to identify the file it belongs to, quite often this is actually problematic as parts of source code can be quite generic. Long story short, it is very useful to see source code file name and line number information along with the original C code. Verifier already knows this information, we just need to output it. This patch extends verifier log with file name and line number information, emitted next to original (presumably C) source code, annotating BPF assembly output, like so: ; <original C code> @ <filename>.bpf.c:<line> If file name has directory names in it, they are stripped away. This should be fine in practice as file names tend to be pretty unique with C code anyways, and keeping log size smaller is always good. In practice this might look something like below, where some code is coming from application files, while others are from libbpf's usdt.bpf.h header file: ; if (STROBEMETA_READ( @ strobemeta_probe.bpf.c:534 5592: (79) r1 = (u64 )(r10 -56) ; R1_w=mem_or_null(id=1589,sz=7680) R10=fp0 5593: (7b) (u64 )(r10 -56) = r1 ; R1_w=mem_or_null(id=1589,sz=7680) R10=fp0 5594: (79) r3 = (u64 )(r10 -8) ; R3_w=scalar() R10=fp0 fp-8=mmmmmmmm ... 170: (71) r1 = (u8 )(r8 +15) ; frame1: R1_w=scalar(...) R8_w=map_value(map=__bpf_usdt_spec,ks=4,vs=208) 171: (67) r1 <<= 56 ; frame1: R1_w=scalar(...) 172: (c7) r1 s>>= 56 ; frame1: R1_w=scalar(smin=smin32=-128,smax=smax32=127) ; val <<= arg_spec->arg_bitshift; @ usdt.bpf.h:183 173: (67) r1 <<= 32 ; frame1: R1_w=scalar(...) 174: (77) r1 >>= 32 ; frame1: R1_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) 175: (79) r2 = (u64 )(r10 -8) ; frame1: R2_w=scalar() R10=fp0 fp-8=mmmmmmmm 176: (6f) r2 <<= r1 ; frame1: R1_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R2_w=scalar() 177: (7b) (u64 )(r10 -8) = r2 ; frame1: R2_w=scalar(id=61) R10=fp0 fp-8_w=scalar(id=61) ; if (arg_spec->arg_signed) @ usdt.bpf.h:184 178: (bf) r3 = r2 ; frame1: R2_w=scalar(id=61) R3_w=scalar(id=61) 179: (7f) r3 >>= r1 ; frame1: R1_w=scalar(smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) R3_w=scalar() ; if (arg_spec->arg_signed) @ usdt.bpf.h:184 180: (71) r4 = (u8 )(r8 +14) 181: safe log_fixup tests needed a minor adjustment as verifier log output increased a bit and that test is quite sensitive to such changes. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240212235944.2816107-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-02-13 18:51:32 -08:00
Andrii Nakryiko	879bbe7aa4	bpf: don't infer PTR_TO_CTX for programs with unnamed context type For program types that don't have named context type name (e.g., BPF iterator programs or tracepoint programs), ctx_tname will be a non-NULL empty string. For such programs it shouldn't be possible to have PTR_TO_CTX argument for global subprogs based on type name alone. arg:ctx tag is the only way to have PTR_TO_CTX passed into global subprog for such program types. Fix this loophole, which currently would assume PTR_TO_CTX whenever user uses a pointer to anonymous struct as an argument to their global subprogs. This happens in practice with the following (quite common, in practice) approach: typedef struct { /* anonymous / int x; } my_type_t; int my_subprog(my_type_t arg) { ... } User's intent is to have PTR_TO_MEM argument for `arg`, but verifier will complain about expecting PTR_TO_CTX. This fix also closes unintended s390x-specific KPROBE handling of PTR_TO_CTX case. Selftest change is necessary to accommodate this. Fixes: `91cc1a9974` ("bpf: Annotate context types") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240212233221.2575350-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-02-13 18:46:47 -08:00
Andrii Nakryiko	824c58fb10	bpf: handle bpf_user_pt_regs_t typedef explicitly for PTR_TO_CTX global arg Expected canonical argument type for global function arguments representing PTR_TO_CTX is `bpf_user_pt_regs_t *ctx`. This currently works on s390x by accident because kernel resolves such typedef to underlying struct (which is anonymous on s390x), and erroneously accepting it as expected context type. We are fixing this problem next, which would break s390x arch, so we need to handle `bpf_user_pt_regs_t` case explicitly for KPROBE programs. Fixes: `91cc1a9974` ("bpf: Annotate context types") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240212233221.2575350-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-02-13 18:46:47 -08:00
Andrii Nakryiko	fb5b86cfd4	bpf: simplify btf_get_prog_ctx_type() into btf_is_prog_ctx_type() Return result of btf_get_prog_ctx_type() is never used and callers only check NULL vs non-NULL case to determine if given type matches expected PTR_TO_CTX type. So rename function to `btf_is_prog_ctx_type()` and return a simple true/false. We'll use this simpler interface to handle kprobe program type's special typedef case in the next patch. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240212233221.2575350-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-02-13 18:46:46 -08:00
Oliver Crumrine	32e18e7688	bpf: remove check in __cgroup_bpf_run_filter_skb Originally, this patch removed a redundant check in BPF_CGROUP_RUN_PROG_INET_EGRESS, as the check was already being done in the function it called, __cgroup_bpf_run_filter_skb. For v2, it was reccomended that I remove the check from __cgroup_bpf_run_filter_skb, and add the checks to the other macro that calls that function, BPF_CGROUP_RUN_PROG_INET_INGRESS. To sum it up, checking that the socket exists and that it is a full socket is now part of both macros BPF_CGROUP_RUN_PROG_INET_EGRESS and BPF_CGROUP_RUN_PROG_INET_INGRESS, and it is no longer part of the function they call, __cgroup_bpf_run_filter_skb. v3->v4: Fixed weird merge conflict. v2->v3: Sent to bpf-next instead of generic patch v1->v2: Addressed feedback about where check should be removed. Signed-off-by: Oliver Crumrine <ozlinuxc@gmail.com> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/7lv62yiyvmj5a7eozv2iznglpkydkdfancgmbhiptrgvgan5sy@3fl3onchgdz3 Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2024-02-13 15:41:17 -08:00
Kui-Feng Lee	1611603537	bpf: Create argument information for nullable arguments. Collect argument information from the type information of stub functions to mark arguments of BPF struct_ops programs with PTR_MAYBE_NULL if they are nullable. A nullable argument is annotated by suffixing "__nullable" at the argument name of stub function. For nullable arguments, this patch sets a struct bpf_ctx_arg_aux to label their reg_type with PTR_TO_BTF_ID \| PTR_TRUSTED \| PTR_MAYBE_NULL. This makes the verifier to check programs and ensure that they properly check the pointer. The programs should check if the pointer is null before accessing the pointed memory. The implementer of a struct_ops type should annotate the arguments that can be null. The implementer should define a stub function (empty) as a placeholder for each defined operator. The name of a stub function should be in the pattern "<st_op_type>__<operator name>". For example, for test_maybe_null of struct bpf_testmod_ops, it's stub function name should be "bpf_testmod_ops__test_maybe_null". You mark an argument nullable by suffixing the argument name with "__nullable" at the stub function. Since we already has stub functions for kCFI, we just reuse these stub functions with the naming convention mentioned earlier. These stub functions with the naming convention is only required if there are nullable arguments to annotate. For functions having not nullable arguments, stub functions are not necessary for the purpose of this patch. This patch will prepare a list of struct bpf_ctx_arg_aux, aka arg_info, for each member field of a struct_ops type. "arg_info" will be assigned to "prog->aux->ctx_arg_info" of BPF struct_ops programs in check_struct_ops_btf_id() so that it can be used by btf_ctx_access() later to set reg_type properly for the verifier. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240209023750.1153905-4-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2024-02-13 15:16:44 -08:00
Kui-Feng Lee	6115a0aeef	bpf: Move __kfunc_param_match_suffix() to btf.c. Move __kfunc_param_match_suffix() to btf.c and rename it as btf_param_match_suffix(). It can be reused by bpf_struct_ops later. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240209023750.1153905-3-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2024-02-13 15:16:44 -08:00
Kui-Feng Lee	77c0208e19	bpf: add btf pointer to struct bpf_ctx_arg_aux. Enable the providers to use types defined in a module instead of in the kernel (btf_vmlinux). Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/20240209023750.1153905-2-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2024-02-13 15:16:44 -08:00
Hari Bathini	11f522256e	bpf: Fix warning for bpf_cpumask in verifier Compiling with CONFIG_BPF_SYSCALL & !CONFIG_BPF_JIT throws the below warning: "WARN: resolve_btfids: unresolved symbol bpf_cpumask" Fix it by adding the appropriate #ifdef. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Stanislav Fomichev <sdf@google.com> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/bpf/20240208100115.602172-1-hbathini@linux.ibm.com	2024-02-13 11:13:39 -08:00
Yonghong Song	178c54666f	bpf: Mark bpf_spin_{lock,unlock}() helpers with notrace correctly Currently tracing is supposed not to allow for bpf_spin_{lock,unlock}() helper calls. This is to prevent deadlock for the following cases: - there is a prog (prog-A) calling bpf_spin_{lock,unlock}(). - there is a tracing program (prog-B), e.g., fentry, attached to bpf_spin_lock() and/or bpf_spin_unlock(). - prog-B calls bpf_spin_{lock,unlock}(). For such a case, when prog-A calls bpf_spin_{lock,unlock}(), a deadlock will happen. The related source codes are below in kernel/bpf/helpers.c: notrace BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock , lock) notrace BPF_CALL_1(bpf_spin_unlock, struct bpf_spin_lock , lock) notrace is supposed to prevent fentry prog from attaching to bpf_spin_{lock,unlock}(). But actually this is not the case and fentry prog can successfully attached to bpf_spin_lock(). Siddharth Chintamaneni reported the issue in [1]. The following is the macro definition for above BPF_CALL_1: #define BPF_CALL_x(x, name, ...) \ static __always_inline \ u64 ____##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)); \ typedef u64 (*btf_##name)(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)); \ u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__)); \ u64 name(__BPF_REG(x, __BPF_DECL_REGS, __BPF_N, __VA_ARGS__)) \ { \ return ((btf_##name)____##name)(__BPF_MAP(x,__BPF_CAST,__BPF_N,__VA_ARGS__));\ } \ static __always_inline \ u64 ____##name(__BPF_MAP(x, __BPF_DECL_ARGS, __BPF_V, __VA_ARGS__)) #define BPF_CALL_1(name, ...) BPF_CALL_x(1, name, __VA_ARGS__) The notrace attribute is actually applied to the static always_inline function ____bpf_spin_{lock,unlock}(). The actual callback function bpf_spin_{lock,unlock}() is not marked with notrace, hence allowing fentry prog to attach to two helpers, and this may cause the above mentioned deadlock. Siddharth Chintamaneni actually has a reproducer in [2]. To fix the issue, a new macro NOTRACE_BPF_CALL_1 is introduced which will add notrace attribute to the original function instead of the hidden always_inline function and this fixed the problem. [1] https://lore.kernel.org/bpf/CAE5sdEigPnoGrzN8WU7Tx-h-iFuMZgW06qp0KHWtpvoXxf1OAQ@mail.gmail.com/ [2] https://lore.kernel.org/bpf/CAE5sdEg6yUc_Jz50AnUXEEUh6O73yQ1Z6NV2srJnef0ZrQkZew@mail.gmail.com/ Fixes: `d83525ca62` ("bpf: introduce bpf_spin_lock") Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240207070102.335167-1-yonghong.song@linux.dev	2024-02-13 11:11:25 -08:00
Daniel Xu	5b268d1ebc	bpf: Have bpf_rdonly_cast() take a const pointer Since `20d59ee551` ("libbpf: add bpf_core_cast() macro"), libbpf is now exporting a const arg version of bpf_rdonly_cast(). This causes the following conflicting type error when generating kfunc prototypes from BTF: In file included from skeleton/pid_iter.bpf.c:5: /home/dxu/dev/linux/tools/bpf/bpftool/bootstrap/libbpf/include/bpf/bpf_core_read.h:297:14: error: conflicting types for 'bpf_rdonly_cast' extern void bpf_rdonly_cast(const void obj__ign, __u32 btf_id__k) __ksym __weak; ^ ./vmlinux.h:135625:14: note: previous declaration is here extern void bpf_rdonly_cast(void obj__ign, u32 btf_id__k) __weak __ksym; This is b/c the kernel defines bpf_rdonly_cast() with non-const arg. Since const arg is more permissive and thus backwards compatible, we change the kernel definition as well to avoid conflicting type errors. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/bpf/dfd3823f11ffd2d4c838e961d61ec9ae8a646773.1707080349.git.dxu@dxuuu.xyz	2024-02-13 11:05:26 -08:00
Sven Schnelle	a6eaa24f1c	tracing: Use ring_buffer_record_is_set_on() in tracer_tracing_is_on() tracer_tracing_is_on() checks whether record_disabled is not zero. This checks both the record_disabled counter and the RB_BUFFER_OFF flag. Reading the source it looks like this function should only check for the RB_BUFFER_OFF flag. Therefore use ring_buffer_record_is_set_on(). This fixes spurious fails in the 'test for function traceon/off triggers' test from the ftrace testsuite when the system is under load. Link: https://lore.kernel.org/linux-trace-kernel/20240205065340.2848065-1-svens@linux.ibm.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Tested-By: Mete Durlu <meted@linux.ibm.com> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-02-13 12:04:17 -05:00
Petr Pavlu	bdbddb109c	tracing: Fix HAVE_DYNAMIC_FTRACE_WITH_REGS ifdef Commit `a8b9cf62ad` ("ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default") attempted to fix an issue with direct trampolines on x86, see its description for details. However, it wrongly referenced the HAVE_DYNAMIC_FTRACE_WITH_REGS config option and the problem is still present. Add the missing "CONFIG_" prefix for the logic to work as intended. Link: https://lore.kernel.org/linux-trace-kernel/20240213132434.22537-1-petr.pavlu@suse.com Fixes: `a8b9cf62ad` ("ftrace: Fix DIRECT_CALLS to use SAVE_REGS by default") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-02-13 11:56:49 -05:00
Marco Elver	68bc61c26c	bpf: Allow compiler to inline most of bpf_local_storage_lookup() In various performance profiles of kernels with BPF programs attached, bpf_local_storage_lookup() appears as a significant portion of CPU cycles spent. To enable the compiler generate more optimal code, turn bpf_local_storage_lookup() into a static inline function, where only the cache insertion code path is outlined Notably, outlining cache insertion helps avoid bloating callers by duplicating setting up calls to raw_spin_{lock,unlock}_irqsave() (on architectures which do not inline spin_lock/unlock, such as x86), which would cause the compiler produce worse code by deciding to outline otherwise inlinable functions. The call overhead is neutral, because we make 2 calls either way: either calling raw_spin_lock_irqsave() and raw_spin_unlock_irqsave(); or call __bpf_local_storage_insert_cache(), which calls raw_spin_lock_irqsave(), followed by a tail-call to raw_spin_unlock_irqsave() where the compiler can perform TCO and (in optimized uninstrumented builds) turns it into a plain jump. The call to __bpf_local_storage_insert_cache() can be elided entirely if cacheit_lockit is a false constant expression. Based on results from './benchs/run_bench_local_storage.sh' (21 trials, reboot between each trial; x86 defconfig + BPF, clang 16) this produces improvements in throughput and latency in the majority of cases, with an average (geomean) improvement of 8%: +---- Hashmap Control -------------------- \| \| + num keys: 10 \| : <before> \| <after> \| +-+ hashmap (control) sequential get +----------------------+---------------------- \| +- hits throughput \| 14.789 M ops/s \| 14.745 M ops/s ( ~ ) \| +- hits latency \| 67.679 ns/op \| 67.879 ns/op ( ~ ) \| +- important_hits throughput \| 14.789 M ops/s \| 14.745 M ops/s ( ~ ) \| \| + num keys: 1000 \| : <before> \| <after> \| +-+ hashmap (control) sequential get +----------------------+---------------------- \| +- hits throughput \| 12.233 M ops/s \| 12.170 M ops/s ( ~ ) \| +- hits latency \| 81.754 ns/op \| 82.185 ns/op ( ~ ) \| +- important_hits throughput \| 12.233 M ops/s \| 12.170 M ops/s ( ~ ) \| \| + num keys: 10000 \| : <before> \| <after> \| +-+ hashmap (control) sequential get +----------------------+---------------------- \| +- hits throughput \| 7.220 M ops/s \| 7.204 M ops/s ( ~ ) \| +- hits latency \| 138.522 ns/op \| 138.842 ns/op ( ~ ) \| +- important_hits throughput \| 7.220 M ops/s \| 7.204 M ops/s ( ~ ) \| \| + num keys: 100000 \| : <before> \| <after> \| +-+ hashmap (control) sequential get +----------------------+---------------------- \| +- hits throughput \| 5.061 M ops/s \| 5.165 M ops/s (+2.1%) \| +- hits latency \| 198.483 ns/op \| 194.270 ns/op (-2.1%) \| +- important_hits throughput \| 5.061 M ops/s \| 5.165 M ops/s (+2.1%) \| \| + num keys: 4194304 \| : <before> \| <after> \| +-+ hashmap (control) sequential get +----------------------+---------------------- \| +- hits throughput \| 2.864 M ops/s \| 2.882 M ops/s ( ~ ) \| +- hits latency \| 365.220 ns/op \| 361.418 ns/op (-1.0%) \| +- important_hits throughput \| 2.864 M ops/s \| 2.882 M ops/s ( ~ ) \| +---- Local Storage ---------------------- \| \| + num_maps: 1 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 33.005 M ops/s \| 39.068 M ops/s (+18.4%) \| +- hits latency \| 30.300 ns/op \| 25.598 ns/op (-15.5%) \| +- important_hits throughput \| 33.005 M ops/s \| 39.068 M ops/s (+18.4%) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 37.151 M ops/s \| 44.926 M ops/s (+20.9%) \| +- hits latency \| 26.919 ns/op \| 22.259 ns/op (-17.3%) \| +- important_hits throughput \| 37.151 M ops/s \| 44.926 M ops/s (+20.9%) \| \| + num_maps: 10 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 32.288 M ops/s \| 38.099 M ops/s (+18.0%) \| +- hits latency \| 30.972 ns/op \| 26.248 ns/op (-15.3%) \| +- important_hits throughput \| 3.229 M ops/s \| 3.810 M ops/s (+18.0%) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 34.473 M ops/s \| 41.145 M ops/s (+19.4%) \| +- hits latency \| 29.010 ns/op \| 24.307 ns/op (-16.2%) \| +- important_hits throughput \| 12.312 M ops/s \| 14.695 M ops/s (+19.4%) \| \| + num_maps: 16 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 32.524 M ops/s \| 38.341 M ops/s (+17.9%) \| +- hits latency \| 30.748 ns/op \| 26.083 ns/op (-15.2%) \| +- important_hits throughput \| 2.033 M ops/s \| 2.396 M ops/s (+17.9%) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 34.575 M ops/s \| 41.338 M ops/s (+19.6%) \| +- hits latency \| 28.925 ns/op \| 24.193 ns/op (-16.4%) \| +- important_hits throughput \| 11.001 M ops/s \| 13.153 M ops/s (+19.6%) \| \| + num_maps: 17 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 28.861 M ops/s \| 32.756 M ops/s (+13.5%) \| +- hits latency \| 34.649 ns/op \| 30.530 ns/op (-11.9%) \| +- important_hits throughput \| 1.700 M ops/s \| 1.929 M ops/s (+13.5%) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 31.529 M ops/s \| 36.110 M ops/s (+14.5%) \| +- hits latency \| 31.719 ns/op \| 27.697 ns/op (-12.7%) \| +- important_hits throughput \| 9.598 M ops/s \| 10.993 M ops/s (+14.5%) \| \| + num_maps: 24 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 18.602 M ops/s \| 19.937 M ops/s (+7.2%) \| +- hits latency \| 53.767 ns/op \| 50.166 ns/op (-6.7%) \| +- important_hits throughput \| 0.776 M ops/s \| 0.831 M ops/s (+7.2%) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 21.718 M ops/s \| 23.332 M ops/s (+7.4%) \| +- hits latency \| 46.047 ns/op \| 42.865 ns/op (-6.9%) \| +- important_hits throughput \| 6.110 M ops/s \| 6.564 M ops/s (+7.4%) \| \| + num_maps: 32 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 14.118 M ops/s \| 14.626 M ops/s (+3.6%) \| +- hits latency \| 70.856 ns/op \| 68.381 ns/op (-3.5%) \| +- important_hits throughput \| 0.442 M ops/s \| 0.458 M ops/s (+3.6%) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 17.111 M ops/s \| 17.906 M ops/s (+4.6%) \| +- hits latency \| 58.451 ns/op \| 55.865 ns/op (-4.4%) \| +- important_hits throughput \| 4.776 M ops/s \| 4.998 M ops/s (+4.6%) \| \| + num_maps: 100 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 5.281 M ops/s \| 5.528 M ops/s (+4.7%) \| +- hits latency \| 192.398 ns/op \| 183.059 ns/op (-4.9%) \| +- important_hits throughput \| 0.053 M ops/s \| 0.055 M ops/s (+4.9%) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 6.265 M ops/s \| 6.498 M ops/s (+3.7%) \| +- hits latency \| 161.436 ns/op \| 152.877 ns/op (-5.3%) \| +- important_hits throughput \| 1.636 M ops/s \| 1.697 M ops/s (+3.7%) \| \| + num_maps: 1000 \| : <before> \| <after> \| +-+ local_storage cache sequential get +----------------------+---------------------- \| +- hits throughput \| 0.355 M ops/s \| 0.354 M ops/s ( ~ ) \| +- hits latency \| 2826.538 ns/op \| 2827.139 ns/op ( ~ ) \| +- important_hits throughput \| 0.000 M ops/s \| 0.000 M ops/s ( ~ ) \| : \| : <before> \| <after> \| +-+ local_storage cache interleaved get +----------------------+---------------------- \| +- hits throughput \| 0.404 M ops/s \| 0.403 M ops/s ( ~ ) \| +- hits latency \| 2481.190 ns/op \| 2487.555 ns/op ( ~ ) \| +- important_hits throughput \| 0.102 M ops/s \| 0.101 M ops/s ( ~ ) The on_lookup test in {cgrp,task}_ls_recursion.c is removed because the bpf_local_storage_lookup is no longer traceable and adding tracepoint will make the compiler generate worse code: https://lore.kernel.org/bpf/ZcJmok64Xqv6l4ZS@elver.google.com/ Signed-off-by: Marco Elver <elver@google.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240207122626.3508658-1-elver@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2024-02-11 14:06:24 -08:00
Linus Torvalds	2766f59ca4	Merge tag 'timers_urgent_for_v6.8_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Borislav Petkov: - Make sure a warning is issued when a hrtimer gets queued after the timers have been migrated on the CPU down path and thus said timer will get ignored * tag 'timers_urgent_for_v6.8_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimer: Report offline hrtimer enqueue	2024-02-11 11:44:14 -08:00
Linus Torvalds	7521f258ea	Merge tag 'mm-hotfixes-stable-2024-02-10-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "21 hotfixes. 12 are cc:stable and the remainder pertain to post-6.7 issues or aren't considered to be needed in earlier kernel versions" * tag 'mm-hotfixes-stable-2024-02-10-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits) nilfs2: fix potential bug in end_buffer_async_write mm/damon/sysfs-schemes: fix wrong DAMOS tried regions update timeout setup nilfs2: fix hang in nilfs_lookup_dirty_data_buffers() MAINTAINERS: Leo Yan has moved mm/zswap: don't return LRU_SKIP if we have dropped lru lock fs,hugetlb: fix NULL pointer dereference in hugetlbs_fill_super mailmap: switch email address for John Moon mm: zswap: fix objcg use-after-free in entry destruction mm/madvise: don't forget to leave lazy MMU mode in madvise_cold_or_pageout_pte_range() arch/arm/mm: fix major fault accounting when retrying under per-VMA lock selftests: core: include linux/close_range.h for CLOSE_RANGE_* macros mm/memory-failure: fix crash in split_huge_page_to_list from soft_offline_page mm: memcg: optimize parent iteration in memcg_rstat_updated() nilfs2: fix data corruption in dsync block recovery for small block sizes mm/userfaultfd: UFFDIO_MOVE implementation should use ptep_get() exit: wait_task_zombie: kill the no longer necessary spin_lock_irq(siglock) fs/proc: do_task_stat: use sig->stats_lock to gather the threads/children stats fs/proc: do_task_stat: move thread_group_cputime_adjusted() outside of lock_task_sighand() getrusage: use sig->stats_lock rather than lock_task_sighand() getrusage: move thread_group_cputime_adjusted() outside of lock_task_sighand() ...	2024-02-10 15:28:07 -08:00
Oleg Nesterov	81b9d8ac06	pidfd: change pidfd_send_signal() to respect PIDFD_THREAD Turn kill_pid_info() into kill_pid_info_type(), this allows to pass any pid_type to group_send_sig_info(), despite its name it should work fine even if type = PIDTYPE_PID. Change pidfd_send_signal() to use PIDTYPE_PID or PIDTYPE_TGID depending on PIDFD_THREAD. While at it kill another TODO comment in pidfd_show_fdinfo(). As Christian expains fdinfo reports f_flags, userspace can already detect PIDFD_THREAD. Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240209130650.GA8048@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-10 22:37:23 +01:00

... 6 7 8 9 10 ...

44306 Commits