linux

Author	SHA1	Message	Date
Steven Rostedt (Google)	b049525855	ring-buffer: Have saved event hold the entire event For the ring buffer iterator (non-consuming read), the event needs to be copied into the iterator buffer to make sure that a writer does not overwrite it while the user is reading it. If a write happens during the copy, the buffer is simply discarded. But the temp buffer itself was not big enough. The allocation of the buffer was only BUF_MAX_DATA_SIZE, which is the maximum data size that can be passed into the ring buffer and saved. But the temp buffer needs to hold the meta data as well. That would be BUF_PAGE_SIZE and not BUF_MAX_DATA_SIZE. Link: https://lore.kernel.org/linux-trace-kernel/20231212072558.61f76493@gandalf.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: `785888c544` ("ring-buffer: Have rb_iter_head_event() handle concurrent writer") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-12-12 20:56:10 -05:00
Steven Rostedt (Google)	9e45e39dc2	ring-buffer: Do not update before stamp when switching sub-buffers The ring buffer timestamps are synchronized by two timestamp placeholders. One is the "before_stamp" and the other is the "write_stamp" (sometimes referred to as the "after stamp" but only in the comments. These two stamps are key to knowing how to handle nested events coming in with a lockless system. When moving across sub-buffers, the before stamp is updated but the write stamp is not. There's an effort to put back the before stamp to something that seems logical in case there's nested events. But as the current event is about to cross sub-buffers, and so will any new nested event that happens, updating the before stamp is useless, and could even introduce new race conditions. The first event on a sub-buffer simply uses the sub-buffer's timestamp and keeps a "delta" of zero. The "before_stamp" and "write_stamp" are not used in the algorithm in this case. There's no reason to try to fix the before_stamp when this happens. As a bonus, it removes a cmpxchg() when crossing sub-buffers! Link: https://lore.kernel.org/linux-trace-kernel/20231211114420.36dde01b@gandalf.local.home Cc: stable@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: `a389d86f7f` ("ring-buffer: Have nested events still record running time stamp") Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-12-12 20:55:45 -05:00
Andrew Morton	fc0fbad122	merge mm-hotfixes-stable into mm-nonmm-stable to pick up depended-upon changes	2023-12-12 17:39:11 -08:00
Yuntao Wang	1dd11e9773	crash_core: fix the check for whether crashkernel is from high memory If crash_base is equal to CRASH_ADDR_LOW_MAX, it also indicates that the crashkernel memory is allocated from high memory. However, the current check only considers the case where crash_base is greater than CRASH_ADDR_LOW_MAX. Fix it. The runtime effects is that crashkernel high memory is successfully reserved, whereas the crashkernel low memory is bypassed in this case, then kdump kernel bootup will fail because of no low memory under 4G. This patch also includes some minor cleanups. Link: https://lkml.kernel.org/r/20231209141438.77233-1-ytcoode@gmail.com Fixes: `0ab97169aa` ("crash_core: add generic function to do reservation") Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Zhen Lei <thunder.leizhen@huawei.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-12-12 17:20:18 -08:00
Ignat Korchagin	c41bd25141	kexec: drop dependency on ARCH_SUPPORTS_KEXEC from CRASH_DUMP In commit f8ff23429c62 ("kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP") we tried to fix a config regression, where CONFIG_CRASH_DUMP required CONFIG_KEXEC. However, it was not enough at least for arm64 platforms. While further testing the patch with our arm64 config I noticed that CONFIG_CRASH_DUMP is unavailable in menuconfig. This is because CONFIG_CRASH_DUMP still depends on the new CONFIG_ARCH_SUPPORTS_KEXEC introduced in commit `91506f7e5d` ("arm64/kexec: refactor for kernel/Kconfig.kexec") and on arm64 CONFIG_ARCH_SUPPORTS_KEXEC requires CONFIG_PM_SLEEP_SMP=y, which in turn requires either CONFIG_SUSPEND=y or CONFIG_HIBERNATION=y neither of which are set in our config. Given that we already established that CONFIG_KEXEC (which is a switch for kexec system call itself) is not required for CONFIG_CRASH_DUMP drop CONFIG_ARCH_SUPPORTS_KEXEC dependency as well. The arm64 kernel builds just fine with CONFIG_CRASH_DUMP=y and with both CONFIG_KEXEC=n and CONFIG_KEXEC_FILE=n after f8ff23429c62 ("kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP") and this patch are applied given that the necessary shared bits are included via CONFIG_KEXEC_CORE dependency. [bhe@redhat.com: don't export some symbols when CONFIG_MMU=n] Link: https://lkml.kernel.org/r/ZW03ODUKGGhP1ZGU@MiWiFi-R3L-srv [bhe@redhat.com: riscv, kexec: fix dependency of two items] Link: https://lkml.kernel.org/r/ZW04G/SKnhbE5mnX@MiWiFi-R3L-srv Link: https://lkml.kernel.org/r/20231129220409.55006-1-ignat@cloudflare.com Fixes: `91506f7e5d` ("arm64/kexec: refactor for kernel/Kconfig.kexec") Signed-off-by: Ignat Korchagin <ignat@cloudflare.com> Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: <stable@vger.kernel.org> # 6.6+: f8ff234: kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP Cc: <stable@vger.kernel.org> # 6.6+ Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-12-12 17:20:16 -08:00
Steven Rostedt (Google)	d06aff1cb1	tracing: Update snapshot buffer on resize if it is allocated The snapshot buffer is to mimic the main buffer so that when a snapshot is needed, the snapshot and main buffer are swapped. When the snapshot buffer is allocated, it is set to the minimal size that the ring buffer may be at and still functional. When it is allocated it becomes the same size as the main ring buffer, and when the main ring buffer changes in size, it should do. Currently, the resize only updates the snapshot buffer if it's used by the current tracer (ie. the preemptirqsoff tracer). But it needs to be updated anytime it is allocated. When changing the size of the main buffer, instead of looking to see if the current tracer is utilizing the snapshot buffer, just check if it is allocated to know if it should be updated or not. Also fix typo in comment just above the code change. Link: https://lore.kernel.org/linux-trace-kernel/20231210225447.48476a6a@rorschach.local.home Cc: stable@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: `ad909e21bb` ("tracing: Add internal tracing_snapshot() functions") Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-12-12 19:00:38 -05:00
Steven Rostedt (Google)	17d8017581	ring-buffer: Fix memory leak of free page Reading the ring buffer does a swap of a sub-buffer within the ring buffer with a empty sub-buffer. This allows the reader to have full access to the content of the sub-buffer that was swapped out without having to worry about contention with the writer. The readers call ring_buffer_alloc_read_page() to allocate a page that will be used to swap with the ring buffer. When the code is finished with the reader page, it calls ring_buffer_free_read_page(). Instead of freeing the page, it stores it as a spare. Then next call to ring_buffer_alloc_read_page() will return this spare instead of calling into the memory management system to allocate a new page. Unfortunately, on freeing of the ring buffer, this spare page is not freed, and causes a memory leak. Link: https://lore.kernel.org/linux-trace-kernel/20231210221250.7b9cc83c@rorschach.local.home Cc: stable@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: `73a757e631` ("ring-buffer: Return reader page back into existing ring buffer") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-12-12 19:00:37 -05:00
Steven Rostedt (Google)	b55b0a0d7c	tracing: Have large events show up as '[LINE TOO BIG]' instead of nothing If a large event was added to the ring buffer that is larger than what the trace_seq can handle, it just drops the output: ~# cat /sys/kernel/tracing/trace # tracer: nop # # entries-in-buffer/entries-written: 2/2 #P:8 # # _-----=> irqs-off/BH-disabled # / _----=> need-resched # \| / _---=> hardirq/softirq # \|\| / _--=> preempt-depth # \|\|\| / _-=> migrate-disable # \|\|\|\| / delay # TASK-PID CPU# \|\|\|\|\| TIMESTAMP FUNCTION # \| \| \| \|\|\|\|\| \| \| <...>-859 [001] ..... 141.118951: tracing_mark_write <...>-859 [001] ..... 141.148201: tracing_mark_write: 78901234 Instead, catch this case and add some context: ~# cat /sys/kernel/tracing/trace # tracer: nop # # entries-in-buffer/entries-written: 2/2 #P:8 # # _-----=> irqs-off/BH-disabled # / _----=> need-resched # \| / _---=> hardirq/softirq # \|\| / _--=> preempt-depth # \|\|\| / _-=> migrate-disable # \|\|\|\| / delay # TASK-PID CPU# \|\|\|\|\| TIMESTAMP FUNCTION # \| \| \| \|\|\|\|\| \| \| <...>-852 [001] ..... 121.550551: tracing_mark_write[LINE TOO BIG] <...>-852 [001] ..... 121.550581: tracing_mark_write: 78901234 This now emulates the same output as trace_pipe. Link: https://lore.kernel.org/linux-trace-kernel/20231209171058.78c1a026@gandalf.local.home Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-12-12 19:00:36 -05:00
Steven Rostedt (Google)	b3ae7b67b8	ring-buffer: Fix writing to the buffer with max_data_size The maximum ring buffer data size is the maximum size of data that can be recorded on the ring buffer. Events must be smaller than the sub buffer data size minus any meta data. This size is checked before trying to allocate from the ring buffer because the allocation assumes that the size will fit on the sub buffer. The maximum size was calculated as the size of a sub buffer page (which is currently PAGE_SIZE minus the sub buffer header) minus the size of the meta data of an individual event. But it missed the possible adding of a time stamp for events that are added long enough apart that the event meta data can't hold the time delta. When an event is added that is greater than the current BUF_MAX_DATA_SIZE minus the size of a time stamp, but still less than or equal to BUF_MAX_DATA_SIZE, the ring buffer would go into an infinite loop, looking for a page that can hold the event. Luckily, there's a check for this loop and after 1000 iterations and a warning is emitted and the ring buffer is disabled. But this should never happen. This can happen when a large event is added first, or after a long period where an absolute timestamp is prefixed to the event, increasing its size by 8 bytes. This passes the check and then goes into the algorithm that causes the infinite loop. For events that are the first event on the sub-buffer, it does not need to add a timestamp, because the sub-buffer itself contains an absolute timestamp, and adding one is redundant. The fix is to check if the event is to be the first event on the sub-buffer, and if it is, then do not add a timestamp. This also fixes 32 bit adding a timestamp when a read of before_stamp or write_stamp is interrupted. There's still no need to add that timestamp if the event is going to be the first event on the sub buffer. Also, if the buffer has "time_stamp_abs" set, then also check if the length plus the timestamp is greater than the BUF_MAX_DATA_SIZE. Link: https://lore.kernel.org/all/20231212104549.58863438@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20231212071837.5fdd6c13@gandalf.local.home Link: https://lore.kernel.org/linux-trace-kernel/20231212111617.39e02849@gandalf.local.home Cc: stable@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Fixes: `a4543a2fa9` ("ring-buffer: Get timestamp after event is allocated") Fixes: `58fbc3c632` ("ring-buffer: Consolidate add_timestamp to remove some branches") Reported-by: Kent Overstreet <kent.overstreet@linux.dev> # (on IRC) Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-12-12 19:00:19 -05:00
Andrei Matei	745e031130	bpf: Comment on check_mem_size_reg This patch adds a comment to check_mem_size_reg -- a function whose meaning is not very transparent. The function implicitly deals with two registers connected by convention, which is not obvious. Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231210225149.67639-1-andreimatei1@gmail.com	2023-12-12 15:35:32 -08:00
Yang Li	56c26d5ad8	bpf: Remove unused backtrack_state helper functions The function are defined in the verifier.c file, but not called elsewhere, so delete the unused function. kernel/bpf/verifier.c:3448:20: warning: unused function 'bt_set_slot' kernel/bpf/verifier.c:3453:20: warning: unused function 'bt_clear_slot' kernel/bpf/verifier.c:3488:20: warning: unused function 'bt_is_slot_set' Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20231212005436.103829-1-yang.lee@linux.alibaba.com Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=7714	2023-12-12 09:52:07 -08:00
Christian Brauner	4e94ddfe2a	file: remove __receive_fd() Honestly, there's little value in having a helper with and without that int __user *ufd argument. It's just messy and doesn't really give us anything. Just expose receive_fd() with that argument and get rid of that helper. Link: https://lore.kernel.org/r/20231130-vfs-files-fixes-v1-5-e73ca6f4ea83@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-12-12 14:24:14 +01:00
Jason Gunthorpe	8f23f5dba6	iommu: Change kconfig around IOMMU_SVA Linus suggested that the kconfig here is confusing: https://lore.kernel.org/all/CAHk-=wgUiAtiszwseM1p2fCJ+sC4XWQ+YN4TanFhUgvUqjr9Xw@mail.gmail.com/ Let's break it into three kconfigs controlling distinct things: - CONFIG_IOMMU_MM_DATA controls if the mm_struct has the additional fields for the IOMMU. Currently only PASID, but later patches store a struct iommu_mm_data * - CONFIG_ARCH_HAS_CPU_PASID controls if the arch needs the scheduling bit for keeping track of the ENQCMD instruction. x86 will select this if IOMMU_SVA is enabled - IOMMU_SVA controls if the IOMMU core compiles in the SVA support code for iommu driver use and the IOMMU exported API This way ARM will not enable CONFIG_ARCH_HAS_CPU_PASID Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20231027000525.1278806-2-tina.zhang@intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>	2023-12-12 10:11:27 +01:00
Andrii Nakryiko	1a1ad782dc	bpf: tidy up exception callback management a bit Use the fact that we are passing subprog index around and have a corresponding struct bpf_subprog_info in bpf_verifier_env for each subprogram. We don't need to separately pass around a flag whether subprog is exception callback or not, each relevant verifier function can determine this using provided subprog index if we maintain bpf_subprog_info properly. Also move out exception callback-specific logic from btf_prepare_func_args(), keeping it generic. We can enforce all these restriction right before exception callback verification pass. We add out parameter, arg_cnt, for now, but this will be unnecessary with subsequent refactoring and will be removed. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231204233931.49758-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-11 19:23:32 -08:00
Andrii Nakryiko	22b769bb4f	bpf: emit more dynptr information in verifier log Emit dynptr type for CONST_PTR_TO_DYNPTR register. Also emit id, ref_obj_id, and dynptr_id fields for STACK_DYNPTR stack slots. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231204233931.49758-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-11 19:21:22 -08:00
Andrii Nakryiko	1e68485d82	bpf: log PTR_TO_MEM memory size in verifier log Emit valid memory size addressable through PTR_TO_MEM register. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231204233931.49758-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-11 19:21:22 -08:00
Frederic Weisbecker	c21357e446	srcu: Explain why callbacks invocations can't run concurrently If an SRCU barrier is queued while callbacks are running and a new callbacks invocator for the same sdp were to run concurrently, the RCU barrier might execute too early. As this requirement is non-obvious, make sure to keep a record. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>	2023-12-12 02:41:17 +05:30
Frederic Weisbecker	94c55b9e21	srcu: No need to advance/accelerate if no callback enqueued While in grace period start, there is nothing to accelerate and therefore no need to advance the callbacks either if no callback is to be enqueued. Spare these needless operations in this case. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>	2023-12-12 02:41:16 +05:30
Frederic Weisbecker	20eb414239	srcu: Remove superfluous callbacks advancing from srcu_gp_start() Callbacks advancing on SRCU must be performed on two specific places: 1) On enqueue time in order to make room for the acceleration of the new callback. 2) On invocation time in order to move the callbacks ready to invoke. Any other callback advancing callsite is needless. Remove the remaining one in srcu_gp_start(). Co-developed-by: Yong He <zhuangel570@gmail.com> Signed-off-by: Yong He <zhuangel570@gmail.com> Co-developed-by: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Joel Fernandes <joel@joelfernandes.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Co-developed-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>	2023-12-12 02:40:45 +05:30
Li zeming	4ac934b1aa	PM: hibernate: Do not initialize error in snapshot_write_next() The error variable in snapshot_write_next() gets a value before it is used, so don't initialize it to 0 upfront. Signed-off-by: Li zeming <zeming@nfschina.com> [ rjw: Subject and changelog rewrite ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2023-12-11 22:09:52 +01:00
Li zeming	bbeaa4691f	PM: hibernate: Do not initialize error in swap_write_page() 'error' first receives the function result before it is used, and it does not need to be assigned a value during definition. Signed-off-by: Li zeming <zeming@nfschina.com> [ rjw: Subject rewrite ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2023-12-11 22:06:15 +01:00
Wang chaodong	a1ca8295ee	PM: hibernate: Drop unnecessary local variable initialization It is not necessary to intialize the error variable in create_basic_memory_bitmaps(), because it is only read after being assigned a value. Signed-off-by: Wang chaodong <chaodong@nfschina.com> [ rjw: Subject and changelog rewrite ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2023-12-11 22:02:48 +01:00
Paul E. McKenney	4e58aaeebb	rcu: Restrict access to RCU CPU stall notifiers Although the RCU CPU stall notifiers can be useful for dumping state when tracking down delicate forward-progress bugs where NUMA effects cause cache lines to be delivered to a given CPU regularly, but always in a state that prevents that CPU from making forward progress. These bugs can be detected by the RCU CPU stall-warning mechanism, but in some cases, the stall-warnings printk()s disrupt the forward-progress bug before any useful state can be obtained. Unfortunately, the notifier mechanism added by commit `5b404fdaba` ("rcu: Add RCU CPU stall notifier") can make matters worse if used at all carelessly. For example, if the stall warning was caused by a lock not being released, then any attempt to acquire that lock in the notifier will hang. This will prevent not only the notifier from producing any useful output, but it will also prevent the stall-warning message from ever appearing. This commit therefore hides this new RCU CPU stall notifier mechanism under a new RCU_CPU_STALL_NOTIFIER Kconfig option that depends on both DEBUG_KERNEL and RCU_EXPERT. In addition, the rcupdate.rcu_cpu_stall_notifiers=1 kernel boot parameter must also be specified. The RCU_CPU_STALL_NOTIFIER Kconfig option's help text contains a warning and explains the dangers of careless use, recommending lockless notifier code. In addition, a WARN() is triggered each time that an attempt is made to register a stall-warning notifier in kernels built with CONFIG_RCU_CPU_STALL_NOTIFIER=y. This combination of measures will keep use of this mechanism confined to debug kernels and away from routine deployments. [ paulmck: Apply Dan Carpenter feedback. ] Fixes: `5b404fdaba` ("rcu: Add RCU CPU stall notifier") Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>	2023-12-12 02:31:22 +05:30
Paul E. McKenney	18966f7b94	rcu-tasks: Mark RCU Tasks accesses to current->rcu_tasks_idle_cpu The task_struct structure's ->rcu_tasks_idle_cpu can be concurrently read and written from the RCU Tasks grace-period kthread and from the CPU on which the task_struct structure's task is running. This commit therefore marks the accesses appropriately. Reported-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Neeraj Upadhyay (AMD) <neeraj.iitr10@gmail.com>	2023-12-12 02:22:47 +05:30
Li Zhe	9d02330abd	softlockup: serialized softlockup's log If multiple CPUs trigger softlockup at the same time with 'softlockup_all_cpu_backtrace=0', the softlockup's logs will appear staggeredly in dmesg, which will affect the viewing of the logs for developer. Since the code path for outputting softlockup logs is not a kernel hotspot and the performance requirements for the code are not strict, locks are used to serialize the softlockup log output to improve the readability of the logs. Link: https://lkml.kernel.org/r/20231123084022.10302-1-lizhe.67@bytedance.com Signed-off-by: Li Zhe <lizhe.67@bytedance.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Douglas Anderson <dianders@chromium.org> Cc: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: John Ogness <john.ogness@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-12-10 17:21:44 -08:00
Baoquan He	b3ba234171	kexec_file: load kernel at top of system RAM if required Patch series "kexec_file: Load kernel at top of system RAM if required". Justification: ============== Kexec_load interface has been doing top down searching and loading kernel/initrd/purgtory etc to prepare for kexec reboot. In that way, the benefits are that it avoids to consume and fragment limited low memory which satisfy DMA buffer allocation and big chunk of continuous memory during system init; and avoids to stir with BIOS/FW reserved or occupied areas, or corner case handling/work around/quirk occupied areas when doing system init. By the way, the top-down searching and loading of kexec-ed kernel is done in user space utility code. For kexec_file loading, even if kexec_buf.top_down is 'true', it's simply ignored. It calls walk_system_ram_res() directly to go through all resources of System RAM bottom up, to find an available memory region, then call locate_mem_hole_callback() to allocate memory in that found memory region from top to down. This is not expected and inconsistent with kexec_load. Implementation =============== In patch 1, introduce a new function walk_system_ram_res_rev() which is a variant of walk_system_ram_res(), it walks through a list of all the resources of System RAM in reversed order, i.e., from higher to lower. In patch 2, check if kexec_buf.top_down is 'true' in kexec_walk_resources(), if yes, call walk_system_ram_res_rev() to find memory region of system RAM from top to down to load kernel/initrd etc. Background information: ======================= And I ever tried this in the past in a different way, please see below link. In the post, I tried to adjust struct sibling linking code, replace the the singly linked list with list_head so that walk_system_ram_res_rev() can be implemented in a much easier way. Finally I failed. https://lore.kernel.org/all/20180718024944.577-4-bhe@redhat.com/ This time, I picked up the patch from AKASHI Takahiro's old post and made some change to take as the current patch 1: https://lists.infradead.org/pipermail/linux-arm-kernel/2017-September/531456.html This patch (of 2): Kexec_load interface has been doing top down searching and loading kernel/initrd/purgtory etc to prepare for kexec reboot. In that way, the benefits are that it avoids to consume and fragment limited low memory which satisfy DMA buffer allocation and big chunk of continuous memory during system init; and avoids to stir with BIOS/FW reserved or occupied areas, or corner case handling/work around/quirk occupied areas when doing system init. By the way, the top-down searching and loading of kexec-ed kernel is done in user space utility code. For kexec_file loading, even if kexec_buf.top_down is 'true', it's simply ignored. It calls walk_system_ram_res() directly to go through all resources of System RAM bottom up, to find an available memory region, then call locate_mem_hole_callback() to allocate memory in that found memory region from top to down. This is not expected and inconsistent with kexec_load. Here check if kexec_buf.top_down is 'true' in kexec_walk_resources(), if yes, call the newly added walk_system_ram_res_rev() to find memory region of system RAM from top to down to load kernel/initrd etc. Link: https://lkml.kernel.org/r/20231114091658.228030-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20231114091658.228030-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org> Cc: Baoquan He <bhe@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-12-10 17:21:44 -08:00
Baoquan He	7acf164b25	resource: add walk_system_ram_res_rev() This function, being a variant of walk_system_ram_res() introduced in commit `8c86e70ace` ("resource: provide new functions to walk through resources"), walks through a list of all the resources of System RAM in reversed order, i.e., from higher to lower. It will be used in kexec_file code to load kernel, initrd etc when preparing kexec reboot. Link: https://lkml.kernel.org/r/ZVTA6z/06cLnWKUz@MiWiFi-R3L-srv Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-12-10 17:21:44 -08:00
Arnd Bergmann	b1c3efe079	sched: fair: move unused stub functions to header These four functions have a normal definition for CONFIG_FAIR_GROUP_SCHED, and empty one that is only referenced when FAIR_GROUP_SCHED is disabled but CGROUP_SCHED is still enabled. If both are turned off, the functions are still defined but the misisng prototype causes a W=1 warning: kernel/sched/fair.c:12544:6: error: no previous prototype for 'free_fair_sched_group' kernel/sched/fair.c:12546:5: error: no previous prototype for 'alloc_fair_sched_group' kernel/sched/fair.c:12553:6: error: no previous prototype for 'online_fair_sched_group' kernel/sched/fair.c:12555:6: error: no previous prototype for 'unregister_fair_sched_group' Move the alternatives into the header as static inline functions with the correct combination of #ifdef checks to avoid the warning without adding even more complexity. [A different patch with the same description got applied by accident and was later reverted, but the original patch is still missing] Link: https://lkml.kernel.org/r/20231123110506.707903-4-arnd@kernel.org Fixes: `7aa55f2a59` ("sched/fair: Move unused stub functions to header") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kees Cook <keescook@chromium.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nicolas Schier <nicolas@fjasle.eu> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rich Felker <dalias@libc.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Tudor Ambarus <tudor.ambarus@linaro.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-12-10 17:21:43 -08:00
Uros Bizjak	0311d82724	kexec: use atomic_try_cmpxchg in crash_kexec Use atomic_try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in crash_kexec(). x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg. No functional change intended. Link: https://lkml.kernel.org/r/20231114161228.108516-1-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-12-10 17:21:33 -08:00
Oleg Nesterov	27bbb2a0fd	__ptrace_unlink: kill the obsolete "FIXME" code The corner case described by the comment is no longer possible after the commit `7b3c36fc4c` ("ptrace: fix task_join_group_stop() for the case when current is traced"), task_join_group_stop() ensures that the new thread has the correct signr in JOBCTL_STOP_SIGMASK regardless of ptrace. Link: https://lkml.kernel.org/r/20231121162650.GA6635@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-12-10 17:21:32 -08:00
Oleg Nesterov	b454ec2922	kernel/signal.c: simplify force_sig_info_to_task(), kill recalc_sigpending_and_wake() The purpose of recalc_sigpending_and_wake() is not clear, it looks "obviously unneeded" because we are going to send the signal which can't be blocked or ignored. Add the comment to explain why we can't rely on send_signal_locked() and make this logic more simple/explicit. recalc_sigpending_and_wake() has no other users, it can die. In fact I think we don't even need signal_wake_up(), the target task must be either current or a TASK_TRACED child, otherwise the usage of siglock is not safe. But this needs another change. Link: https://lkml.kernel.org/r/20231120151649.GA15995@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-12-10 17:21:32 -08:00
Heiko Carstens	3888750e21	arch: remove ARCH_TASK_STRUCT_ALLOCATOR IA-64 was the only architecture which selected ARCH_TASK_STRUCT_ALLOCATOR. IA-64 was removed with commit `cf8e865810` ("arch: Remove Itanium (IA-64) architecture"). Therefore remove support for ARCH_THREAD_STACK_ALLOCATOR as well. Link: https://lkml.kernel.org/r/20231116133638.1636277-3-hca@linux.ibm.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-12-10 17:21:31 -08:00
Heiko Carstens	f72709ab69	arch: remove ARCH_THREAD_STACK_ALLOCATOR Patch series "Remove unused code after IA-64 removal". While looking into something different I noticed that there are a couple of Kconfig options which were only selected by IA-64 and which are now unused. So remove them and simplify the code a bit. This patch (of 3): IA-64 was the only architecture which selected ARCH_THREAD_STACK_ALLOCATOR. IA-64 was removed with commit `cf8e865810` ("arch: Remove Itanium (IA-64) architecture"). Therefore remove support for ARCH_THREAD_STACK_ALLOCATOR as well. Link: https://lkml.kernel.org/r/20231116133638.1636277-1-hca@linux.ibm.com Link: https://lkml.kernel.org/r/20231116133638.1636277-2-hca@linux.ibm.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-12-10 17:21:30 -08:00
Oleg Nesterov	61a7a5e25f	introduce for_other_threads(p, t) Cosmetic, but imho it makes the usage look more clear and simple, the new helper doesn't require to initialize "t". After this change while_each_thread() has only 3 users, and it is only used in the do/while loops. Link: https://lkml.kernel.org/r/20231030155710.GA9095@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-12-10 17:21:25 -08:00
Dongmin Lee	a9a1d6ad66	kernel/reboot: explicitly notify if halt occurred instead of power off When kernel_can_power_off() returns false, and reboot has called with LINUX_REBOOT_CMD_POWER_OFF, kernel_halt() will be initiated instead of actual power off function. However, in this situation, Kernel never explicitly notifies user that system halted instead of requested power off. Since halt and power off perform different behavior, and user initiated reboot call with power off command, not halt, This could be unintended behavior to user, like this: ~ # poweroff -f [ 3.581482] reboot: System halted Therefore, this explicitly notifies user that poweroff is not available, and halting has been occured as an alternative behavior instead: ~ # poweroff -f [ 4.123668] reboot: Power off not available: System halted instead [akpm@linux-foundation.org: tweak comment text] Link: https://lkml.kernel.org/r/20231104113320.72440-1-ldmldm05@gmail.com Signed-off-by: Dongmin Lee <ldmldm05@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-12-10 17:21:25 -08:00
Peng Zhang	d240629148	fork: use __mt_dup() to duplicate maple tree in dup_mmap() In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then directly replacing the entries of VMAs in the new maple tree can result in better performance. __mt_dup() uses DFS pre-order to duplicate the maple tree, so it is efficient. The average time complexity of __mt_dup() is O(n), where n is the number of VMAs. The proof of the time complexity is provided in the commit log that introduces __mt_dup(). After duplicating the maple tree, each element is traversed and replaced (ignoring the cases of deletion, which are rare). Since it is only a replacement operation for each element, this process is also O(n). Analyzing the exact time complexity of the previous algorithm is challenging because each insertion can involve appending to a node, pushing data to adjacent nodes, or even splitting nodes. The frequency of each action is difficult to calculate. The worst-case scenario for a single insertion is when the tree undergoes splitting at every level. If we consider each insertion as the worst-case scenario, we can determine that the upper bound of the time complexity is O(n*log(n)), although this is a loose upper bound. However, based on the test data, it appears that the actual time complexity is likely to be O(n). As the entire maple tree is duplicated using __mt_dup(), if dup_mmap() fails, there will be a portion of VMAs that have not been duplicated in the maple tree. To handle this, we mark the failure point with XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, stop releasing VMAs that have not been duplicated after this point. There is a "spawn" in byte-unixbench[1], which can be used to test the performance of fork(). I modified it slightly to make it work with different number of VMAs. Below are the test results. The first row shows the number of VMAs. The second and third rows show the number of fork() calls per ten seconds, corresponding to next-20231006 and the this patchset, respectively. The test results were obtained with CPU binding to avoid scheduler load balancing that could cause unstable results. There are still some fluctuations in the test results, but at least they are better than the original performance. 21 121 221 421 821 1621 3221 6421 12821 25621 51221 112100 76261 54227 34035 20195 11112 6017 3161 1606 802 393 114558 83067 65008 45824 28751 16072 8922 4747 2436 1233 599 2.19% 8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42% [1] https://github.com/kdlucas/byte-unixbench/tree/master Link: https://lkml.kernel.org/r/20231027033845.90608-11-zhangpeng.00@bytedance.com Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Mike Christie <michael.christie@oracle.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-12-10 16:51:34 -08:00
Linus Torvalds	3a87498869	Merge tag 'sched_urgent_for_v6.7_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: - Make sure tasks are thawed exactly and only once to avoid their state getting corrupted * tag 'sched_urgent_for_v6.7_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: freezer,sched: Do not restore saved_state of a thawed task	2023-12-10 11:09:16 -08:00
Linus Torvalds	537ccb5d28	Merge tag 'perf_urgent_for_v6.7_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf event fix from Borislav Petkov: - Make sure perf event size validation is done on every event in the group * tag 'perf_urgent_for_v6.7_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix perf_event_validate_size()	2023-12-10 11:03:15 -08:00
David Vernet	a6de18f310	bpf: Add bpf_cpumask_weight() kfunc It can be useful to query how many bits are set in a cpumask. For example, if you want to perform special logic for the last remaining core that's set in a mask. Let's therefore add a new bpf_cpumask_weight() kfunc which checks how many bits are set in a mask. Signed-off-by: David Vernet <void@manifault.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231207210843.168466-2-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-09 21:37:33 -08:00
Andrii Nakryiko	482d548d40	bpf: handle fake register spill to stack with BPF_ST_MEM instruction When verifier validates BPF_ST_MEM instruction that stores known constant to stack (e.g., (u64 )(r10 - 8) = 123), it effectively spills a fake register with a constant (but initially imprecise) value to a stack slot. Because read-side logic treats it as a proper register fill from stack slot, we need to mark such stack slot initialization as INSN_F_STACK_ACCESS instruction to stop precision backtracking from missing it. Fixes: `41f6f64e69` ("bpf: support non-r10 register spill/fill to/from stack in precision tracking") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231209010958.66758-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-09 18:43:20 -08:00
Hou Tao	06e5c999f1	bpf: Set uattr->batch.count as zero before batched update or deletion generic_map_{delete,update}_batch() doesn't set uattr->batch.count as zero before it tries to allocate memory for key. If the memory allocation fails, the value of uattr->batch.count will be incorrect. Fix it by setting uattr->batch.count as zero beore batched update or deletion. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231208102355.2628918-6-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-09 18:12:34 -08:00
Hou Tao	67ad2c73ff	bpf: Only call maybe_wait_bpf_programs() when map operation succeeds There is no need to call maybe_wait_bpf_programs() if update or deletion operation fails. So only call maybe_wait_bpf_programs() if update or deletion operation succeeds. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231208102355.2628918-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-09 18:12:34 -08:00
Hou Tao	012772581d	bpf: Add missed maybe_wait_bpf_programs() for htab of maps When doing batched lookup and deletion operations on htab of maps, maybe_wait_bpf_programs() is needed to ensure all programs don't use the inner map after the bpf syscall returns. Instead of adding the wait in __htab_map_lookup_and_delete_batch(), adding the wait in bpf_map_do_batch() and also removing the calling of maybe_wait_bpf_programs() from generic_map_{delete,update}_batch(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231208102355.2628918-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-09 18:12:34 -08:00
Hou Tao	37ba5b59d6	bpf: Call maybe_wait_bpf_programs() only once for generic_map_update_batch() Just like commit `9087c6ff8d` ("bpf: Call maybe_wait_bpf_programs() only once from generic_map_delete_batch()"), there is also no need to call maybe_wait_bpf_programs() for each update in batched update, so only call it once in generic_map_update_batch(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231208102355.2628918-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-09 18:12:33 -08:00
Hou Tao	c26f2a8901	bpf: Remove unnecessary wait from bpf_map_copy_value() Both map_lookup_elem() and generic_map_lookup_batch() use bpf_map_copy_value() to lookup and copy the value, and there is no update operation in bpf_map_copy_value(), so just remove the invocation of maybe_wait_bpf_programs() from it. Fixes: `15c14a3dca` ("bpf: Add bpf_map_{value_size, update_value, map_copy_value} functions") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231208102355.2628918-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-12-09 18:12:33 -08:00
Yafang Shao	73d9eb340d	bpf: Enable bpf_cgrp_storage for cgroup1 non-attach case In the current cgroup1 environment, associating operations between cgroups and applications in a BPF program requires storing a mapping of cgroup_id to application either in a hash map or maintaining it in userspace. However, by enabling bpf_cgrp_storage for cgroup1, it becomes possible to conveniently store application-specific information in cgroup-local storage and utilize it within BPF programs. Furthermore, enabling this feature for cgroup1 involves minor modifications for the non-attach case, streamlining the process. However, when it comes to enabling this functionality for the cgroup1 attach case, it presents challenges. Therefore, the decision is to focus on enabling it solely for the cgroup1 non-attach case at present. If attempting to attach to a cgroup1 fd, the operation will simply fail with the error code -EBADF. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20231206115326.4295-2-laoar.shao@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2023-12-08 17:08:18 -08:00
Andrei Matei	2929bfac00	bpf: Minor cleanup around stack bounds Push the rounding up of stack offsets into the function responsible for growing the stack, rather than relying on all the callers to do it. Uncertainty about whether the callers did it or not tripped up people in a previous review. Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20231208032519.260451-4-andreimatei1@gmail.com	2023-12-08 14:19:00 -08:00
Andrei Matei	6b4a64bafd	bpf: Fix accesses to uninit stack slots Privileged programs are supposed to be able to read uninitialized stack memory (ever since `6715df8d5`) but, before this patch, these accesses were permitted inconsistently. In particular, accesses were permitted above state->allocated_stack, but not below it. In other words, if the stack was already "large enough", the access was permitted, but otherwise the access was rejected instead of being allowed to "grow the stack". This undesired rejection was happening in two places: - in check_stack_slot_within_bounds() - in check_stack_range_initialized() This patch arranges for these accesses to be permitted. A bunch of tests that were relying on the old rejection had to change; all of them were changed to add also run unprivileged, in which case the old behavior persists. One tests couldn't be updated - global_func16 - because it can't run unprivileged for other reasons. This patch also fixes the tracking of the stack size for variable-offset reads. This second fix is bundled in the same commit as the first one because they're inter-related. Before this patch, writes to the stack using registers containing a variable offset (as opposed to registers with fixed, known values) were not properly contributing to the function's needed stack size. As a result, it was possible for a program to verify, but then to attempt to read out-of-bounds data at runtime because a too small stack had been allocated for it. Each function tracks the size of the stack it needs in bpf_subprog_info.stack_depth, which is maintained by update_stack_depth(). For regular memory accesses, check_mem_access() was calling update_state_depth() but it was passing in only the fixed part of the offset register, ignoring the variable offset. This was incorrect; the minimum possible value of that register should be used instead. This tracking is now fixed by centralizing the tracking of stack size in grow_stack_state(), and by lifting the calls to grow_stack_state() to check_stack_access_within_bounds() as suggested by Andrii. The code is now simpler and more convincingly tracks the correct maximum stack size. check_stack_range_initialized() can now rely on enough stack having been allocated for the access; this helps with the fix for the first issue. A few tests were changed to also check the stack depth computation. The one that fails without this patch is verifier_var_off:stack_write_priv_vs_unpriv. Fixes: `01f810ace9` ("bpf: Allow variable-offset stack access") Reported-by: Hao Sun <sunhao.th@gmail.com> Signed-off-by: Andrei Matei <andreimatei1@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231208032519.260451-3-andreimatei1@gmail.com Closes: https://lore.kernel.org/bpf/CABWLsev9g8UP_c3a=1qbuZUi20tGoUXoU07FPf-5FLvhOKOY+Q@mail.gmail.com/	2023-12-08 14:19:00 -08:00
Linus Torvalds	17894c2a7a	Merge tag 'trace-v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Snapshot buffer issues: 1. When instances started allowing latency tracers, it uses a snapshot buffer (another buffer that is not written to but swapped with the main buffer that is). The snapshot buffer needs to be the same size as the main buffer. But when the snapshot buffers were added to instances, the code to make the snapshot equal to the main buffer still was only doing it for the main buffer and not the instances. 2. Need to stop the current tracer when resizing the buffers. Otherwise there can be a race if the tracer decides to make a snapshot between resizing the main buffer and the snapshot buffer. 3. When a tracer is "stopped" in disables both the main buffer and the snapshot buffer. This needs to be done for instances and not only the main buffer, now that instances also have a snapshot buffer. - Buffered event for filtering issues: When filtering is enabled, because events can be dropped often, it is quicker to copy the event into a temp buffer and write that into the main buffer if it is not filtered or just drop the event if it is, than to write the event into the ring buffer and then try to discard it. This temp buffer is allocated and needs special synchronization to do so. But there were some issues with that: 1. When disabling the filter and freeing the buffer, a call to all CPUs is required to stop each per_cpu usage. But the code called smp_call_function_many() which does not include the current CPU. If the task is migrated to another CPU when it enables the CPUs via smp_call_function_many(), it will not enable the one it is currently on and this causes issues later on. Use on_each_cpu_mask() instead, which includes the current CPU. 2.When the allocation of the buffered event fails, it can give a warning. But the buffered event is just an optimization (it's still OK to write to the ring buffer and free it). Do not WARN in this case. 3.The freeing of the buffer event requires synchronization. First a counter is decremented to zero so that no new uses of it will happen. Then it sets the buffered event to NULL, and finally it frees the buffered event. There's a synchronize_rcu() between the counter decrement and the setting the variable to NULL, but only a smp_wmb() between that and the freeing of the buffer. It is theoretically possible that a user missed seeing the decrement, but will use the buffer after it is free. Another synchronize_rcu() is needed in place of that smp_wmb(). - ring buffer timestamps on 32 bit machines The ring buffer timestamp on 32 bit machines has to break the 64 bit number into multiple values as cmpxchg is required on it, and a 64 bit cmpxchg on 32 bit architectures is very slow. The code use to just use two 32 bit values and make it a 60 bit timestamp where the other 4 bits were used as counters for synchronization. It later came known that the timestamp on 32 bit still need all 64 bits in some cases. So 3 words were created to handle the 64 bits. But issues arised with this: 1. The synchronization logic still only compared the counter with the first two, but not with the third number, so the synchronization could fail unknowingly. 2. A check on discard of an event could race if an event happened between the discard and updating one of the counters. The counter needs to be updated (forcing an absolute timestamp and not to use a delta) before the actual discard happens. * tag 'trace-v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ring-buffer: Test last update in 32bit version of __rb_time_read() ring-buffer: Force absolute timestamp on discard of event tracing: Fix a possible race when disabling buffered events tracing: Fix a warning when allocating buffered events fails tracing: Fix incomplete locking when disabling buffered events tracing: Disable snapshot buffer when stopping instance tracers tracing: Stop current tracer when resizing buffer tracing: Always update snapshot buffer size	2023-12-08 08:44:43 -08:00
Linus Torvalds	8e819a7623	Merge tag 'mm-hotfixes-stable-2023-12-07-18-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "31 hotfixes. Ten of these address pre-6.6 issues and are marked cc:stable. The remainder address post-6.6 issues or aren't considered serious enough to justify backporting" * tag 'mm-hotfixes-stable-2023-12-07-18-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (31 commits) mm/madvise: add cond_resched() in madvise_cold_or_pageout_pte_range() nilfs2: prevent WARNING in nilfs_sufile_set_segment_usage() mm/hugetlb: have CONFIG_HUGETLB_PAGE select CONFIG_XARRAY_MULTI scripts/gdb: fix lx-device-list-bus and lx-device-list-class MAINTAINERS: drop Antti Palosaari highmem: fix a memory copy problem in memcpy_from_folio nilfs2: fix missing error check for sb_set_blocksize call kernel/Kconfig.kexec: drop select of KEXEC for CRASH_DUMP units: add missing header drivers/base/cpu: crash data showing should depends on KEXEC_CORE mm/damon/sysfs-schemes: add timeout for update_schemes_tried_regions scripts/gdb/tasks: fix lx-ps command error mm/Kconfig: make userfaultfd a menuconfig selftests/mm: prevent duplicate runs caused by TEST_GEN_PROGS mm/damon/core: copy nr_accesses when splitting region lib/group_cpus.c: avoid acquiring cpu hotplug lock in group_cpus_evenly checkstack: fix printed address mm/memory_hotplug: fix error handling in add_memory_resource() mm/memory_hotplug: add missing mem_hotplug_lock .mailmap: add a new address mapping for Chester Lin ...	2023-12-08 08:36:23 -08:00

... 14 15 16 17 18 ...

44242 Commits