linux

Author	SHA1	Message	Date
Sebastian Andrzej Siewior	418ec1961c	printk: Adjust mapping for 32bit seq macros Note: This change only applies to 32bit architectures. On 64bit architectures the macros are NOPs. __ulseq_to_u64seq() computes the upper 32 bits of the passed argument value (@ulseq). The upper bits are derived from a base value (@rb_next_seq) in a way that assumes @ulseq represents a 64bit number that is less than or equal to @rb_next_seq. Until now this mapping has been correct for all call sites. However, in a follow-up commit, values of @ulseq will be passed in that are higher than the base value. This requires a change to how the 32bit value is mapped to a 64bit sequence number. Rather than mapping @ulseq such that the base value is the end of a 32bit block, map @ulseq such that the base value is in the middle of a 32bit block. This allows supporting 31 bits before and after the base value, which is deemed acceptable for the console sequence number during runtime. Here is an example to illustrate the previous and new mappings. For a base value (@rb_next_seq) of 2 2000 0000... Before this change the range of possible return values was: 1 2000 0001 to 2 2000 0000 __ulseq_to_u64seq(1fff ffff) => 2 1fff ffff __ulseq_to_u64seq(2000 0000) => 2 2000 0000 __ulseq_to_u64seq(2000 0001) => 1 2000 0001 __ulseq_to_u64seq(9fff ffff) => 1 9fff ffff __ulseq_to_u64seq(a000 0000) => 1 a000 0000 __ulseq_to_u64seq(a000 0001) => 1 a000 0001 After this change the range of possible return values are: 1 a000 0001 to 2 a000 0000 __ulseq_to_u64seq(1fff ffff) => 2 1fff ffff __ulseq_to_u64seq(2000 0000) => 2 2000 0000 __ulseq_to_u64seq(2000 0001) => 2 2000 0001 __ulseq_to_u64seq(9fff ffff) => 2 9fff ffff __ulseq_to_u64seq(a000 0000) => 2 a000 0000 __ulseq_to_u64seq(a000 0001) => 1 a000 0001 [ john.ogness: Rewrite commit message. ] Reported-by: Francesco Dolcini <francesco@dolcini.it> Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240207134103.1357162-3-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-02-07 17:23:17 +01:00
John Ogness	5b73e706f0	printk: nbcon: Relocate 32bit seq macros The macros __seq_to_nbcon_seq() and __nbcon_seq_to_seq() are used to provide support for atomic handling of sequence numbers on 32bit systems. Until now this was only used by nbcon.c, which is why they were located in nbcon.c and include nbcon in the name. In a follow-up commit this functionality is also needed by printk_ringbuffer. Rather than duplicating the functionality, relocate the macros to printk_ringbuffer.h. Also, since the macros will be no longer nbcon-specific, rename them to __u64seq_to_ulseq() and __ulseq_to_u64seq(). This does not result in any functional change. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20240207134103.1357162-2-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>	2024-02-07 17:23:17 +01:00
Peter Hilber	4b7f521229	timekeeping: Evaluate system_counterval_t.cs_id instead of .cs Clocksource pointers can be problematic to obtain for drivers which are not clocksource drivers themselves. In particular, the RFC virtio_rtc driver [1] would require a new helper function to obtain a pointer to the ARM Generic Timer clocksource. The ptp_kvm driver also required a similar workaround. Address this by evaluating the clocksource ID, rather than the clocksource pointer, of struct system_counterval_t. By this, setting the clocksource pointer becomes unneeded, and get_device_system_crosststamp() callers will no longer need to supply clocksource pointers. All relevant clocksource drivers provide the ID, so this change is not changing the behaviour. [1] https://lore.kernel.org/lkml/20231218073849.35294-1-peter.hilber@opensynergy.com/ Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240201010453.2212371-7-peter.hilber@opensynergy.com	2024-02-07 17:05:21 +01:00
Ricardo B. Marliere	49f1ff50d4	clockevents: Make clockevents_subsys const Now that the driver core can properly handle constant struct bus_type, move the clockevents_subsys variable to be a constant structure as well, placing it into read-only memory which can not be modified at runtime. Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20240204-bus_cleanup-time-v1-2-207ec18e24b8@marliere.net	2024-02-07 15:11:24 +01:00
Ricardo B. Marliere	2bc7fc24f9	clocksource: Make clocksource_subsys const Now that the driver core can properly handle constant struct bus_type, move the clocksource_subsys variable to be a constant structure as well, placing it into read-only memory which can not be modified at runtime. Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/r/20240204-bus_cleanup-time-v1-1-207ec18e24b8@marliere.net	2024-02-07 15:11:24 +01:00
Tycho Andersen	0c9bd6bc4b	pidfd: getfd should always report ESRCH if a task is exiting We can get EBADF from pidfd_getfd() if a task is currently exiting, which might be confusing. Let's check PF_EXITING, and just report ESRCH if so. I chose PF_EXITING, because it is set in exit_signals(), which is called before exit_files(). Since ->exit_status is mostly set after exit_files() in exit_notify(), using that still leaves a window open for the race. Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Tycho Andersen <tandersen@netflix.com> Link: https://lore.kernel.org/r/20240206192357.81942-1-tycho@tycho.pizza Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-07 12:09:44 +01:00
Oleg Nesterov	83b290c9e3	pidfd: clone: allow CLONE_THREAD \| CLONE_PIDFD together copy_process() just needs to pass PIDFD_THREAD to __pidfd_prepare() if clone_flags & CLONE_THREAD. We can also add another CLONE_ flag (or perhaps reuse CLONE_DETACHED) to enforce PIDFD_THREAD without CLONE_THREAD. Originally-from: Tycho Andersen <tycho@tycho.pizza> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240205145532.GA28823@redhat.com Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-06 14:39:32 +01:00
Oleg Nesterov	e2e8a142fb	pidfd: exit: kill the no longer used thread_group_exited() It was used by pidfd_poll() but now it has no callers. If it finally finds a modular user we can revert this change, but note that the comment above this helper and the changelog in `38fd525a4c` ("exit: Factor thread_group_exited out of pidfd_poll") are not accurate, thread_group_exited() won't return true if all other threads have passed exit_notify() and are zombies, it returns true only when all other threads are completely gone. Not to mention that it can only work if the task identified by @pid is a thread-group leader. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240205174347.GA31461@redhat.com Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-06 14:02:51 +01:00
Oleg Nesterov	9ed52108f6	pidfd: change do_notify_pidfd() to use __wake_up(poll_to_key(EPOLLIN)) rather than wake_up_all(). This way do_notify_pidfd() won't wakeup the POLLHUP-only waiters which wait for pid_task() == NULL. TODO: - as Christian pointed out, this asks for the new wake_up_all_poll() helper, it can already have other users. - we can probably discriminate the PIDFD_THREAD and non-PIDFD_THREAD waiters, but this needs more work. See https://lore.kernel.org/all/20240205140848.GA15853@redhat.com/ Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240205141348.GA16539@redhat.com Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-06 13:52:51 +01:00
Frederic Weisbecker	dad6a09f31	hrtimer: Report offline hrtimer enqueue The hrtimers migration on CPU-down hotplug process has been moved earlier, before the CPU actually goes to die. This leaves a small window of opportunity to queue an hrtimer in a blind spot, leaving it ignored. For example a practical case has been reported with RCU waking up a SCHED_FIFO task right before the CPUHP_AP_IDLE_DEAD stage, queuing that way a sched/rt timer to the local offline CPU. Make sure such situations never go unnoticed and warn when that happens. Fixes: `5c0930ccaa` ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") Reported-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240129235646.3171983-4-boqun.feng@gmail.com	2024-02-06 10:56:35 +01:00
Kumar Kartikeya Dwivedi	6fceea0fa5	bpf: Transfer RCU lock state between subprog calls Allow transferring an imbalanced RCU lock state between subprog calls during verification. This allows patterns where a subprog call returns with an RCU lock held, or a subprog call releases an RCU lock held by the caller. Currently, the verifier would end up complaining if the RCU lock is not released when processing an exit from a subprog, which is non-ideal if its execution is supposed to be enclosed in an RCU read section of the caller. Instead, simply only check whether we are processing exit for frame#0 and do not complain on an active RCU lock otherwise. We only need to update the check when processing BPF_EXIT insn, as copy_verifier_state is already set up to do the right thing. Suggested-by: David Vernet <void@manifault.com> Tested-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20240205055646.1112186-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-02-05 20:00:14 -08:00
Kumar Kartikeya Dwivedi	a44b1334aa	bpf: Allow calling static subprogs while holding a bpf_spin_lock Currently, calling any helpers, kfuncs, or subprogs except the graph data structure (lists, rbtrees) API kfuncs while holding a bpf_spin_lock is not allowed. One of the original motivations of this decision was to force the BPF programmer's hand into keeping the bpf_spin_lock critical section small, and to ensure the execution time of the program does not increase due to lock waiting times. In addition to this, some of the helpers and kfuncs may be unsafe to call while holding a bpf_spin_lock. However, when it comes to subprog calls, atleast for static subprogs, the verifier is able to explore their instructions during verification. Therefore, it is similar in effect to having the same code inlined into the critical section. Hence, not allowing static subprog calls in the bpf_spin_lock critical section is mostly an annoyance that needs to be worked around, without providing any tangible benefit. Unlike static subprog calls, global subprog calls are not safe to permit within the critical section, as the verifier does not explore them during verification, therefore whether the same lock will be taken again, or unlocked, cannot be ascertained. Therefore, allow calling static subprogs within a bpf_spin_lock critical section, and only reject it in case the subprog linkage is global. Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: David Vernet <void@manifault.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20240204222349.938118-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-02-05 19:58:47 -08:00
Tejun Heo	40911d4457	Merge branch 'for-6.8-fixes' into for-6.9 The for-6.8-fixes commit ae9cc8956944 ("Revert "workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()") also fixes build for Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-05 15:49:47 -10:00
Tejun Heo	aac8a59537	Revert "workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()" This reverts commit `ca10d851b9`. The commit allowed workqueue_apply_unbound_cpumask() to clear __WQ_ORDERED on now removed implicitly ordered workqueues. This was incorrect in that system-wide config change shouldn't break ordering properties of all workqueues. The reason why apply_workqueue_attrs() path was allowed to do so was because it was targeting the specific workqueue - either the workqueue had WQ_SYSFS set or the workqueue user specifically tried to change max_active, both of which indicate that the workqueue doesn't need to be ordered. The implicitly ordered workqueue promotion was removed by the previous commit `3bc1e711c2` ("workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered"). However, it didn't update this path and broke build. Let's revert the commit which was incorrect in the first place which also fixes build. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `3bc1e711c2` ("workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered") Fixes: `ca10d851b9` ("workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()") Cc: stable@vger.kernel.org # v6.6+ Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-05 15:49:06 -10:00
Tejun Heo	3bc1e711c2	workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered `5c0338c687` ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered") automoatically promoted UNBOUND workqueues w/ @max_active==1 to ordered workqueues because UNBOUND workqueues w/ @max_active==1 used to be the way to create ordered workqueues and the new NUMA support broke it. These problems can be subtle and the fact that they can only trigger on NUMA machines made them even more difficult to debug. However, overloading the UNBOUND allocation interface this way creates other issues. It's difficult to tell whether a given workqueue actually needs to be ordered and users that legitimately want a min concurrency level wq unexpectedly gets an ordered one instead. With planned UNBOUND workqueue udpates to improve execution locality and more prevalence of chiplet designs which can benefit from such improvements, this isn't a state we wanna be in forever. There aren't that many UNBOUND w/ @max_active==1 users in the tree and the preceding patches audited all and converted them to alloc_ordered_workqueue() as appropriate. This patch removes the implicit promotion of UNBOUND w/ @max_active==1 workqueues to ordered ones. v2: v1 patch incorrectly dropped !list_empty(&wq->pwqs) condition in apply_workqueue_attrs_locked() which spuriously triggers WARNING and fails workqueue creation. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/oe-lkp/202304251050.45a5df1f-oliver.sang@intel.com	2024-02-05 14:19:10 -10:00
Tejun Heo	7245d24f87	backtracetest: Convert from tasklet to BH workqueue The only generic interface to execute asynchronously in the BH context is tasklet; however, it's marked deprecated and has some design flaws. To replace tasklets, BH workqueue support was recently added. A BH workqueue behaves similarly to regular workqueues except that the queued work items are executed in the BH context. This patch converts backtracetest from tasklet to BH workqueue. - Replace "irq" with "bh" in names and message to better reflect what's happening. - Replace completion usage with a flush_work() call. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com>	2024-02-05 13:22:34 -10:00
Kui-Feng Lee	df9705eaa0	bpf: Remove an unnecessary check. The "i" here is always equal to "btf_type_vlen(t)" since the "for_each_member()" loop never breaks. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240203055119.2235598-1-thinker.li@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>	2024-02-05 10:25:08 -08:00
Waiman Long	8eb17dc1a6	workqueue: Skip __WQ_DESTROYING workqueues when updating global unbound cpumask Skip updating workqueues with __WQ_DESTROYING bit set when updating global unbound cpumask to avoid unnecessary work and other complications. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-05 07:52:22 -10:00
Wang Jinchao	96068b6030	workqueue: fix a typo in comment There should be three, fix it. Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-05 07:48:25 -10:00
Tejun Heo	4f19b8e01e	Revert "workqueue: make wq_subsys const" This reverts commit `d412ace111`. This leads to build failures as it depends on a driver-core commit `32f78abe59` ("driver core: bus: constantify subsys_register() calls"). Let's drop it from wq tree and route it through driver-core tree. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202402051505.kM9Rr3CJ-lkp@intel.com/	2024-02-05 07:19:54 -10:00
Jens Axboe	06b23f92af	block: update cached timestamp post schedule/preemption Mark the task as having a cached timestamp when set assign it, so we can efficiently check if it needs updating post being scheduled back in. This covers both the actual schedule out case, which would've flushed the plug, and the preemption case which doesn't touch the plugged requests (for many reasons, one of them being then we'd need to have preemption disabled around plug state manipulation). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-02-05 10:07:34 -07:00
Nikhil V	8bc2973635	PM: hibernate: Add support for LZ4 compression for hibernation Extend the support for LZ4 compression to be used with hibernation. The main idea is that different compression algorithms have different characteristics and hibernation may benefit when it uses any of these algorithms: a default algorithm, having higher compression rate but is slower(compression/decompression) and a secondary algorithm, that is faster(compression/decompression) but has lower compression rate. LZ4 algorithm has better decompression speeds over LZO. This reduces the hibernation image restore time. As per test results: LZO LZ4 Size before Compression(bytes) 682696704 682393600 Size after Compression(bytes) 146502402 155993547 Decompression Rate 335.02 MB/s 501.05 MB/s Restore time 4.4s 3.8s LZO is the default compression algorithm used for hibernation. Enable CONFIG_HIBERNATION_COMP_LZ4 to set the default compressor as LZ4. Signed-off-by: Nikhil V <quic_nprakash@quicinc.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-05 14:30:35 +01:00
Nikhil V	a06c6f5d3c	PM: hibernate: Move to crypto APIs for LZO compression Currently for hibernation, LZO is the only compression algorithm available and uses the existing LZO library calls. However, there is no flexibility to switch to other algorithms which provides better results. The main idea is that different compression algorithms have different characteristics and hibernation may benefit when it uses alternate algorithms. By moving to crypto based APIs, it lays a foundation to use other compression algorithms for hibernation. There are no functional changes introduced by this approach. Signed-off-by: Nikhil V <quic_nprakash@quicinc.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-05 14:28:54 +01:00
Nikhil V	89a807625f	PM: hibernate: Rename lzo* to make it generic Renaming lzo* to generic names, except for lzo_xxx() APIs. This is used in the next patch where we move to crypto based APIs for compression. There are no functional changes introduced by this approach. Signed-off-by: Nikhil V <quic_nprakash@quicinc.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2024-02-05 14:28:54 +01:00
Rafael J. Wysocki	a6d38e991d	PM: sleep: stats: Use locking in dpm_save_failed_dev() Because dpm_save_failed_dev() may be called simultaneously by multiple failing device PM functions, the state of the suspend_stats fields updated by it may become inconsistent. Prevent that from happening by using a lock in dpm_save_failed_dev(). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>	2024-02-05 14:28:54 +01:00
Rafael J. Wysocki	9ff544fa5f	PM: sleep: stats: Define suspend_stats next to the code using it It is not necessary to define struct suspend_stats in a header file and the suspend_stats variable in the core device system-wide PM code. They both can be defined in kernel/power/main.c, next to the sysfs and debugfs code accessing suspend_stats, which can be static. Modify the code in question in accordance with the above observation and replace the static inline functions manipulating suspend_stats with regular ones defined in kernel/power/main.c. While at it, move the enum suspend_stat_step to the end of suspend.h which is a more suitable place for it. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>	2024-02-05 14:28:19 +01:00
Rafael J. Wysocki	2231f78d3e	PM: sleep: stats: Use unsigned int for success and failure counters Change the type of the "success" and "fail" fields in struct suspend_stats to unsigned int, because they cannot be negative. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>	2024-02-05 14:26:27 +01:00
Rafael J. Wysocki	b730bab0b9	PM: sleep: stats: Use an array of step failure counters Instead of using a set of individual struct suspend_stats fields representing suspend step failure counters, use an array of counters indexed by enum suspend_stat_step for this purpose, which allows dpm_save_failed_step() to increment the appropriate counter automatically, so that its callers don't need to do that directly. It also allows suspend_stats_show() to carry out a loop over the counters array to print their values. Because the counters cannot become negative, use unsigned int for representing them. The only user-observable impact of this change is a different ordering of entries in the suspend_stats debugfs file which is not expected to matter. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>	2024-02-05 14:25:56 +01:00
Rafael J. Wysocki	bc88528cda	PM: sleep: stats: Use array of suspend step names Replace suspend_step_name() in the suspend statistics code with an array of suspend step names which has fewer lines of code and less overhead. While at it, remove two unnecessary line breaks in suspend_stats_show() and adjust some white space in there to the kernel coding style for a more consistent code layout. No intentional functional impact. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Stanislaw Gruszka <stanislaw.gruszka@linux.intel.com> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>	2024-02-05 14:21:24 +01:00
Tejun Heo	4cb1ef6460	workqueue: Implement BH workqueues to eventually replace tasklets The only generic interface to execute asynchronously in the BH context is tasklet; however, it's marked deprecated and has some design flaws such as the execution code accessing the tasklet item after the execution is complete which can lead to subtle use-after-free in certain usage scenarios and less-developed flush and cancel mechanisms. This patch implements BH workqueues which share the same semantics and features of regular workqueues but execute their work items in the softirq context. As there is always only one BH execution context per CPU, none of the concurrency management mechanisms applies and a BH workqueue can be thought of as a convenience wrapper around softirq. Except for the inability to sleep while executing and lack of max_active adjustments, BH workqueues and work items should behave the same as regular workqueues and work items. Currently, the execution is hooked to tasklet[_hi]. However, the goal is to convert all tasklet users over to BH workqueues. Once the conversion is complete, tasklet can be removed and BH workqueues can directly take over the tasklet softirqs. system_bh[_highpri]_wq are added. As queue-wide flushing doesn't exist in tasklet, all existing tasklet users should be able to use the system BH workqueues without creating their own workqueues. v3: - Add missing interrupt.h include. v2: - Instead of using tasklets, hook directly into its softirq action functions - tasklet[_hi]_action(). This is slightly cheaper and closer to the eventual code structure we want to arrive at. Suggested by Lai. - Lai also pointed out several places which need NULL worker->task handling or can use clarification. Updated. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com Tested-by: Allen Pais <allen.lkml@gmail.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-02-04 11:28:06 -10:00
Tejun Heo	2fcdb1b444	workqueue: Factor out init_cpu_worker_pool() Factor out init_cpu_worker_pool() from workqueue_init_early(). This is pure reorganization in preparation of BH workqueue support. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Allen Pais <allen.lkml@gmail.com>	2024-02-04 11:28:06 -10:00
Tejun Heo	c35aea39d1	workqueue: Update lock debugging code These changes are in preparation of BH workqueue which will execute work items from BH context. - Update lock and RCU depth checks in process_one_work() so that it remembers and checks against the starting depths and prints out the depth changes. - Factor out lockdep annotations in the flush paths into touch_{wq\|work}_lockdep_map(). The work->lockdep_map touching is moved from __flush_work() to its callee - start_flush_work(). This brings it closer to the wq counterpart and will allow testing the associated wq's flags which will be needed to support BH workqueues. This is not expected to cause any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Allen Pais <allen.lkml@gmail.com>	2024-02-04 11:28:06 -10:00
Ricardo B. Marliere	d412ace111	workqueue: make wq_subsys const Now that the driver core can properly handle constant struct bus_type, move the wq_subsys variable to be a constant structure as well, placing it into read-only memory which can not be modified at runtime. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-and-reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-04 11:23:25 -10:00
Tejun Heo	c70e1779b7	workqueue: Fix pwq->nr_in_flight corruption in try_to_grab_pending() `dd6c3c5441` ("workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling") relocated pwq_dec_nr_in_flight() after set_work_pool_and_keep_pending(). However, the latter destroys information contained in work->data that's needed by pwq_dec_nr_in_flight() including the flush color. With flush color destroyed, flush_workqueue() can stall easily when mixed with cancel_work*() usages. This is easily triggered by running xfstests generic/001 test on xfs: INFO: task umount:6305 blocked for more than 122 seconds. ... task:umount state:D stack:13008 pid:6305 tgid:6305 ppid:6301 flags:0x00004000 Call Trace: <TASK> __schedule+0x2f6/0xa20 schedule+0x36/0xb0 schedule_timeout+0x20b/0x280 wait_for_completion+0x8a/0x140 __flush_workqueue+0x11a/0x3b0 xfs_inodegc_flush+0x24/0xf0 xfs_unmountfs+0x14/0x180 xfs_fs_put_super+0x3d/0x90 generic_shutdown_super+0x7c/0x160 kill_block_super+0x1b/0x40 xfs_kill_sb+0x12/0x30 deactivate_locked_super+0x35/0x90 deactivate_super+0x42/0x50 cleanup_mnt+0x109/0x170 __cleanup_mnt+0x12/0x20 task_work_run+0x60/0x90 syscall_exit_to_user_mode+0x146/0x150 do_syscall_64+0x5d/0x110 entry_SYSCALL_64_after_hwframe+0x6c/0x74 Fix it by stashing work_data before calling set_work_pool_and_keep_pending() and using the stashed value for pwq_dec_nr_in_flight(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Chandan Babu R <chandanbabu@kernel.org> Link: http://lkml.kernel.org/r/87o7cxeehy.fsf@debian-BULLSEYE-live-builder-AMD64 Fixes: `dd6c3c5441` ("workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling")	2024-02-04 11:14:49 -10:00
Andrii Nakryiko	1eb986746a	bpf: don't emit warnings intended for global subprogs for static subprogs When btf_prepare_func_args() was generalized to handle both static and global subprogs, a few warnings/errors that are meant only for global subprog cases started to be emitted for static subprogs, where they are sort of expected and irrelavant. Stop polutting verifier logs with irrelevant scary-looking messages. Fixes: `e26080d0da` ("bpf: prepare btf_prepare_func_args() for handling static subprogs") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240202190529.2374377-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-02-02 18:08:59 -08:00
Andrii Nakryiko	8f13c34087	bpf: handle trusted PTR_TO_BTF_ID_OR_NULL in argument check logic Add PTR_TRUSTED \| PTR_MAYBE_NULL modifiers for PTR_TO_BTF_ID to check_reg_type() to support passing trusted nullable PTR_TO_BTF_ID registers into global functions accepting `__arg_trusted __arg_nullable` arguments. This hasn't been caught earlier because tests were either passing known non-NULL PTR_TO_BTF_ID registers or known NULL (SCALAR) registers. When utilizing this functionality in complicated real-world BPF application that passes around PTR_TO_BTF_ID_OR_NULL, it became apparent that verifier rejects valid case because check_reg_type() doesn't handle this case explicitly. Existing check_reg_type() logic is already anticipating this combination, so we just need to explicitly list this combo in the switch statement. Fixes: `e2b3c4ff5d` ("bpf: add __arg_trusted global func arg tag") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240202190529.2374377-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2024-02-02 18:08:58 -08:00
Linus Torvalds	56897d5188	Merge tag 'trace-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing and eventfs fixes from Steven Rostedt: - Fix the return code for ring_buffer_poll_wait() It was returing a -EINVAL instead of EPOLLERR. - Zero out the tracefs_inode so that all fields are initialized. The ti->private could have had stale data, but instead of just initializing it to NULL, clear out the entire structure when it is allocated. - Fix a crash in timerlat The hrtimer was initialized at read and not open, but is canceled at close. If the file was opened and never read the close will pass a NULL pointer to hrtime_cancel(). - Rewrite of eventfs. Linus wrote a patch series to remove the dentry references in the eventfs_inode and to use ref counting and more of proper VFS interfaces to make it work. - Add warning to put_ei() if ei is not set to free. That means something is about to free it when it shouldn't. - Restructure the eventfs_inode to make it more compact, and remove the unused llist field. - Remove the fsnotify() funtions for when the inodes were being created in the lookup code. It doesn't make sense to notify about creation just because something is being looked up. - The inode hard link count was not accurate. It was being updated when a file was looked up. The inodes of directories were updating their parent inode hard link count every time the inode was created. That means if memory reclaim cleaned a stale directory inode and the inode was lookup up again, it would increment the parent inode again as well. Al Viro said to just have all eventfs directories have a hard link count of 1. That tells user space not to trust it. tag 'trace-v6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Keep all directory links at 1 eventfs: Remove fsnotify*() functions from lookup() eventfs: Restructure eventfs_inode structure to be more condensed eventfs: Warn if an eventfs_inode is freed without is_freed being set tracing/timerlat: Move hrtimer_init to timerlat_fd open() eventfs: Get rid of dentry pointers without refcounts eventfs: Clean up dentry ops and add revalidate function eventfs: Remove unused d_parent pointer field tracefs: dentry lookup crapectomy tracefs: Avoid using the ei->dentry pointer unnecessarily eventfs: Initialize the tracefs inode properly tracefs: Zero out the tracefs_inode when allocating it ring-buffer: Clean ring_buffer_poll_wait() error return	2024-02-02 15:32:58 -08:00
Eduard Zingerman	6efbde200b	bpf: Handle scalar spill vs all MISC in stacksafe() When check_stack_read_fixed_off() reads value from an spi all stack slots of which are set to STACK_{MISC,INVALID}, the destination register is set to unbound SCALAR_VALUE. Exploit this fact by allowing stacksafe() to use a fake unbound scalar register to compare 'mmmm mmmm' stack value in old state vs spilled 64-bit scalar in current state and vice versa. Veristat results after this patch show some gains: ./veristat -C -e file,prog,states -f 'states_pct>10' not-opt after File Program States (DIFF) ----------------------- --------------------- --------------- bpf_overlay.o tail_rev_nodeport_lb4 -45 (-15.85%) bpf_xdp.o tail_lb_ipv4 -541 (-19.57%) pyperf100.bpf.o on_event -680 (-10.42%) pyperf180.bpf.o on_event -2164 (-19.62%) pyperf600.bpf.o on_event -9799 (-24.84%) strobemeta.bpf.o on_event -9157 (-65.28%) xdp_synproxy_kern.bpf.o syncookie_tc -54 (-19.29%) xdp_synproxy_kern.bpf.o syncookie_xdp -74 (-24.50%) Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240127175237.526726-6-maxtram95@gmail.com	2024-02-02 13:22:14 -08:00
Maxim Mikityanskiy	c1e6148cb4	bpf: Preserve boundaries and track scalars on narrowing fill When the width of a fill is smaller than the width of the preceding spill, the information about scalar boundaries can still be preserved, as long as it's coerced to the right width (done by coerce_reg_to_size). Even further, if the actual value fits into the fill width, the ID can be preserved as well for further tracking of equal scalars. Implement the above improvements, which makes narrowing fills behave the same as narrowing spills and MOVs between registers. Two tests are adjusted to accommodate for endianness differences and to take into account that it's now allowed to do a narrowing fill from the least significant bits. reg_bounds_sync is added to coerce_reg_to_size to correctly adjust umin/umax boundaries after the var_off truncation, for example, a 64-bit value 0xXXXXXXXX00000000, when read as a 32-bit, gets umin = 0, umax = 0xFFFFFFFF, var_off = (0x0; 0xffffffff00000000), which needs to be synced down to umax = 0, otherwise reg_bounds_sanity_check doesn't pass. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20240127175237.526726-4-maxtram95@gmail.com	2024-02-02 13:22:14 -08:00
Maxim Mikityanskiy	e67ddd9b1c	bpf: Track spilled unbounded scalars Support the pattern where an unbounded scalar is spilled to the stack, then boundary checks are performed on the src register, after which the stack frame slot is refilled into a register. Before this commit, the verifier didn't treat the src register and the stack slot as related if the src register was an unbounded scalar. The register state wasn't copied, the id wasn't preserved, and the stack slot was marked as STACK_MISC. Subsequent boundary checks on the src register wouldn't result in updating the boundaries of the spilled variable on the stack. After this commit, the verifier will preserve the bond between src and dst even if src is unbounded, which permits to do boundary checks on src and refill dst later, still remembering its boundaries. Such a pattern is sometimes generated by clang when compiling complex long functions. One test is adjusted to reflect that now unbounded scalars are tracked. Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20240127175237.526726-2-maxtram95@gmail.com	2024-02-02 13:22:14 -08:00
Christophe Leroy	315df9c476	modules: Remove #ifdef CONFIG_STRICT_MODULE_RWX around rodata_enabled Now that rodata_enabled is declared at all time, the #ifdef CONFIG_STRICT_MODULE_RWX can be removed. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>	2024-02-02 10:21:25 -08:00
Oleg Nesterov	a1c6d5439f	pid: kill the obsolete PIDTYPE_PID code in transfer_pid() transfer_pid() must be never called with pid == PIDTYPE_PID, new_leader->thread_pid should be changed by exchange_tids(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240202131255.GA26025@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-02 14:57:53 +01:00
Oleg Nesterov	43f0df54c9	pidfd_poll: report POLLHUP when pid_task() == NULL Add another wake_up_all(wait_pidfd) into __change_pid() and change pidfd_poll() to include EPOLLHUP if task == NULL. This allows to wait until the target process/thread is reaped. TODO: change do_notify_pidfd() to use the keyed wakeups. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240202131226.GA26018@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-02 14:57:53 +01:00
Oleg Nesterov	64bef697d3	pidfd: implement PIDFD_THREAD flag for pidfd_open() With this flag: - pidfd_open() doesn't require that the target task must be a thread-group leader - pidfd_poll() succeeds when the task exits and becomes a zombie (iow, passes exit_notify()), even if it is a leader and thread-group is not empty. This means that the behaviour of pidfd_poll(PIDFD_THREAD, pid-of-group-leader) is not well defined if it races with exec() from its sub-thread; pidfd_poll() can succeed or not depending on whether pidfd_task_exited() is called before or after exchange_tids(). Perhaps we can improve this behaviour later, pidfd_poll() can probably take sig->group_exec_task into account. But this doesn't really differ from the case when the leader exits before other threads (so pidfd_poll() succeeds) and then another thread execs and pidfd_poll() will block again. thread_group_exited() is no longer used, perhaps it can die. Co-developed-by: Tycho Andersen <tycho@tycho.pizza> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240131132602.GA23641@redhat.com Tested-by: Tycho Andersen <tandersen@netflix.com> Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-02 13:12:28 +01:00
Oleg Nesterov	21e25205d7	pidfd: don't do_notify_pidfd() if !thread_group_empty() do_notify_pidfd() makes no sense until the whole thread group exits, change do_notify_parent() to check thread_group_empty(). This avoids the unnecessary do_notify_pidfd() when tsk is not a leader, or it exits before other threads, or it has a ptraced EXIT_ZOMBIE sub-thread. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240127132407.GA29136@redhat.com Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-02 13:12:28 +01:00
Oleg Nesterov	cdefbf2324	pidfd: cleanup the usage of __pidfd_prepare's flags - make pidfd_create() static. - Don't pass O_RDWR \| O_CLOEXEC to __pidfd_prepare() in copy_process(), __pidfd_prepare() adds these flags unconditionally. - Kill the flags check in __pidfd_prepare(). sys_pidfd_open() checks the flags itself, all other users of pidfd_prepare() pass flags = 0. If we need a sanity check for those other in kernel users then WARN_ON_ONCE(flags & ~PIDFD_NONBLOCK) makes more sense. - Don't pass O_RDWR to get_unused_fd_flags(), it ignores everything except O_CLOEXEC. - Don't pass O_CLOEXEC to anon_inode_getfile(), it ignores everything except O_ACCMODE \| O_NONBLOCK. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240125161734.GA778@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-02 13:12:28 +01:00
Wang Jinchao	b639585e71	fork: Using clone_flags for legacy clone check In the current implementation of clone(), there is a line that initializes `u64 clone_flags = args->flags` at the top. This means that there is no longer a need to use args->flags for the legacy clone check. Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com> Link: https://lore.kernel.org/r/202401311054+0800-wangjinchao@xfusion.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-02-02 13:12:28 +01:00
Jakub Kicinski	cf244463a2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Cross-merge networking fixes after downstream PR. No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2024-02-01 15:12:37 -08:00
Jingzi Meng	09ce61e27d	cap_syslog: remove CAP_SYS_ADMIN when dmesg_restrict CAP_SYSLOG was separated from CAP_SYS_ADMIN and introduced in Linux 2.6.37 (2010-11). For a long time, certain syslog actions required CAP_SYS_ADMIN or CAP_SYSLOG. Maybe it’s time to officially remove CAP_SYS_ADMIN for more fine-grained control. CAP_SYS_ADMIN was once removed but added back for backwards compatibility reasons. In commit `38ef4c2e43` ("syslog: check cap_syslog when dmesg_restrict") (2010-12), CAP_SYS_ADMIN was no longer needed. And in commit `ee24aebffb` ("cap_syslog: accept CAP_SYS_ADMIN for now") (2011-02), it was accepted again. Since then, CAP_SYS_ADMIN has been preserved. Now that almost 13 years have passed, the legacy application may have had enough time to be updated. Signed-off-by: Jingzi Meng <mengjingzi@iie.ac.cn> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20240105062007.26965-1-mengjingzi@iie.ac.cn Signed-off-by: Kees Cook <keescook@chromium.org>	2024-02-01 10:04:58 -08:00
Matt Bobrowski	1581e5118e	bpf: Minor clean-up to sleepable_lsm_hooks BTF set There's already one main CONFIG_SECURITY_NETWORK ifdef block within the sleepable_lsm_hooks BTF set. Consolidate this duplicated ifdef block as there's no need for it and all things guarded by it should remain in one place in this specific context. Signed-off-by: Matt Bobrowski <mattbobrowski@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/Zbt1smz43GDMbVU3@google.com	2024-02-01 18:37:45 +01:00

... 8 9 10 11 12 ...

44306 Commits