linux

Author	SHA1	Message	Date
Paul E. McKenney	660349ac79	srcu: Move ->sda_is_static from srcu_struct to srcu_usage This commit moves the ->sda_is_static field from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2023-04-04 08:36:28 -07:00
Paul E. McKenney	3b46679c62	srcu: Move heuristics fields from srcu_struct to srcu_usage This commit moves the ->srcu_size_jiffies, ->srcu_n_lock_retries, and ->srcu_n_exp_nodelay fields from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2023-04-04 08:36:23 -07:00
Paul E. McKenney	03200b5ca3	srcu: Move grace-period fields from srcu_struct to srcu_usage This commit moves the ->srcu_gp_seq, ->srcu_gp_seq_needed, ->srcu_gp_seq_needed_exp, ->srcu_gp_start, and ->srcu_last_gp_end fields from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2023-04-04 08:36:19 -07:00
Paul E. McKenney	e3a6ab25cf	srcu: Move ->srcu_gp_mutex from srcu_struct to srcu_usage This commit moves the ->srcu_gp_mutex field from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2023-04-04 08:36:14 -07:00
Paul E. McKenney	b3fb11f7e9	srcu: Move ->lock from srcu_struct to srcu_usage This commit moves the ->lock field from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2023-04-04 08:36:07 -07:00
Paul E. McKenney	0839ade94b	srcu: Move ->lock initialization after srcu_usage allocation Currently, both __init_srcu_struct() in CONFIG_DEBUG_LOCK_ALLOC=y kernels and init_srcu_struct() in CONFIG_DEBUG_LOCK_ALLOC=n kernel initialize the srcu_struct structure's ->lock before the srcu_usage structure has been allocated. This of course prevents the ->lock from being moved to the srcu_usage structure, so this commit moves the initialization into the init_srcu_struct_fields() after the srcu_usage structure has been allocated. Cc: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2023-04-04 08:36:01 -07:00
Paul E. McKenney	574dc1a7ef	srcu: Move ->srcu_cb_mutex from srcu_struct to srcu_usage This commit moves the ->srcu_cb_mutex field from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2023-04-04 08:35:56 -07:00
Paul E. McKenney	a0d8cbd382	srcu: Move ->srcu_size_state from srcu_struct to srcu_usage This commit moves the ->srcu_size_state field from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2023-04-04 08:35:50 -07:00
Paul E. McKenney	208f41b131	srcu: Move ->level from srcu_struct to srcu_usage This commit moves the ->level[] array from the srcu_struct structure to the srcu_usage structure to reduce the size of the former in order to improve cache locality. Suggested-by: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2023-04-04 08:35:44 -07:00
Paul E. McKenney	95433f7263	srcu: Begin offloading srcu_struct fields to srcu_update The current srcu_struct structure is on the order of 200 bytes in size (depending on architecture and .config), which is much better than the old-style 26K bytes, but still all too inconvenient when one is trying to achieve good cache locality on a fastpath involving SRCU readers. However, only a few fields in srcu_struct are used by SRCU readers. The remaining fields could be offloaded to a new srcu_update structure, thus shrinking the srcu_struct structure down to a few tens of bytes. This commit begins this noble quest, a quest that is complicated by open-coded initialization of the srcu_struct within the srcu_notifier_head structure. This complication is addressed by updating the srcu_notifier_head structure's open coding, given that there does not appear to be a straightforward way of abstracting that initialization. This commit moves only the ->node pointer to srcu_update. Later commits will move additional fields. [ paulmck: Fold in qiang1.zhang@intel.com's memory-leak fix. ] Link: https://lore.kernel.org/all/20230320055751.4120251-1-qiang1.zhang@intel.com/ Suggested-by: Christoph Hellwig <hch@lst.de> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl> Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2023-04-04 08:35:34 -07:00
Paul E. McKenney	f4d01a2593	srcu: Use static init for statically allocated in-module srcu_struct Further shrinking the srcu_struct structure is eased by requiring that in-module srcu_struct structures rely more heavily on static initialization. In particular, this preserves the property that a module-load-time srcu_struct initialization can fail only due to memory-allocation failure of the per-CPU srcu_data structures. It might also slightly improve robustness by keeping the number of memory allocations that must succeed down percpu_alloc() call. This is in preparation for splitting an srcu_usage structure out of the srcu_struct structure. [ paulmck: Fold in qiang1.zhang@intel.com feedback. ] Cc: Christoph Hellwig <hch@lst.de> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Tested-by: "Zhang, Qiang1" <qiang1.zhang@intel.com> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2023-04-04 08:35:28 -07:00
Paul E. McKenney	2b4be54830	rcu-tasks: Fix warning for unused tasks_rcu_exit_srcu The tasks_rcu_exit_srcu variable is used only by kernels built with CONFIG_TASKS_RCU=y, but is defined for all kernesl with CONFIG_TASKS_RCU_GENERIC=y. Therefore, in kernels built with CONFIG_TASKS_RCU_GENERIC=y but CONFIG_TASKS_RCU=n, this gives a "defined but not used" warning. This commit therefore moves this variable under CONFIG_TASKS_RCU. Link: https://lore.kernel.org/oe-kbuild-all/202303191536.XzMSyzTl-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2023-04-04 08:34:47 -07:00
Dave Marchevsky	f6a6a5a976	bpf: Fix struct_meta lookup for bpf_obj_free_fields kfunc call bpf_obj_drop_impl has a void return type. In check_kfunc_call, the "else if" which sets insn_aux->kptr_struct_meta for bpf_obj_drop_impl is surrounded by a larger if statement which checks btf_type_is_ptr. As a result: * The bpf_obj_drop_impl-specific code will never execute * The btf_struct_meta input to bpf_obj_drop is always NULL * __bpf_obj_drop_impl will always see a NULL btf_record when called from BPF program, and won't call bpf_obj_free_fields * program-allocated kptrs which have fields that should be cleaned up by bpf_obj_free_fields may instead leak resources This patch adds a btf_type_is_void branch to the larger if and moves special handling for bpf_obj_drop_impl there, fixing the issue. Fixes: `ac9f06050a` ("bpf: Introduce bpf_obj_drop") Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230403200027.2271029-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-03 14:54:21 -07:00
Greg Kroah-Hartman	2243acd50a	driver core: class: remove struct class_interface * from callbacks The add_dev and remove_dev callbacks in struct class_interface currently pass in a pointer back to the class_interface structure that is calling them, but none of the callback implementations actually use this pointer as it is pointless (the structure is known, the driver passed it in in the first place if it is really needed again.) So clean this up and just remove the pointer from the callbacks and fix up all callback functions. Cc: Jean Delvare <jdelvare@suse.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Kurt Schwemmer <kurt.schwemmer@microsemi.com> Cc: Jon Mason <jdmason@kudzu.us> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Allen Hubbe <allenbh@gmail.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Alexandre Bounine <alex.bou9@gmail.com> Cc: "James E.J. Bottomley" <jejb@linux.ibm.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Doug Gilbert <dgilbert@interlog.com> Cc: John Stultz <jstultz@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Stephen Boyd <sboyd@kernel.org> Cc: Hans de Goede <hdegoede@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Wang Weiyang <wangweiyang2@huawei.com> Cc: Yang Yingliang <yangyingliang@huawei.com> Cc: Jakob Koschel <jakobkoschel@gmail.com> Cc: Cai Xinchen <caixinchen1@huawei.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Acked-by: Logan Gunthorpe <logang@deltatee.com> Link: https://lore.kernel.org/r/2023040250-pushover-platter-509c@gregkh Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-04-03 21:42:52 +02:00
Daniel Bristot de Oliveira	d3cba7f02c	tracing/osnoise: Fix notify new tracing_max_latency osnoise/timerlat tracers are reporting new max latency on instances where the tracing is off, creating inconsistencies between the max reported values in the trace and in the tracing_max_latency. Thus only report new tracing_max_latency on active tracing instances. Link: https://lkml.kernel.org/r/ecd109fde4a0c24ab0f00ba1e9a144ac19a91322.1680104184.git.bristot@kernel.org Cc: stable@vger.kernel.org Fixes: `dae181349f` ("tracing/osnoise: Support a list of trace_array *tr") Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-04-03 11:52:46 -04:00
Daniel Bristot de Oliveira	b9f451a902	tracing/timerlat: Notify new max thread latency timerlat is not reporting a new tracing_max_latency for the thread latency. The reason is that it is not calling notify_new_max_latency() function after the new thread latency is sampled. Call notify_new_max_latency() after computing the thread latency. Link: https://lkml.kernel.org/r/16e18d61d69073d0192ace07bf61e405cca96e9c.1680104184.git.bristot@kernel.org Cc: stable@vger.kernel.org Fixes: `dae181349f` ("tracing/osnoise: Support a list of trace_array *tr") Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-04-03 11:52:32 -04:00
Zheng Yejian	6455b6163d	ring-buffer: Fix race while reader and writer are on the same page When user reads file 'trace_pipe', kernel keeps printing following logs that warn at "cpu_buffer->reader_page->read > rb_page_size(reader)" in rb_get_reader_page(). It just looks like there's an infinite loop in tracing_read_pipe(). This problem occurs several times on arm64 platform when testing v5.10 and below. Call trace: rb_get_reader_page+0x248/0x1300 rb_buffer_peek+0x34/0x160 ring_buffer_peek+0xbc/0x224 peek_next_entry+0x98/0xbc __find_next_entry+0xc4/0x1c0 trace_find_next_entry_inc+0x30/0x94 tracing_read_pipe+0x198/0x304 vfs_read+0xb4/0x1e0 ksys_read+0x74/0x100 __arm64_sys_read+0x24/0x30 el0_svc_common.constprop.0+0x7c/0x1bc do_el0_svc+0x2c/0x94 el0_svc+0x20/0x30 el0_sync_handler+0xb0/0xb4 el0_sync+0x160/0x180 Then I dump the vmcore and look into the problematic per_cpu ring_buffer, I found that tail_page/commit_page/reader_page are on the same page while reader_page->read is obviously abnormal: tail_page == commit_page == reader_page == { .write = 0x100d20, .read = 0x8f9f4805, // Far greater than 0xd20, obviously abnormal!!! .entries = 0x10004c, .real_end = 0x0, .page = { .time_stamp = 0x857257416af0, .commit = 0xd20, // This page hasn't been full filled. // .data[0...0xd20] seems normal. } } The root cause is most likely the race that reader and writer are on the same page while reader saw an event that not fully committed by writer. To fix this, add memory barriers to make sure the reader can see the content of what is committed. Since commit `a0fcaaed0c` ("ring-buffer: Fix race between reset page and reading page") has added the read barrier in rb_get_reader_page(), here we just need to add the write barrier. Link: https://lore.kernel.org/linux-trace-kernel/20230325021247.2923907-1-zhengyejian1@huawei.com Cc: stable@vger.kernel.org Fixes: `77ae365eca` ("ring-buffer: make lockless") Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-04-03 11:51:38 -04:00
Tze-nan Wu	4ccf11c4e8	tracing/synthetic: Fix races on freeing last_cmd Currently, the "last_cmd" variable can be accessed by multiple processes asynchronously when multiple users manipulate synthetic_events node at the same time, it could lead to use-after-free or double-free. This patch add "lastcmd_mutex" to prevent "last_cmd" from being accessed asynchronously. ================================================================ It's easy to reproduce in the KASAN environment by running the two scripts below in different shells. script 1: while : do echo -n -e '\x88' > /sys/kernel/tracing/synthetic_events done script 2: while : do echo -n -e '\xb0' > /sys/kernel/tracing/synthetic_events done ================================================================ double-free scenario: process A process B ------------------- --------------- 1.kstrdup last_cmd 2.free last_cmd 3.free last_cmd(double-free) ================================================================ use-after-free scenario: process A process B ------------------- --------------- 1.kstrdup last_cmd 2.free last_cmd 3.tracing_log_err(use-after-free) ================================================================ Appendix 1. KASAN report double-free: BUG: KASAN: double-free in kfree+0xdc/0x1d4 Free of addr *** by task sh/4879 Call trace: ... kfree+0xdc/0x1d4 create_or_delete_synth_event+0x60/0x1e8 trace_parse_run_command+0x2bc/0x4b8 synth_events_write+0x20/0x30 vfs_write+0x200/0x830 ... Allocated by task 4879: ... kstrdup+0x5c/0x98 create_or_delete_synth_event+0x6c/0x1e8 trace_parse_run_command+0x2bc/0x4b8 synth_events_write+0x20/0x30 vfs_write+0x200/0x830 ... Freed by task 5464: ... kfree+0xdc/0x1d4 create_or_delete_synth_event+0x60/0x1e8 trace_parse_run_command+0x2bc/0x4b8 synth_events_write+0x20/0x30 vfs_write+0x200/0x830 ... ================================================================ Appendix 2. KASAN report use-after-free: BUG: KASAN: use-after-free in strlen+0x5c/0x7c Read of size 1 at addr *** by task sh/5483 sh: CPU: 7 PID: 5483 Comm: sh ... __asan_report_load1_noabort+0x34/0x44 strlen+0x5c/0x7c tracing_log_err+0x60/0x444 create_or_delete_synth_event+0xc4/0x204 trace_parse_run_command+0x2bc/0x4b8 synth_events_write+0x20/0x30 vfs_write+0x200/0x830 ... Allocated by task 5483: ... kstrdup+0x5c/0x98 create_or_delete_synth_event+0x80/0x204 trace_parse_run_command+0x2bc/0x4b8 synth_events_write+0x20/0x30 vfs_write+0x200/0x830 ... Freed by task 5480: ... kfree+0xdc/0x1d4 create_or_delete_synth_event+0x74/0x204 trace_parse_run_command+0x2bc/0x4b8 synth_events_write+0x20/0x30 vfs_write+0x200/0x830 ... Link: https://lore.kernel.org/linux-trace-kernel/20230321110444.1587-1-Tze-nan.Wu@mediatek.com Fixes: `27c888da98` ("tracing: Remove size restriction on synthetic event cmd error logging") Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Cc: "Tom Zanussi" <zanussi@kernel.org> Signed-off-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-04-03 11:51:12 -04:00
Stanislav Kinsburskii	bee43904dc	printk: Remove obsoleted check for non-existent "user" object The original check for non-null "user" object was introduced by commit `e11fea92e1` ("kmsg: export printk records to the /dev/kmsg interface") when "user" could be NULL if /dev/ksmg was opened for writing. Subsequent change `750afe7bab` ("printk: add kernel parameter to control writes to /dev/kmsg") made "user" context required for files opened for write, but didn't remove now redundant checks for it to be non-NULL. This patch removes the dead code while preserving the current logic. Signed-off-by: Stanislav Kinsburskii <stanislav.kinsburski@gmail.com> CC: Petr Mladek <pmladek@suse.com> CC: Sergey Senozhatsky <senozhatsky@chromium.org> CC: Steven Rostedt <rostedt@goodmis.org> CC: John Ogness <john.ogness@linutronix.de> CC: linux-kernel@vger.kernel.org Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/167929571877.2810.9926967619100618792.stgit@skinsburskii.localdomain	2023-04-03 12:05:17 +02:00
Christian Brauner	ca7707f543	fork: use pidfd_prepare() Stop open-coding get_unused_fd_flags() and anon_inode_getfile(). That's brittle just for keeping the flags between both calls in sync. Use the dedicated helper. Message-Id: <20230327-pidfd-file-api-v1-2-5c0e9a3158e4@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-04-03 11:16:57 +02:00
Christian Brauner	6ae930d9db	pid: add pidfd_prepare() Add a new helper that allows to reserve a pidfd and allocates a new pidfd file that stashes the provided struct pid. This will allow us to remove places that either open code this function or that call pidfd_create() but then have to call close_fd() because there are still failure points after pidfd_create() has been called. Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230327-pidfd-file-api-v1-1-5c0e9a3158e4@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-04-03 11:16:56 +02:00
Greg Kroah-Hartman	cd8fe5b6db	Merge 6.3-rc5 into driver-core-next We need the fixes in here for testing, as well as the driver core changes for documentation updates to build on. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2023-04-03 09:33:30 +02:00
Anton Protopopov	92b2e810f0	bpf: compute hashes in bloom filter similar to hashmap If the value size in a bloom filter is a multiple of 4, then the jhash2() function is used to compute hashes. The length parameter of this function equals to the number of 32-bit words in input. Compute it in the hot path instead of pre-computing it, as this is translated to one extra shift to divide the length by four vs. one extra memory load of a pre-computed length. Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Link: https://lore.kernel.org/r/20230402114340.3441-1-aspsk@isovalent.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-02 08:44:49 -07:00
Anton Protopopov	5b85575ad4	bpf: optimize hashmap lookups when key_size is divisible by 4 The BPF hashmap uses the jhash() hash function. There is an optimized version of this hash function which may be used if hash size is a multiple of 4. Apply this optimization to the hashmap in a similar way as it is done in the bloom filter map. On practice the optimization is only noticeable for smaller key sizes, which, however, is sufficient for many applications. An example is listed in the following table of measurements (a hashmap of 65536 elements was used): -------------------------------------------------------------------- \| key_size \| fullness \| lookups /sec \| lookups (opt) /sec \| gain \| -------------------------------------------------------------------- \| 4 \| 25% \| 42.990M \| 46.000M \| 7.0% \| \| 4 \| 50% \| 37.910M \| 39.094M \| 3.1% \| \| 4 \| 75% \| 34.486M \| 36.124M \| 4.7% \| \| 4 \| 100% \| 31.760M \| 32.719M \| 3.0% \| -------------------------------------------------------------------- \| 8 \| 25% \| 43.855M \| 49.626M \| 13.2% \| \| 8 \| 50% \| 38.328M \| 42.152M \| 10.0% \| \| 8 \| 75% \| 34.483M \| 38.088M \| 10.5% \| \| 8 \| 100% \| 31.306M \| 34.686M \| 10.8% \| -------------------------------------------------------------------- \| 12 \| 25% \| 38.398M \| 43.770M \| 14.0% \| \| 12 \| 50% \| 33.336M \| 37.712M \| 13.1% \| \| 12 \| 75% \| 29.917M \| 34.440M \| 15.1% \| \| 12 \| 100% \| 27.322M \| 30.480M \| 11.6% \| -------------------------------------------------------------------- \| 16 \| 25% \| 41.491M \| 41.921M \| 1.0% \| \| 16 \| 50% \| 36.206M \| 36.474M \| 0.7% \| \| 16 \| 75% \| 32.529M \| 33.027M \| 1.5% \| \| 16 \| 100% \| 29.581M \| 30.325M \| 2.5% \| -------------------------------------------------------------------- \| 20 \| 25% \| 34.240M \| 36.787M \| 7.4% \| \| 20 \| 50% \| 30.328M \| 32.663M \| 7.7% \| \| 20 \| 75% \| 27.536M \| 29.354M \| 6.6% \| \| 20 \| 100% \| 24.847M \| 26.505M \| 6.7% \| -------------------------------------------------------------------- \| 24 \| 25% \| 36.329M \| 40.608M \| 11.8% \| \| 24 \| 50% \| 31.444M \| 35.059M \| 11.5% \| \| 24 \| 75% \| 28.426M \| 31.452M \| 10.6% \| \| 24 \| 100% \| 26.278M \| 28.741M \| 9.4% \| -------------------------------------------------------------------- \| 28 \| 25% \| 31.540M \| 31.944M \| 1.3% \| \| 28 \| 50% \| 27.739M \| 28.063M \| 1.2% \| \| 28 \| 75% \| 24.993M \| 25.814M \| 3.3% \| \| 28 \| 100% \| 23.513M \| 23.500M \| -0.1% \| -------------------------------------------------------------------- \| 32 \| 25% \| 32.116M \| 33.953M \| 5.7% \| \| 32 \| 50% \| 28.879M \| 29.859M \| 3.4% \| \| 32 \| 75% \| 26.227M \| 26.948M \| 2.7% \| \| 32 \| 100% \| 23.829M \| 24.613M \| 3.3% \| -------------------------------------------------------------------- \| 64 \| 25% \| 22.535M \| 22.554M \| 0.1% \| \| 64 \| 50% \| 20.471M \| 20.675M \| 1.0% \| \| 64 \| 75% \| 19.077M \| 19.146M \| 0.4% \| \| 64 \| 100% \| 17.710M \| 18.131M \| 2.4% \| -------------------------------------------------------------------- The following script was used to gather the results (SMT & frequency off): cd tools/testing/selftests/bpf for key_size in 4 8 12 16 20 24 28 32 64; do for nr_entries in `seq 16384 16384 65536`; do fullness=$(printf '%3s' $((nr_entries*100/65536))) echo -n "key_size=$key_size: $fullness% full: " sudo ./bench -d2 -a bpf-hashmap-lookup --key_size=$key_size --nr_entries=$nr_entries --max_entries=65536 --nr_loops=2000000 --map_flags=0x40 \| grep cpu done echo done Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Link: https://lore.kernel.org/r/20230401200602.3275-1-aspsk@isovalent.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-01 15:08:19 -07:00
David Vernet	f85671c6ef	bpf: Remove now-defunct task kfuncs In commit `22df776a9a` ("tasks: Extract rcu_users out of union"), the 'refcount_t rcu_users' field was extracted out of a union with the 'struct rcu_head rcu' field. This allows us to safely perform a refcount_inc_not_zero() on task->rcu_users when acquiring a reference on a task struct. A prior patch leveraged this by making struct task_struct an RCU-protected object in the verifier, and by bpf_task_acquire() to use the task->rcu_users field for synchronization. Now that we can use RCU to protect tasks, we no longer need bpf_task_kptr_get(), or bpf_task_acquire_not_zero(). bpf_task_kptr_get() is truly completely unnecessary, as we can just use RCU to get the object. bpf_task_acquire_not_zero() is now equivalent to bpf_task_acquire(). In addition to these changes, this patch also updates the associated selftests to no longer use these kfuncs. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230331195733.699708-3-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-01 09:07:20 -07:00
David Vernet	d02c48fa11	bpf: Make struct task_struct an RCU-safe type struct task_struct objects are a bit interesting in terms of how their lifetime is protected by refcounts. task structs have two refcount fields: 1. refcount_t usage: Protects the memory backing the task struct. When this refcount drops to 0, the task is immediately freed, without waiting for an RCU grace period to elapse. This is the field that most callers in the kernel currently use to ensure that a task remains valid while it's being referenced, and is what's currently tracked with bpf_task_acquire() and bpf_task_release(). 2. refcount_t rcu_users: A refcount field which, when it drops to 0, schedules an RCU callback that drops a reference held on the 'usage' field above (which is acquired when the task is first created). This field therefore provides a form of RCU protection on the task by ensuring that at least one 'usage' refcount will be held until an RCU grace period has elapsed. The qualifier "a form of" is important here, as a task can remain valid after task->rcu_users has dropped to 0 and the subsequent RCU gp has elapsed. In terms of BPF, we want to use task->rcu_users to protect tasks that function as referenced kptrs, and to allow tasks stored as referenced kptrs in maps to be accessed with RCU protection. Let's first determine whether we can safely use task->rcu_users to protect tasks stored in maps. All of the bpf_task* kfuncs can only be called from tracepoint, struct_ops, or BPF_PROG_TYPE_SCHED_CLS, program types. For tracepoint and struct_ops programs, the struct task_struct passed to a program handler will always be trusted, so it will always be safe to call bpf_task_acquire() with any task passed to a program. Note, however, that we must update bpf_task_acquire() to be KF_RET_NULL, as it is possible that the task has exited by the time the program is invoked, even if the pointer is still currently valid because the main kernel holds a task->usage refcount. For BPF_PROG_TYPE_SCHED_CLS, tasks should never be passed as an argument to the any program handlers, so it should not be relevant. The second question is whether it's safe to use RCU to access a task that was acquired with bpf_task_acquire(), and stored in a map. Because bpf_task_acquire() now uses task->rcu_users, it follows that if the task is present in the map, that it must have had at least one task->rcu_users refcount by the time the current RCU cs was started. Therefore, it's safe to access that task until the end of the current RCU cs. With all that said, this patch makes struct task_struct is an RCU-protected object. In doing so, we also change bpf_task_acquire() to be KF_ACQUIRE \| KF_RCU \| KF_RET_NULL, and adjust any selftests as necessary. A subsequent patch will remove bpf_task_kptr_get(), and bpf_task_acquire_not_zero() respectively. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20230331195733.699708-2-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-04-01 09:07:20 -07:00
Jacob Pan	cd3891158a	iommu/sva: Move PASID helpers to sva code Preparing to remove IOASID infrastructure, PASID management will be under SVA code. Decouple mm code from IOASID. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20230322200803.869130-3-jacob.jun.pan@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>	2023-03-31 10:03:23 +02:00
Linus Torvalds	62bad54b26	Merge tag 'dma-mapping-6.3-2023-03-31' of git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping fixes from Christoph Hellwig: - fix for swiotlb deadlock due to wrong alignment checks (GuoRui.Yu, Petr Tesarik) * tag 'dma-mapping-6.3-2023-03-31' of git://git.infradead.org/users/hch/dma-mapping: swiotlb: fix slot alignment checks swiotlb: use wrap_area_index() instead of open-coding it swiotlb: fix the deadlock in swiotlb_do_find_slots	2023-03-30 16:09:37 -07:00
Jakub Kicinski	79548b7984	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Conflicts: drivers/net/ethernet/mediatek/mtk_ppe.c `3fbe4d8c0e` ("net: ethernet: mtk_eth_soc: ppe: add support for flow accounting") `924531326e` ("net: ethernet: mtk_eth_soc: add missing ppe cache flush when deleting a flow") Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-03-30 14:43:03 -07:00
David Vernet	e4c2acab95	bpf: Handle PTR_MAYBE_NULL case in PTR_TO_BTF_ID helper call arg When validating a helper function argument, we use check_reg_type() to ensure that the register containing the argument is of the correct type. When the register's base type is PTR_TO_BTF_ID, there is some supplemental logic where we do extra checks for various combinations of PTR_TO_BTF_ID type modifiers. For example, for PTR_TO_BTF_ID, PTR_TO_BTF_ID \| PTR_TRUSTED, and PTR_TO_BTF_ID \| MEM_RCU, we call map_kptr_match_type() for bpf_kptr_xchg() calls, and btf_struct_ids_match() for other helper calls. When an unhandled PTR_TO_BTF_ID type modifier combination is passed to check_reg_type(), the verifier fails with an internal verifier error message. This can currently be triggered by passing a PTR_MAYBE_NULL pointer to helper functions (currently just bpf_kptr_xchg()) with an ARG_PTR_TO_BTF_ID_OR_NULL arg type. For example, by callin bpf_kptr_xchg(&v->kptr, bpf_cpumask_create()). Whether or not passing a PTR_MAYBE_NULL arg to an ARG_PTR_TO_BTF_ID_OR_NULL argument is valid is an interesting question. In a vacuum, it seems fine. A helper function with an ARG_PTR_TO_BTF_ID_OR_NULL arg would seem to be implying that it can handle either a NULL or non-NULL arg, and has logic in place to detect and gracefully handle each. This is the case for bpf_kptr_xchg(), which of course simply does an xchg(). On the other hand, bpf_kptr_xchg() also specifies OBJ_RELEASE, and refcounting semantics for a PTR_MAYBE_NULL pointer is different than handling it for a NULL _OR_ non-NULL pointer. For example, with a non-NULL arg, we should always fail if there was not a nonzero refcount for the value in the register being passed to the helper. For PTR_MAYBE_NULL on the other hand, it's unclear. If the pointer is NULL it would be fine, but if it's not NULL, it would be incorrect to load the program. The current solution to this is to just fail if PTR_MAYBE_NULL is passed, and to instead require programs to have a NULL check to explicitly handle the NULL and non-NULL cases. This seems reasonable. Not only would it possibly be quite complicated to correctly handle PTR_MAYBE_NULL refcounting in the verifier, but it's also an arguably odd programming pattern in general to not explicitly handle the NULL case anyways. For example, it seems odd to not care about whether a pointer you're passing to bpf_kptr_xchg() was successfully allocated in a program such as the following: private(MASK) static struct bpf_cpumask __kptr * global_mask; SEC("tp_btf/task_newtask") int BPF_PROG(example, struct task_struct task, u64 clone_flags) { struct bpf_cpumask prev; /* bpf_cpumask_create() returns PTR_MAYBE_NULL */ prev = bpf_kptr_xchg(&global_mask, bpf_cpumask_create()); if (prev) bpf_cpumask_release(prev); return 0; } This patch therefore updates the verifier to explicitly check for PTR_MAYBE_NULL in check_reg_type(), and fail gracefully if it's observed. This isn't really "fixing" anything unsafe or incorrect. We're just updating the verifier to fail gracefully, and explicitly handle this pattern rather than unintentionally falling back to an internal verifier error path. A subsequent patch will update selftests. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230330145203.80506-1-void@manifault.com	2023-03-30 14:12:22 -07:00
Geert Uytterhoeven	b31507dcaf	dma-debug: Use %pa to format phys_addr_t On 32-bit without LPAE: kernel/dma/debug.c: In function ‘debug_dma_dump_mappings’: kernel/dma/debug.c:537:7: warning: format ‘%llx’ expects argument of type ‘long long unsigned int’, but argument 9 has type ‘phys_addr_t’ {aka ‘unsigned int’} [-Wformat=] kernel/dma/debug.c: In function ‘dump_show’: kernel/dma/debug.c:568:59: warning: format ‘%llx’ expects argument of type ‘long long unsigned int’, but argument 11 has type ‘phys_addr_t’ {aka ‘unsigned int’} [-Wformat=] Fixes: `bd89d69a52` ("dma-debug: add cacheline to user/kernel space dump messages") Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/r/202303160548.ReyuTsGD-lkp@intel.com Reported-by: noreply@ellerman.id.au Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Christoph Hellwig <hch@lst.de>	2023-03-30 08:51:52 +09:00
Waiman Long	6667439f51	cgroup/cpuset: Include offline CPUs when tasks' cpumasks in top_cpuset are updated Similar to commit `3fb906e7fa` ("group/cpuset: Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks"), the whole set of possible CPUs including offline ones should be used for setting cpumasks for tasks in the top cpuset when a cpuset partition is modified as the hotplug code won't update cpumasks for tasks in the top cpuset when CPUs become online or offline. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-03-29 08:42:09 -10:00
Waiman Long	df59b72cd8	cgroup/cpuset: Skip task update if hotplug doesn't affect current cpuset If a hotplug event doesn't affect the current cpuset, there is no point to call hotplug_update_tasks() or hotplug_update_tasks_legacy(). So just skip it. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2023-03-29 08:41:28 -10:00
Tejun Heo	5dcad50636	Merge branch 'for-6.3-fixes' into for-6.4 To receive `292fd843de` ("cgroup/cpuset: Fix partition root's cpuset.cpus update bug") in preparation for further cpuset updates.	2023-03-29 08:39:46 -10:00
Waiman Long	292fd843de	cgroup/cpuset: Fix partition root's cpuset.cpus update bug It was found that commit `7a2127e66a` ("cpuset: Call set_cpus_allowed_ptr() with appropriate mask for task") introduced a bug that corrupted "cpuset.cpus" of a partition root when it was updated. It is because the tmp->new_cpus field of the passed tmp parameter of update_parent_subparts_cpumask() should not be used at all as it contains important cpumask data that should not be overwritten. Fix it by using tmp->addmask instead. Also update update_cpumask() to make sure that trialcs->cpu_allowed will not be corrupted until it is no longer needed. Fixes: `7a2127e66a` ("cpuset: Call set_cpus_allowed_ptr() with appropriate mask for task") Signed-off-by: Waiman Long <longman@redhat.com> Cc: stable@vger.kernel.org # v6.2+ Signed-off-by: Tejun Heo <tj@kernel.org>	2023-03-29 08:37:07 -10:00
Steven Rostedt (Google)	88fe1ec75f	tracing: Unbreak user events The user events was added a bit prematurely, and there were a few kernel developers that had issues with it. The API also needed a bit of work to make sure it would be stable. It was decided to make user events "broken" until this was settled. Now it has a new API that appears to be as stable as it will be without the use of a crystal ball. It's being used within Microsoft as is, which means the API has had some testing in real world use cases. It went through many discussions in the bi-weekly tracing meetings, and there's been no more comments about updates. I feel this is good to go. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-03-29 06:52:09 -04:00
Steven Rostedt (Google)	4bec284cc0	tracing/user_events: Use print_format_fields() for trace output Currently, user events are shown using the "hex" output for "safety" reasons as one cannot trust user events behaving nicely. But the hex output is not the only utility for safe outputting of trace events. The print_event_fields() is just as safe and gives user readable output. Before: example-839 [001] ..... 43.222244: 00000000: b1 06 00 00 47 03 00 00 00 00 00 00 ....G....... example-839 [001] ..... 43.564433: 00000000: b1 06 00 00 47 03 00 00 01 00 00 00 ....G....... example-839 [001] ..... 43.763917: 00000000: b1 06 00 00 47 03 00 00 02 00 00 00 ....G....... example-839 [001] ..... 43.967929: 00000000: b1 06 00 00 47 03 00 00 03 00 00 00 ....G....... After: example-837 [006] ..... 55.739249: test: count=0x0 (0) example-837 [006] ..... 111.104784: test: count=0x1 (1) example-837 [006] ..... 111.268444: test: count=0x2 (2) example-837 [006] ..... 111.416533: test: count=0x3 (3) example-837 [006] ..... 111.542859: test: count=0x4 (4) Link: https://lore.kernel.org/linux-trace-kernel/20230328151413.4770b8d7@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-03-29 06:52:09 -04:00
Beau Belgrave	a4c40c1349	tracing/user_events: Align structs with tabs for readability Add tabs to make struct members easier to read and unify the style of the code. Link: https://lkml.kernel.org/r/20230328235219.203-13-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-03-29 06:52:09 -04:00
Beau Belgrave	ce58e96e9f	tracing/user_events: Limit global user_event count Operators want to be able to ensure enough tracepoints exist on the system for kernel components as well as for user components. Since there are only up to 64K events, by default allow up to half to be used by user events. Add a kernel sysctl parameter (kernel.user_events_max) to set a global limit that is honored among all groups on the system. This ensures hard limits can be setup to prevent user processes from consuming all event IDs on the system. Link: https://lkml.kernel.org/r/20230328235219.203-12-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-03-29 06:52:09 -04:00
Beau Belgrave	f9cce238ee	tracing/user_events: Charge event allocs to cgroups Operators need a way to limit how much memory cgroups use. User events need to be included into that accounting. Fix this by using GFP_KERNEL_ACCOUNT for allocations generated by user programs for user_event tracing. Link: https://lkml.kernel.org/r/20230328235219.203-11-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-03-29 06:52:09 -04:00
Beau Belgrave	dcb8177c13	tracing/user_events: Add ioctl for disabling addresses Enablements are now tracked by the lifetime of the task/mm. User processes need to be able to disable their addresses if tracing is requested to be turned off. Before unmapping the page would suffice. However, we now need a stronger contract. Add an ioctl to enable this. A new flag bit is added, freeing, to user_event_enabler to ensure that if the event is attempted to be removed while a fault is being handled that the remove is delayed until after the fault is reattempted. Link: https://lkml.kernel.org/r/20230328235219.203-6-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-03-29 06:52:08 -04:00
Beau Belgrave	81f8fb6549	tracing/user_events: Fixup enable faults asyncly When events are enabled within the various tracing facilities, such as ftrace/perf, the event_mutex is held. As events are enabled pages are accessed. We do not want page faults to occur under this lock. Instead queue the fault to a workqueue to be handled in a process context safe way without the lock. The enable address is marked faulting while the async fault-in occurs. This ensures that we don't attempt to fault-in more than is necessary. Once the page has been faulted in, an address write is re-attempted. If the page couldn't fault-in, then we wait until the next time the event is enabled to prevent any potential infinite loops. Link: https://lkml.kernel.org/r/20230328235219.203-5-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-03-29 06:52:08 -04:00
Beau Belgrave	7235759084	tracing/user_events: Use remote writes for event enablement As part of the discussions for user_events aligned with user space tracers, it was determined that user programs should register a aligned value to set or clear a bit when an event becomes enabled. Currently a shared page is being used that requires mmap(). Remove the shared page implementation and move to a user registered address implementation. In this new model during the event registration from user programs 3 new values are specified. The first is the address to update when the event is either enabled or disabled. The second is the bit to set/clear to reflect the event being enabled. The third is the size of the value at the specified address. This allows for a local 32/64-bit value in user programs to support both kernel and user tracers. As an example, setting bit 31 for kernel tracers when the event becomes enabled allows for user tracers to use the other bits for ref counts or other flags. The kernel side updates the bit atomically, user programs need to also update these values atomically. User provided addresses must be aligned on a natural boundary, this allows for single page checking and prevents odd behaviors such as a enable value straddling 2 pages instead of a single page. Currently page faults are only logged, future patches will handle these. Link: https://lkml.kernel.org/r/20230328235219.203-4-beaub@linux.microsoft.com Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-03-29 06:52:08 -04:00
Beau Belgrave	fd593511cd	tracing/user_events: Track fork/exec/exit for mm lifetime During tracefs discussions it was decided instead of requiring a mapping within a user-process to track the lifetime of memory descriptors we should hook the appropriate calls. Do this by adding the minimal stubs required for task fork, exec, and exit. Currently this is just a NOP. Future patches will implement these calls fully. Link: https://lkml.kernel.org/r/20230328235219.203-3-beaub@linux.microsoft.com Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-03-29 06:52:08 -04:00
Beau Belgrave	e5a26a4048	tracing/user_events: Split header into uapi and kernel The UAPI parts need to be split out from the kernel parts of user_events now that other parts of the kernel will reference it. Do so by moving the existing include/linux/user_events.h into include/uapi/linux/user_events.h. Link: https://lkml.kernel.org/r/20230328235219.203-2-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-03-29 06:52:08 -04:00
Steven Rostedt (Google)	80a76994b2	tracing: Add "fields" option to show raw trace event fields The hex, raw and bin formats come from the old PREEMPT_RT patch set latency tracer. That actually gave real alternatives to reading the ascii buffer. But they have started to bit rot and they do not give a good representation of the tracing data. Add "fields" option that will read the trace event fields and parse the data from how the fields are defined: With "fields" = 0 (default) echo 1 > events/sched/sched_switch/enable cat trace <idle>-0 [003] d..2. 540.078653: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/3:1 next_pid=83 next_prio=120 kworker/3:1-83 [003] d..2. 540.078860: sched_switch: prev_comm=kworker/3:1 prev_pid=83 prev_prio=120 prev_state=I ==> next_comm=swapper/3 next_pid=0 next_prio=120 <idle>-0 [003] d..2. 540.206423: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=sshd next_pid=807 next_prio=120 sshd-807 [003] d..2. 540.206531: sched_switch: prev_comm=sshd prev_pid=807 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120 <idle>-0 [001] d..2. 540.206597: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/u16:4 next_pid=58 next_prio=120 kworker/u16:4-58 [001] d..2. 540.206617: sched_switch: prev_comm=kworker/u16:4 prev_pid=58 prev_prio=120 prev_state=I ==> next_comm=bash next_pid=830 next_prio=120 bash-830 [001] d..2. 540.206678: sched_switch: prev_comm=bash prev_pid=830 prev_prio=120 prev_state=R ==> next_comm=kworker/u16:4 next_pid=58 next_prio=120 kworker/u16:4-58 [001] d..2. 540.206696: sched_switch: prev_comm=kworker/u16:4 prev_pid=58 prev_prio=120 prev_state=I ==> next_comm=bash next_pid=830 next_prio=120 bash-830 [001] d..2. 540.206713: sched_switch: prev_comm=bash prev_pid=830 prev_prio=120 prev_state=R ==> next_comm=kworker/u16:4 next_pid=58 next_prio=120 echo 1 > options/fields <...>-998 [002] d..2. 538.643732: sched_switch: next_prio=0x78 (120) next_pid=0x0 (0) next_comm=swapper/2 prev_state=0x20 (32) prev_prio=0x78 (120) prev_pid=0x3e6 (998) prev_comm=trace-cmd <idle>-0 [001] d..2. 538.643806: sched_switch: next_prio=0x78 (120) next_pid=0x33e (830) next_comm=bash prev_state=0x0 (0) prev_prio=0x78 (120) prev_pid=0x0 (0) prev_comm=swapper/1 bash-830 [001] d..2. 538.644106: sched_switch: next_prio=0x78 (120) next_pid=0x3a (58) next_comm=kworker/u16:4 prev_state=0x0 (0) prev_prio=0x78 (120) prev_pid=0x33e (830) prev_comm=bash kworker/u16:4-58 [001] d..2. 538.644130: sched_switch: next_prio=0x78 (120) next_pid=0x33e (830) next_comm=bash prev_state=0x80 (128) prev_prio=0x78 (120) prev_pid=0x3a (58) prev_comm=kworker/u16:4 bash-830 [001] d..2. 538.644180: sched_switch: next_prio=0x78 (120) next_pid=0x3a (58) next_comm=kworker/u16:4 prev_state=0x0 (0) prev_prio=0x78 (120) prev_pid=0x33e (830) prev_comm=bash kworker/u16:4-58 [001] d..2. 538.644185: sched_switch: next_prio=0x78 (120) next_pid=0x33e (830) next_comm=bash prev_state=0x80 (128) prev_prio=0x78 (120) prev_pid=0x3a (58) prev_comm=kworker/u16:4 bash-830 [001] d..2. 538.644204: sched_switch: next_prio=0x78 (120) next_pid=0x0 (0) next_comm=swapper/1 prev_state=0x1 (1) prev_prio=0x78 (120) prev_pid=0x33e (830) prev_comm=bash <idle>-0 [003] d..2. 538.644211: sched_switch: next_prio=0x78 (120) next_pid=0x327 (807) next_comm=sshd prev_state=0x0 (0) prev_prio=0x78 (120) prev_pid=0x0 (0) prev_comm=swapper/3 sshd-807 [003] d..2. 538.644340: sched_switch: next_prio=0x78 (120) next_pid=0x0 (0) next_comm=swapper/3 prev_state=0x1 (1) prev_prio=0x78 (120) prev_pid=0x327 (807) prev_comm=sshd It traces the data safely without using the trace print formatting. Link: https://lore.kernel.org/linux-trace-kernel/20230328145156.497651be@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-03-29 06:52:08 -04:00
Vlastimil Babka	ed4cdfbeb8	Merge branch 'slab/for-6.4/slob-removal' into slab/for-next A series by myself to remove CONFIG_SLOB: The SLOB allocator was deprecated in 6.2 and there have been no complaints so far so let's proceed with the removal. Besides the code cleanup, the main immediate benefit will be allowing kfree() family of function to work on kmem_cache_alloc() objects, which was incompatible with SLOB. This includes kfree_rcu() which had no kmem_cache_free_rcu() counterpart yet and now it shouldn't be necessary anymore. Otherwise it's all straightforward removal. After this series, 'git grep slob' or 'git grep SLOB' will have 3 remaining relevant hits in non-mm code: - tomoyo - patch submitted and carried there, doesn't need to wait for this series - skbuff - patch to cleanup now-unnecessary #ifdefs will be posted to netdev after this is merged, as requested to avoid conflicts - ftrace ring_buffer - patch to remove obsolete comment is carried there The rest of 'git grep SLOB' hits are false positives, or intentional (CREDITS, and mm/Kconfig SLUB_TINY description to help those that will happen to migrate later).	2023-03-29 10:48:39 +02:00
Vlastimil Babka	c9929f0e34	mm/slob: remove CONFIG_SLOB Remove SLOB from Kconfig and Makefile. Everything under #ifdef CONFIG_SLOB, and mm/slob.c is now dead code. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>	2023-03-29 10:31:40 +02:00
Nicholas Piggin	2655421ae6	lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme On big systems, the mm refcount can become highly contented when doing a lot of context switching with threaded applications. user<->idle switch is one of the important cases. Abandoning lazy tlb entirely slows this switching down quite a bit in the common uncontended case, so that is not viable. Implement a scheme where lazy tlb mm references do not contribute to the refcount, instead they get explicitly removed when the refcount reaches zero. The final mmdrop() sends IPIs to all CPUs in the mm_cpumask and they switch away from this mm to init_mm if it was being used as the lazy tlb mm. Enabling the shoot lazies option therefore requires that the arch ensures that mm_cpumask contains all CPUs that could possibly be using mm. A DEBUG_VM option IPIs every CPU in the system after this to ensure there are no references remaining before the mm is freed. Shootdown IPIs cost could be an issue, but they have not been observed to be a serious problem with this scheme, because short-lived processes tend not to migrate CPUs much, therefore they don't get much chance to leave lazy tlb mm references on remote CPUs. There are a lot of options to reduce them if necessary, described in comments. The near-worst-case can be benchmarked with will-it-scale: context_switch1_threads -t $(($(nproc) / 2)) This will create nproc threads (nproc / 2 switching pairs) all sharing the same mm that spread over all CPUs so each CPU does thread->idle->thread switching. [ Rik came up with basically the same idea a few years ago, so credit to him for that. ] Link: https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/ Link: https://lore.kernel.org/all/20180728215357.3249-11-riel@surriel.com/ Link: https://lkml.kernel.org/r/20230203071837.1136453-5-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:08 -07:00
Nicholas Piggin	aa464ba9a1	lazy tlb: introduce lazy tlb mm refcount helper functions Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting. This makes the lazy tlb mm references more obvious, and allows the refcounting scheme to be modified in later changes. There is no functional change with this patch. Link: https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:08 -07:00

... 4 5 6 7 8 ...

41771 Commits