Daniel Borkmann says:
====================
pull-request: bpf 2024-02-22
The following pull-request contains BPF updates for your *net* tree.
We've added 11 non-merge commits during the last 24 day(s) which contain
a total of 15 files changed, 217 insertions(+), 17 deletions(-).
The main changes are:
1) Fix a syzkaller-triggered oops when attempting to read the vsyscall
page through bpf_probe_read_kernel and friends, from Hou Tao.
2) Fix a kernel panic due to uninitialized iter position pointer in
bpf_iter_task, from Yafang Shao.
3) Fix a race between bpf_timer_cancel_and_free and bpf_timer_cancel,
from Martin KaFai Lau.
4) Fix a xsk warning in skb_add_rx_frag() (under CONFIG_DEBUG_NET)
due to incorrect truesize accounting, from Sebastian Andrzej Siewior.
5) Fix a NULL pointer dereference in sk_psock_verdict_data_ready,
from Shigeru Yoshida.
6) Fix a resolve_btfids warning when bpf_cpumask symbol cannot be
resolved, from Hari Bathini.
bpf-for-netdev
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf, sockmap: Fix NULL pointer dereference in sk_psock_verdict_data_ready()
selftests/bpf: Add negtive test cases for task iter
bpf: Fix an issue due to uninitialized bpf_iter_task
selftests/bpf: Test racing between bpf_timer_cancel_and_free and bpf_timer_cancel
bpf: Fix racing between bpf_timer_cancel_and_free and bpf_timer_cancel
selftest/bpf: Test the read of vsyscall page under x86-64
x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault()
x86/mm: Move is_vsyscall_vaddr() into asm/vsyscall.h
bpf, scripts: Correct GPL license name
xsk: Add truesize to skb_add_rx_frag().
bpf: Fix warning for bpf_cpumask in verifier
====================
Link: https://lore.kernel.org/r/20240221231826.1404-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
On a 8-socket server the TSC is wrongly marked as 'unstable' and disabled
during boot time on about one out of 120 boot attempts:
clocksource: timekeeping watchdog on CPU227: wd-tsc-wd excessive read-back delay of 153560ns vs. limit of 125000ns,
wd-wd read-back delay only 11440ns, attempt 3, marking tsc unstable
tsc: Marking TSC unstable due to clocksource watchdog
TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
sched_clock: Marking unstable (119294969739, 159204297)<-(125446229205, -5992055152)
clocksource: Checking clocksource tsc synchronization from CPU 319 to CPUs 0,99,136,180,210,542,601,896.
clocksource: Switched to clocksource hpet
The reason is that for platform with a large number of CPUs, there are
sporadic big or huge read latencies while reading the watchog/clocksource
during boot or when system is under stress work load, and the frequency and
maximum value of the latency goes up with the number of online CPUs.
The cCurrent code already has logic to detect and filter such high latency
case by reading the watchdog twice and checking the two deltas. Due to the
randomness of the latency, there is a low probabilty that the first delta
(latency) is big, but the second delta is small and looks valid. The
watchdog code retries the readouts by default twice, which is not
necessarily sufficient for systems with a large number of CPUs.
There is a command line parameter 'max_cswd_read_retries' which allows to
increase the number of retries, but that's not user friendly as it needs to
be tweaked per system. As the number of required retries is proportional to
the number of online CPUs, this parameter can be calculated at runtime.
Scale and enlarge the number of retries according to the number of online
CPUs and remove the command line parameter completely.
[ tglx: Massaged change log and comments ]
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jin Wang <jin1.wang@intel.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20240221060859.1027450-1-feng.tang@intel.com
Right now we determine the scope of the signal based on the type of
pidfd. There are use-cases where it's useful to override the scope of
the signal. For example in [1]. Add flags to determine the scope of the
signal:
(1) PIDFD_SIGNAL_THREAD: send signal to specific thread reference by @pidfd
(2) PIDFD_SIGNAL_THREAD_GROUP: send signal to thread-group of @pidfd
(2) PIDFD_SIGNAL_PROCESS_GROUP: send signal to process-group of @pidfd
Since we now allow specifying PIDFD_SEND_PROCESS_GROUP for
pidfd_send_signal() to send signals to process groups we need to adjust
the check restricting si_code emulation by userspace to account for
PIDTYPE_PGID.
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://github.com/systemd/systemd/issues/31093 [1]
Link: https://lore.kernel.org/r/20240210-chihuahua-hinzog-3945b6abd44a@brauner
Link: https://lore.kernel.org/r/20240214123655.GB16265@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
- set_work_data() takes a separate @flags argument but just ORs it to @data.
This is more confusing than helpful. Just take @data.
- Use the name @flags consistently and add the parameter to
set_work_pool_and_{keep|clear}_pending(). This will be used by the planned
disable/enable support.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
clear_work_data() is only used in one place and immediately followed by
smp_mb(), making it equivalent to set_work_pool_and_clear_pending() w/
WORK_OFFQ_POOL_NONE for @pool_id. Drop it. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
The planned disable/enable support will need the same logic. Let's factor it
out. No functional changes.
v2: Update function comment to include @irq_flags.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
The bits of work->data are used for a few different purposes. How the bits
are used is determined by enum work_bits. The planned disable/enable support
will add another use, so let's clean it up a bit in preparation.
- Let WORK_STRUCT_*_BIT's values be determined by enum definition order.
- Deliminate different bit sections the same way using SHIFT and BITS
values.
- Rename __WORK_OFFQ_CANCELING to WORK_OFFQ_CANCELING_BIT for consistency.
- Introduce WORK_STRUCT_PWQ_SHIFT and replace WORK_STRUCT_FLAG_MASK and
WORK_STRUCT_WQ_DATA_MASK with WQ_STRUCT_PWQ_MASK for clarity.
- Improve documentation.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
The cancel path used bool @is_dwork to distinguish canceling a regular work
and a delayed one. The planned disable/enable support will need passing
around another flag in the code path. As passing them around with bools will
be confusing, let's introduce named flags to pass around in the cancel path.
WORK_CANCEL_DELAYED replaces @is_dwork. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Using the generic term `flags` for irq flags is conventional but can be
confusing as there's quite a bit of code dealing with work flags which
involves some subtleties. Let's use a more explicit name `irq_flags` for
local irq flags. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
They are currently a bit disorganized with flush and cancel functions mixed.
Reoranize them so that flush functions come first, cancel next and
cancel_sync last. This way, we won't have to add prototypes for internal
functions for the planned disable/enable support.
This is pure code reorganization. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
__cancel_work_timer() is used to implement cancel_work_sync() and
cancel_delayed_work_sync(), similarly to how __cancel_work() is used to
implement cancel_work() and cancel_delayed_work(). ie. The _timer part of
the name is a complete misnomer. The difference from __cancel_work() is the
fact that it syncs against work item execution not whether it handles timers
or not.
Let's rename it to less confusing __cancel_work_sync(). No functional
change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
The different flavors of RCU read critical sections have been unified. Let's
update the locking assertion macros accordingly to avoid requiring
unnecessary explicit rcu_read_[un]lock() calls.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reorder some global declarations and adjust comments and whitespaces for
clarity and consistency. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Commit 94f8f319cb ("drm: Remove Kconfig option for legacy support
(CONFIG_DRM_LEGACY)") removes the config DRM_LEGACY, but one reference to
that config is left in the hardening.config fragment.
As there is no drm legacy driver left, we do not need to recommend this
attack surface reduction anymore.
Drop this reference in hardening.config fragment.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Link: https://lore.kernel.org/r/20240208091045.9219-3-lukas.bulwahn@gmail.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Commit 7a628f818499 ("ubsan: Remove CONFIG_UBSAN_SANITIZE_ALL") removes the
config UBSAN_SANITIZE_ALL, but one reference to that config is left in the
hardening.config fragment.
Drop this reference in hardening.config fragment.
Note that CONFIG_UBSAN is still enabled in the hardening.config fragment,
so the functionality when using this fragment remains the same.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Link: https://lore.kernel.org/r/20240208091045.9219-2-lukas.bulwahn@gmail.com
Signed-off-by: Kees Cook <keescook@chromium.org>
On some systems, sys_membarrier can be very expensive, causing overall
slowdowns for everything. So put a lock on the path in order to
serialize the accesses to prevent the ability for this to be called at
too high of a frequency and saturate the machine.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-and-tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Fixes: 22e4ebb975 ("membarrier: Provide expedited private command")
Fixes: c5f58bd58f ("membarrier: Provide GLOBAL_EXPEDITED command")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Users of the IRQCHIP_PLATFORM_DRIVER_{BEGIN,END} helpers rely on a fwspec
containing only the fwnode (and crucially a number of parameters set to 0)
together with a DOMAIN_BUS_ANY token to check whether a parent irqchip has
probed and registered a domain.
Since de1ff306dc ("genirq/irqdomain: Remove the param count restriction
from select()"), ops->select() is called unconditionally, meaning that
irqchips implementing select() now need to handle ANY as a match.
Instead of adding more esoteric checks to the individual drivers, add that
condition to irq_find_matching_fwspec(), and let it handle the corner case,
as per the comment in the function.
This restores the functionality of the above helpers.
Fixes: de1ff306dc ("genirq/irqdomain: Remove the param count restriction from select()")
Reported-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Reported-by: Biju Das <biju.das.jz@bp.renesas.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Tested-by: Biju Das <biju.das.jz@bp.renesas.com>
Link: https://lore.kernel.org/r/20240220114731.1898534-1-maz@kernel.org
Link: https://lore.kernel.org/r/20240219-gic-fix-child-domain-v1-1-09f8fd2d9a8f@linaro.org
This reverts the following two commits:
- a555bdd0c5 ("Kbuild: enable TRIM_UNUSED_KSYMS again, with some guarding")
- 5cf0fd591f ("Kbuild: disable TRIM_UNUSED_KSYMS option")
Commit 5e9e95cc91 ("kbuild: implement CONFIG_TRIM_UNUSED_KSYMS without
recursion") solved the build time issue.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
There's a conflict between this recent upstream fix:
dad6a09f31 ("hrtimer: Report offline hrtimer enqueue")
and a pending commit in the timers tree:
1a4729ecaf ("hrtimers: Move hrtimer base related definitions into hrtimer_defs.h")
Resolve it by applying the upstream fix to the new <linux/hrtimer_defs.h> header.
Conflict:
include/linux/hrtimer.h
Semantic conflict:
include/linux/hrtimer_defs.h
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Failure to initialize it->pos, coupled with the presence of an invalid
value in the flags variable, can lead to it->pos referencing an invalid
task, potentially resulting in a kernel panic. To mitigate this risk, it's
crucial to ensure proper initialization of it->pos to NULL.
Fixes: ac8148d957 ("bpf: bpf_iter_task_next: use next_task(kit->task) rather than next_task(kit->pos)")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/bpf/20240217114152.1623-2-laoar.shao@gmail.com
The following race is possible between bpf_timer_cancel_and_free
and bpf_timer_cancel. It will lead a UAF on the timer->timer.
bpf_timer_cancel();
spin_lock();
t = timer->time;
spin_unlock();
bpf_timer_cancel_and_free();
spin_lock();
t = timer->timer;
timer->timer = NULL;
spin_unlock();
hrtimer_cancel(&t->timer);
kfree(t);
/* UAF on t */
hrtimer_cancel(&t->timer);
In bpf_timer_cancel_and_free, this patch frees the timer->timer
after a rcu grace period. This requires a rcu_head addition
to the "struct bpf_hrtimer". Another kfree(t) happens in bpf_timer_init,
this does not need a kfree_rcu because it is still under the
spin_lock and timer->timer has not been visible by others yet.
In bpf_timer_cancel, rcu_read_lock() is added because this helper
can be used in a non rcu critical section context (e.g. from
a sleepable bpf prog). Other timer->timer usages in helpers.c
have been audited, bpf_timer_cancel() is the only place where
timer->timer is used outside of the spin_lock.
Another solution considered is to mark a t->flag in bpf_timer_cancel
and clear it after hrtimer_cancel() is done. In bpf_timer_cancel_and_free,
it busy waits for the flag to be cleared before kfree(t). This patch
goes with a straight forward solution and frees timer->timer after
a rcu grace period.
Fixes: b00628b1c7 ("bpf: Introduce bpf timers.")
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/bpf/20240215211218.990808-1-martin.lau@linux.dev
So far, get_device_system_crosststamp() unconditionally passes
system_counterval.cycles to timekeeping_cycles_to_ns(). But when
interpolating system time (do_interp == true), system_counterval.cycles is
before tkr_mono.cycle_last, contrary to the timekeeping_cycles_to_ns()
expectations.
On x86, CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE will mitigate on
interpolating, setting delta to 0. With delta == 0, xtstamp->sys_monoraw
and xtstamp->sys_realtime are then set to the last update time, as
implicitly expected by adjust_historical_crosststamp(). On other
architectures, the resulting nonsense xtstamp->sys_monoraw and
xtstamp->sys_realtime corrupt the xtstamp (ts) adjustment in
adjust_historical_crosststamp().
Fix this by deriving xtstamp->sys_monoraw and xtstamp->sys_realtime from
the last update time when interpolating, by using the local variable
"cycles". The local variable already has the right value when
interpolating, unlike system_counterval.cycles.
Fixes: 2c756feb18 ("time: Add history to cross timestamp interface supporting slower devices")
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/20231218073849.35294-4-peter.hilber@opensynergy.com
The cycle_between() helper checks if parameter test is in the open interval
(before, after). Colloquially speaking, this also applies to the counter
wrap-around special case before > after. get_device_system_crosststamp()
currently uses cycle_between() at the first call site to decide whether to
interpolate for older counter readings.
get_device_system_crosststamp() has the following problem with
cycle_between() testing against an open interval: Assume that, by chance,
cycles == tk->tkr_mono.cycle_last (in the following, "cycle_last" for
brevity). Then, cycle_between() at the first call site, with effective
argument values cycle_between(cycle_last, cycles, now), returns false,
enabling interpolation. During interpolation,
get_device_system_crosststamp() will then call cycle_between() at the
second call site (if a history_begin was supplied). The effective argument
values are cycle_between(history_begin->cycles, cycles, cycles), since
system_counterval.cycles == interval_start == cycles, per the assumption.
Due to the test against the open interval, cycle_between() returns false
again. This causes get_device_system_crosststamp() to return -EINVAL.
This failure should be avoided, since get_device_system_crosststamp() works
both when cycles follows cycle_last (no interpolation), and when cycles
precedes cycle_last (interpolation). For the case cycles == cycle_last,
interpolation is actually unneeded.
Fix this by changing cycle_between() into timestamp_in_interval(), which
now checks against the closed interval, rather than the open interval.
This changes the get_device_system_crosststamp() behavior for three corner
cases:
1. Bypass interpolation in the case cycles == tk->tkr_mono.cycle_last,
fixing the problem described above.
2. At the first timestamp_in_interval() call site, cycles == now no longer
causes failure.
3. At the second timestamp_in_interval() call site, history_begin->cycles
== system_counterval.cycles no longer causes failure.
adjust_historical_crosststamp() also works for this corner case,
where partial_history_cycles == total_history_cycles.
These behavioral changes should not cause any problems.
Fixes: 2c756feb18 ("time: Add history to cross timestamp interface supporting slower devices")
Signed-off-by: Peter Hilber <peter.hilber@opensynergy.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20231218073849.35294-3-peter.hilber@opensynergy.com
The affinity setting of interrupt threads happens in the context of the
thread when the thread is woken up by an hard interrupt. As this can be an
arbitrary after changing the affinity, the thread can become runnable on an
isolated CPU and cause isolation disruption.
Avoid this by checking the set affinity request in wait_for_interrupt() and
waking the threads immediately when the affinity is modified.
Note that this is of the most benefit on systems where the interrupt
affinity itself does not need to be deferred to the interrupt handler, but
even where that's not the case, the total dirsuption will be less.
Signed-off-by: Crystal Wood <crwood@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240122235353.15235-1-crwood@redhat.com
Documentation of functions lacks the annotations which are used by
kernel-doc and *.rst to make appearance in rendered documents more
user-friendly.
Use those annotations to improve user-friendliness. While at it prevent
duplication of comments and use a reference instead.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240123164702.55612-3-anna-maria@linutronix.de
Pull probes fix from Masami Hiramatsu:
- tracing/probes: Fix BTF structure member finder to find the members
which are placed after any anonymous union member correctly.
* tag 'probes-fixes-v6.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/probes: Fix to search structure fields correctly
Fix to search a field from the structure which has anonymous union
correctly.
Since the reference `type` pointer was updated in the loop, the search
loop suddenly aborted where it hits an anonymous union. Thus it can not
find the field after the anonymous union. This avoids updating the
cursor `type` pointer in the loop.
Link: https://lore.kernel.org/all/170791694361.389532.10047514554799419688.stgit@devnote2/
Fixes: 302db0f5b3 ("tracing/probes: Add a function to search a member of a struct/union")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Pull workqueue fix from Tejun Heo:
"Just one patch to revert commit ca10d851b9 ("workqueue: Override
implicit ordered attribute in workqueue_apply_unbound_cpumask()").
This commit could break ordering guarantees for ordered workqueues.
The problem that the commit tried to resolve partially - making
ordered workqueues follow unbound cpumask - is fully solved in
wq/for-6.9 branch"
* tag 'wq-for-6.8-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
Revert "workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()"
set_memory_ro(), set_memory_nx(), set_memory_x() and other helpers
can fail and return an error. In that case the memory might not be
protected as expected and the module loading has to be aborted to
avoid security issues.
Check return value of all calls to set_memory_XX() and handle
error if any.
Add a check to not call set_memory_XX() on NULL pointers as some
architectures may not like it allthough numpages is always 0 in that
case. This also avoid a useless call to set_vm_flush_reset_perms().
Link: https://github.com/KSPP/linux/issues/7
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Pull tracing fixes from Steven Rostedt:
- Fix the #ifndef that didn't have the 'CONFIG_' prefix on
HAVE_DYNAMIC_FTRACE_WITH_REGS
The fix to have dynamic trampolines work with x86 broke arm64 as the
config used in the #ifdef was HAVE_DYNAMIC_FTRACE_WITH_REGS and not
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS which removed the fix that the
previous fix was to fix.
- Fix tracing_on state
The code to test if "tracing_on" is set incorrectly used
ring_buffer_record_is_on() which returns false if the ring buffer
isn't able to be written to.
But the ring buffer disable has several bits that disable it. One is
internal disabling which is used for resizing and other modifications
of the ring buffer. But the "tracing_on" user space visible flag
should only report if tracing is actually on and not internally
disabled, as this can cause confusion as writing "1" when it is
disabled will not enable it.
Instead use ring_buffer_record_is_set_on() which shows the user space
visible settings.
- Fix a false positive kmemleak on saved cmdlines
Now that the saved_cmdlines structure is allocated via alloc_page()
and not via kmalloc() it has become invisible to kmemleak. The
allocation done to one of its pointers was flagged as a dangling
allocation leak. Make kmemleak aware of this allocation and free.
- Fix synthetic event dynamic strings
An update that cleaned up the synthetic event code removed the return
value of trace_string(), and had it return zero instead of the
length, causing dynamic strings in the synthetic event to always have
zero size.
- Clean up documentation and header files for seq_buf
* tag 'trace-v6.8-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
seq_buf: Fix kernel documentation
seq_buf: Don't use "proxy" headers
tracing/synthetic: Fix trace_string() return value
tracing: Inform kmemleak of saved_cmdlines allocation
tracing: Use ring_buffer_record_is_set_on() in tracer_tracing_is_on()
tracing: Fix HAVE_DYNAMIC_FTRACE_WITH_REGS ifdef
2f34d7337d ("workqueue: Fix queue_work_on() with BH workqueues") added
irq_work usage to workqueue; however, it turns out irq_work is actually
optional and the change breaks build on configuration which doesn't have
CONFIG_IRQ_WORK enabled.
Fix build by making workqueue use irq_work only when CONFIG_SMP and enabling
CONFIG_IRQ_WORK when CONFIG_SMP is set. It's reasonable to argue that it may
be better to just always enable it. However, this still saves a small bit of
memory for tiny UP configs and also the least amount of change, so, for now,
let's keep it conditional.
Verified to do the right thing for x86_64 allnoconfig and defconfig, and
aarch64 allnoconfig, allnoconfig + prink disable (SMP but nothing selects
IRQ_WORK) and a modified aarch64 Kconfig where !SMP and nothing selects
IRQ_WORK.
v2: `depends on SMP` leads to Kconfig warnings when CONFIG_IRQ_WORK is
selected by something else when !CONFIG_SMP. Use `def_bool y if SMP`
instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Fixes: 2f34d7337d ("workqueue: Fix queue_work_on() with BH workqueues")
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
The debug.config file is really great to easily enable a bunch of
general debugging features on a CI-like setup. But it would be great to
also include core networking debugging config.
A few CI's validating features from the Net tree also enable a few other
debugging options on top of debug.config. A small selection is quite
generic for the whole net tree. They validate some assumptions in
different parts of the core net tree. As suggested by Jakub Kicinski in
[1], having them added to this debug.config file would help other CIs
using network features to find bugs in this area.
Note that the two REFCNT configs also select REF_TRACKER, which doesn't
seem to be an issue.
Link: https://lore.kernel.org/netdev/20240202093148.33bd2b14@kernel.org/T/ [1]
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20240212-kconfig-debug-enable-net-v1-1-fb026de8174c@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With latest llvm19, I hit the following selftest failures with
$ ./test_progs -j
libbpf: prog 'on_event': BPF program load failed: Permission denied
libbpf: prog 'on_event': -- BEGIN PROG LOAD LOG --
combined stack size of 4 calls is 544. Too large
verification time 1344153 usec
stack depth 24+440+0+32
processed 51008 insns (limit 1000000) max_states_per_insn 19 total_states 1467 peak_states 303 mark_read 146
-- END PROG LOAD LOG --
libbpf: prog 'on_event': failed to load: -13
libbpf: failed to load object 'strobemeta_subprogs.bpf.o'
scale_test:FAIL:expect_success unexpected error: -13 (errno 13)
#498 verif_scale_strobemeta_subprogs:FAIL
The verifier complains too big of the combined stack size (544 bytes) which
exceeds the maximum stack limit 512. This is a regression from llvm19 ([1]).
In the above error log, the original stack depth is 24+440+0+32.
To satisfy interpreter's need, in verifier the stack depth is adjusted to
32+448+32+32=544 which exceeds 512, hence the error. The same adjusted
stack size is also used for jit case.
But the jitted codes could use smaller stack size.
$ egrep -r stack_depth | grep round_up
arm64/net/bpf_jit_comp.c: ctx->stack_size = round_up(prog->aux->stack_depth, 16);
loongarch/net/bpf_jit.c: bpf_stack_adjust = round_up(ctx->prog->aux->stack_depth, 16);
powerpc/net/bpf_jit_comp.c: cgctx.stack_size = round_up(fp->aux->stack_depth, 16);
riscv/net/bpf_jit_comp32.c: round_up(ctx->prog->aux->stack_depth, STACK_ALIGN);
riscv/net/bpf_jit_comp64.c: bpf_stack_adjust = round_up(ctx->prog->aux->stack_depth, 16);
s390/net/bpf_jit_comp.c: u32 stack_depth = round_up(fp->aux->stack_depth, 8);
sparc/net/bpf_jit_comp_64.c: stack_needed += round_up(stack_depth, 16);
x86/net/bpf_jit_comp.c: EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
x86/net/bpf_jit_comp.c: int tcc_off = -4 - round_up(stack_depth, 8);
x86/net/bpf_jit_comp.c: round_up(stack_depth, 8));
x86/net/bpf_jit_comp.c: int tcc_off = -4 - round_up(stack_depth, 8);
x86/net/bpf_jit_comp.c: EMIT3_off32(0x48, 0x81, 0xC4, round_up(stack_depth, 8));
In the above, STACK_ALIGN in riscv/net/bpf_jit_comp32.c is defined as 16.
So stack is aligned in either 8 or 16, x86/s390 having 8-byte stack alignment and
the rest having 16-byte alignment.
This patch calculates total stack depth based on 16-byte alignment if jit is requested.
For the above failing case, the new stack size will be 32+448+0+32=512 and no verification
failure. llvm19 regression will be discussed separately in llvm upstream.
The verifier change caused three test failures as these tests compared messages
with stack size. More specifically,
- test_global_funcs/global_func1: fail with interpreter mode and success with jit mode.
Adjusted stack sizes so both jit and interpreter modes will fail.
- async_stack_depth/{pseudo_call_check, async_call_root_check}: since jit and interpreter
will calculate different stack sizes, the failure msg is adjusted to omit those
specific stack size numbers.
[1] https://lore.kernel.org/bpf/32bde0f0-1881-46c9-931a-673be566c61d@linux.dev/
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20240214232951.4113094-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Verifier log avoids printing the same source code line multiple times
when a consecutive block of BPF assembly instructions are covered by the
same original (C) source code line. This greatly improves verifier log
legibility.
Unfortunately, this check is imperfect and in production applications it
quite often happens that verifier log will have multiple duplicated
source lines emitted, for no apparently good reason. E.g., this is
excerpt from a real-world BPF application (with register states omitted
for clarity):
BEFORE
======
; for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { @ strobemeta_probe.bpf.c:394
5369: (07) r8 += 2 ;
5370: (07) r7 += 16 ;
; for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { @ strobemeta_probe.bpf.c:394
5371: (07) r9 += 1 ;
5372: (79) r4 = *(u64 *)(r10 -32) ;
; for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { @ strobemeta_probe.bpf.c:394
5373: (55) if r9 != 0xf goto pc+2
; if (i >= map->cnt) @ strobemeta_probe.bpf.c:396
5376: (79) r1 = *(u64 *)(r10 -40) ;
5377: (79) r1 = *(u64 *)(r1 +8) ;
; if (i >= map->cnt) @ strobemeta_probe.bpf.c:396
5378: (dd) if r1 s<= r9 goto pc-5 ;
; descr->key_lens[i] = 0; @ strobemeta_probe.bpf.c:398
5379: (b4) w1 = 0 ;
5380: (6b) *(u16 *)(r8 -30) = r1 ;
; task, data, off, STROBE_MAX_STR_LEN, map->entries[i].key); @ strobemeta_probe.bpf.c:400
5381: (79) r3 = *(u64 *)(r7 -8) ;
5382: (7b) *(u64 *)(r10 -24) = r6 ;
; task, data, off, STROBE_MAX_STR_LEN, map->entries[i].key); @ strobemeta_probe.bpf.c:400
5383: (bc) w6 = w6 ;
; barrier_var(payload_off); @ strobemeta_probe.bpf.c:280
5384: (bf) r2 = r6 ;
5385: (bf) r1 = r4 ;
As can be seen, line 394 is emitted thrice, 396 is emitted twice, and
line 400 is duplicated as well. Note that there are no intermingling
other lines of source code in between these duplicates, so the issue is
not compiler reordering assembly instruction such that multiple original
source code lines are in effect.
It becomes more obvious what's going on if we look at *full* original line info
information (using btfdump for this, [0]):
#2764: line: insn #5363 --> 394:3 @ ./././strobemeta_probe.bpf.c
for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) {
#2765: line: insn #5373 --> 394:21 @ ./././strobemeta_probe.bpf.c
for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) {
#2766: line: insn #5375 --> 394:47 @ ./././strobemeta_probe.bpf.c
for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) {
#2767: line: insn #5377 --> 394:3 @ ./././strobemeta_probe.bpf.c
for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) {
#2768: line: insn #5378 --> 414:10 @ ./././strobemeta_probe.bpf.c
return off;
We can see that there are four line info records covering
instructions #5363 through #5377 (instruction indices are shifted due to
subprog instruction being appended to main program), all of them are
pointing to the same C source code line #394. But each of them points to
a different part of that line, which is denoted by differing column
numbers (3, 21, 47, 3).
But verifier log doesn't distinguish between parts of the same source code line
and doesn't emit this column number information, so for end user it's just a
repetitive visual noise. So let's improve the detection of repeated source code
line and avoid this.
With the changes in this patch, we get this output for the same piece of BPF
program log:
AFTER
=====
; for (int i = 0; i < STROBE_MAX_MAP_ENTRIES; ++i) { @ strobemeta_probe.bpf.c:394
5369: (07) r8 += 2 ;
5370: (07) r7 += 16 ;
5371: (07) r9 += 1 ;
5372: (79) r4 = *(u64 *)(r10 -32) ;
5373: (55) if r9 != 0xf goto pc+2
; if (i >= map->cnt) @ strobemeta_probe.bpf.c:396
5376: (79) r1 = *(u64 *)(r10 -40) ;
5377: (79) r1 = *(u64 *)(r1 +8) ;
5378: (dd) if r1 s<= r9 goto pc-5 ;
; descr->key_lens[i] = 0; @ strobemeta_probe.bpf.c:398
5379: (b4) w1 = 0 ;
5380: (6b) *(u16 *)(r8 -30) = r1 ;
; task, data, off, STROBE_MAX_STR_LEN, map->entries[i].key); @ strobemeta_probe.bpf.c:400
5381: (79) r3 = *(u64 *)(r7 -8) ;
5382: (7b) *(u64 *)(r10 -24) = r6 ;
5383: (bc) w6 = w6 ;
; barrier_var(payload_off); @ strobemeta_probe.bpf.c:280
5384: (bf) r2 = r6 ;
5385: (bf) r1 = r4 ;
All the duplication is gone and the log is cleaner and less distracting.
[0] https://github.com/anakryiko/btfdump
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240214174100.2847419-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
To support wire to MSI bridges proper in the MSI core infrastructure it is
required to have separate allocation/free interfaces which can be invoked
from the regular irqdomain allocaton/free functions.
The mechanism for allocation is:
- Allocate the next free MSI descriptor index in the domain
- Store the hardware interrupt number and the trigger type
which was extracted by the irqdomain core from the firmware spec
in the MSI descriptor device cookie so it can be retrieved by
the underlying interrupt domain and interrupt chip
- Use the regular MSI allocation mechanism for the newly allocated
index which returns a fully initialized Linux interrupt on succes
This works because:
- the domains have a fixed size
- each hardware interrupt is only allocated once
- the underlying domain does not care about the MSI index it only cares
about the hardware interrupt number and the trigger type
The free function looks up the MSI index in the MSI descriptor of the
provided Linux interrupt number and uses the regular index based free
functions of the MSI core.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20240127161753.114685-12-apatel@ventanamicro.com