Pull arm64 fix from Catalin Marinas:
"Fix a sparse warning in the arm64 signal code dealing with the user
shadow stack register, GCSPR_EL0"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64/signal: Silence sparse warning storing GCSPR_EL0
Pull hwmon fixes from Guenter Roeck:
- Fix reporting of negative temperature, current, and voltage values in
the tmp513 driver
* tag 'hwmon-for-v6.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (tmp513) Fix interpretation of values of Temperature Result and Limit Registers
hwmon: (tmp513) Fix Current Register value interpretation
hwmon: (tmp513) Fix interpretation of values of Shunt Voltage and Limit Registers
Pull block fixes from Jens Axboe:
- Minor cleanups for bdev/nvme using the helpers introduced
- Revert of a deadlock fix that still needs more work
- Fix a UAF of hctx in the cpu hotplug code
* tag 'block-6.13-20241220' of git://git.kernel.dk/linux:
block: avoid to reuse `hctx` not removed from cpuhp callback list
block: Revert "block: Fix potential deadlock while freezing queue and acquiring sysfs_lock"
nvme: use blk_validate_block_size() for max LBA check
block/bdev: use helper for max block size check
Pull io_uring fixes from Jens Axboe:
- Fix for a file ref leak for registered ring fds
- Turn the ->timeout_lock into a raw spinlock, as it nests under the
io-wq lock which is a raw spinlock as it's called from the scheduler
side
- Limit ring resizing to DEFER_TASKRUN for now. We will broaden this in
the future, but for now, ensure that it's only feasible on rings with
a single user
- Add sanity check for io-wq enqueuing
* tag 'io_uring-6.13-20241220' of git://git.kernel.dk/linux:
io_uring: check if iowq is killed before queuing
io_uring/register: limit ring resizing to DEFER_TASKRUN
io_uring: Fix registered ring file refcount leak
io_uring: make ctx->timeout_lock a raw spinlock
Pull USB / Thunderbolt fixes from Greg KH:
"Here are some important, and small, fixes for USB and Thunderbolt
issues that have come up in the -rc releases. And some new device ids
for good measure. Included in here are:
- Much reported xhci bugfix for usb-storage devices (and other
devices as well, tripped me up on a video camera)
- thunderbolt fixes for some small reported issues
- new usb-serial device ids
All of these have been in linux-next this week with no reported issues"
* tag 'usb-6.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: xhci: fix ring expansion regression in 6.13-rc1
xhci: Turn NEC specific quirk for handling Stop Endpoint errors generic
thunderbolt: Improve redrive mode handling
USB: serial: option: add Telit FE910C04 rmnet compositions
USB: serial: option: add MediaTek T7XX compositions
USB: serial: option: add Netprisma LCUK54 modules for WWAN Ready
USB: serial: option: add MeiG Smart SLM770A
USB: serial: option: add TCL IK512 MBIM & ECM
thunderbolt: Don't display nvm_version unless upgrade supported
thunderbolt: Add support for Intel Panther Lake-M/P
Pull spi fix from Mark Brown:
"A fix for the remove path of the Rockchip driver, the code was just
clearly and obviously wrong"
* tag 'spi-fix-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: rockchip-sfc: Fix error in remove progress
Pull regulator fix from Mark Brown:
"The recently added regulator-uv-survival-time-ms property was renamed
during the review of the series that added it, but unfortunately only
in the DT binding and not in the code that parses the binding.
This brings the code in line with the binding, if someone started
using the original name we can add compat support for it but there's
nothing upstream yet and it's a very niche feature so hopefully not"
* tag 'regulator-fix-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: rename regulator-uv-survival-time-ms according to DT binding
Pull drm fixes from Dave Airlie:
"Probably the last pull before Christmas holidays, I'll still be around
for most of the time anyways, nothing too major in here, bunch of
amdgpu and i915 along with a smattering of fixes across the board.
core:
- fix FB dependency
- avoid div by 0 more in vrefresh
- maintainers update
display:
- fix DP tunnel error path
dma-buf:
- fix !DEBUG_FS
sched:
- docs warning fix
panel:
- collection of misc panel fixes
i915:
- Reset engine utilization buffer before registration
- Ensure busyness counter increases motonically
- Accumulate active runtime on gt reset
amdgpu:
- Disable BOCO when CONFIG_HOTPLUG_PCI_PCIE is not enabled
- scheduler job fixes
- IP version check fixes
- devcoredump fix
- GPUVM update fix
- NBIO 2.5 fix
udmabuf:
- fix memory leak on last export
- sealing fixes
ivpu:
- fix NULL pointer
- fix memory leak
- fix WARN"
* tag 'drm-fixes-2024-12-20' of https://gitlab.freedesktop.org/drm/kernel: (33 commits)
drm/sched: Fix drm_sched_fini() docu generation
accel/ivpu: Fix WARN in ivpu_ipc_send_receive_internal()
accel/ivpu: Fix memory leak in ivpu_mmu_reserved_context_init()
accel/ivpu: Fix general protection fault in ivpu_bo_list()
drm/amdgpu/nbio7.0: fix IP version check
drm/amd: Update strapping for NBIO 2.5.0
drm/amdgpu: Handle NULL bo->tbo.resource (again) in amdgpu_vm_bo_update
drm/amdgpu: fix amdgpu_coredump
drm/amdgpu/smu14.0.2: fix IP version check
drm/amdgpu/gfx12: fix IP version check
drm/amdgpu/mmhub4.1: fix IP version check
drm/amdgpu/nbio7.11: fix IP version check
drm/amdgpu/nbio7.7: fix IP version check
drm/amdgpu: don't access invalid sched
drm/amd: Require CONFIG_HOTPLUG_PCI_PCIE for BOCO
drm: rework FB_CORE dependency
drm/fbdev: Select FB_CORE dependency for fbdev on DMA and TTM
fbdev: Fix recursive dependencies wrt BACKLIGHT_CLASS_DEVICE
i915/guc: Accumulate active runtime on gt reset
i915/guc: Ensure busyness counter increases motonically
...
Pull ring-buffer fixes from Steven Rostedt:
- Fix possible overflow of mmapped ring buffer with bad offset
If the mmap() to the ring buffer passes in a start address that is
passed the end of the mmapped file, it is not caught and a
slab-out-of-bounds is triggered.
Add a check to make sure the start address is within the bounds
- Do not use TP_printk() to boot mapped ring buffers
As a boot mapped ring buffer's data may have pointers that map to the
previous boot's memory map, it is unsafe to allow the TP_printk() to
be used to read the boot mapped buffer's events. If a TP_printk()
points to a static string from within the kernel it will not match
the current kernel mapping if KASLR is active, and it can fault.
Have it simply print out the raw fields.
* tag 'trace-ringbuffer-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
trace/ring-buffer: Do not use TP_printk() formatting for boot mapped buffers
ring-buffer: Fix overflow in __rb_map_vma
We are seeing a sparse warning in gcs_restore_signal():
arch/arm64/kernel/signal.c:1054:9: sparse: sparse: cast removes address space '__user' of expression
when storing the final GCSPR_EL0 value back into the register, caused by
the fact that write_sysreg_s() casts the value it writes to a u64 which
sparse sees as discarding the __userness of the pointer.
Avoid this by treating the address as an integer, casting to a pointer only
when using it to write to userspace.
While we're at it also inline gcs_signal_cap_valid() into it's one user
and make equivalent updates to gcs_signal_entry().
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202412082005.OBJ0BbWs-lkp@intel.com/
Signed-off-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241214-arm64-gcs-signal-sparse-v3-1-5e8d18fffc0c@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
task work can be executed after the task has gone through io_uring
termination, whether it's the final task_work run or the fallback path.
In this case, task work will find ->io_wq being already killed and
null'ed, which is a problem if it then tries to forward the request to
io_queue_iowq(). Make io_queue_iowq() fail requests in this case.
Note that it also checks PF_KTHREAD, because the user can first close
a DEFER_TASKRUN ring and shortly after kill the task, in which case
->iowq check would race.
Cc: stable@vger.kernel.org
Fixes: 50c52250e2 ("block: implement async io_uring discard cmd")
Fixes: 773af69121 ("io_uring: always reissue from task_work context")
Reported-by: Will <willsroot@protonmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/63312b4a2c2bb67ad67b857d17a300e1d3b078e8.1734637909.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull networking fixes from Paolo Abeni:
"Including fixes from can and netfilter.
Current release - regressions:
- rtnetlink: try the outer netns attribute in rtnl_get_peer_net()
- rust: net::phy fix module autoloading
Current release - new code bugs:
- phy: avoid undefined behavior in *_led_polarity_set()
- eth: octeontx2-pf: fix netdev memory leak in rvu_rep_create()
Previous releases - regressions:
- smc: check sndbuf_space again after NOSPACE flag is set in smc_poll
- ipvs: fix clamp() of ip_vs_conn_tab on small memory systems
- dsa: restore dsa_software_vlan_untag() ability to operate on
VLAN-untagged traffic
- eth:
- tun: fix tun_napi_alloc_frags()
- ionic: no double destroy workqueue
- idpf: trigger SW interrupt when exiting wb_on_itr mode
- rswitch: rework ts tags management
- team: fix feature exposure when no ports are present
Previous releases - always broken:
- core: fix repeated netlink messages in queue dump
- mdiobus: fix an OF node reference leak
- smc: check iparea_offset and ipv6_prefixes_cnt when receiving
proposal msg
- can: fix missed interrupts with m_can_pci
- eth: oa_tc6: fix infinite loop error when tx credits becomes 0"
* tag 'net-6.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (45 commits)
net: mctp: handle skb cleanup on sock_queue failures
net: mdiobus: fix an OF node reference leak
octeontx2-pf: fix error handling of devlink port in rvu_rep_create()
octeontx2-pf: fix netdev memory leak in rvu_rep_create()
psample: adjust size if rate_as_probability is set
netdev-genl: avoid empty messages in queue dump
net: dsa: restore dsa_software_vlan_untag() ability to operate on VLAN-untagged traffic
selftests: openvswitch: fix tcpdump execution
net: usb: qmi_wwan: add Quectel RG255C
net: phy: avoid undefined behavior in *_led_polarity_set()
netfilter: ipset: Fix for recursive locking warning
ipvs: Fix clamp() of ip_vs_conn_tab on small memory systems
can: m_can: fix missed interrupts with m_can_pci
can: m_can: set init flag earlier in probe
rtnetlink: Try the outer netns attribute in rtnl_get_peer_net().
net: netdevsim: fix nsim_pp_hold_write()
idpf: trigger SW interrupt when exiting wb_on_itr mode
idpf: add support for SW triggered interrupts
qed: fix possible uninit pointer read in qed_mcp_nvm_info_populate()
net: ethernet: bgmac-platform: fix an OF node reference leak
...
Pull MMC fixes from Ulf Hansson:
- mtk-sd: Cleanup the wakeup configuration in error/remove-path
- sdhci-tegra: Correct quirk for ADMA2 length
* tag 'mmc-v6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: mtk-sd: disable wakeup in .remove() and in the error path of .probe()
mmc: sdhci-tegra: Remove SDHCI_QUIRK_BROKEN_ADMA_ZEROLEN_DESC quirk
Pull pwm fix from Uwe Kleine-König:
"Fix regression in pwm-stm32 driver when converting to new waveform
support
Fabrice Gasnier found and fixed a regression I introduced with
v6.13-rc1 when converting the stm32 pwm driver to support the new
waveform stuff. On some hardware variants this completely broke the
driver"
* tag 'pwm/for-6.13-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ukleinek/linux:
pwm: stm32: Fix complementary output in round_waveform_tohw()
Pull smb server fixes from Steve French:
- Two fixes for better handling maximum outstanding requests
- Fix simultaneous negotiate protocol race
* tag 'v6.13-rc3-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: conn lock to serialize smb2 negotiate
ksmbd: fix broken transfers when exceeding max simultaneous operations
ksmbd: count all requests in req_running counter
With DEFER_TASKRUN, we know the ring can't be both waited upon and
resized at the same time. This is important for CQ resizing. Allowing SQ
ring resizing is more trivial, but isn't the interesting use case. Hence
limit ring resizing in general to DEFER_TASKRUN only for now. This isn't
a huge problem as CQ ring resizing is generally the most useful on
networking type of workloads where it can be hard to size the ring
appropriately upfront, and those should be using DEFER_TASKRUN for
better performance.
Fixes: 79cfe9e59c ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Mika writes:
thunderbolt: Fixes for v6.13-rc4
This includes following USB4/Thunderbolt fixes for v6.13-rc4:
- Add Intel Panther Lake PCI IDs
- Do not show nvm_version for retimers that are not supported
- Fix redrive mode handling.
All these have been in linux-next with no reported issues.
* tag 'thunderbolt-for-v6.13-rc4' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt:
thunderbolt: Improve redrive mode handling
thunderbolt: Don't display nvm_version unless upgrade supported
thunderbolt: Add support for Intel Panther Lake-M/P
Currently, we don't use the return value from sock_queue_rcv_skb, which
means we may leak skbs if a message is not successfully queued to a
socket.
Instead, ensure that we're freeing the skb where the sock hasn't
otherwise taken ownership of the skb by adding checks on the
sock_queue_rcv_skb() to invoke a kfree on failure.
In doing so, rather than using the 'rc' value to trigger the
kfree_skb(), use the skb pointer itself, which is more explicit.
Also, add a kunit test for the sock delivery failure cases.
Fixes: 4a992bbd36 ("mctp: Implement message fragmentation & reassembly")
Cc: stable@vger.kernel.org
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20241218-mctp-next-v2-1-1c1729645eaa@codeconstruct.com.au
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
fwnode_find_mii_timestamper() calls of_parse_phandle_with_fixed_args()
but does not decrement the refcount of the obtained OF node. Add an
of_node_put() call before returning from the function.
This bug was detected by an experimental static analysis tool that I am
developing.
Fixes: bc1bee3b87 ("net: mdiobus: Introduce fwnode_mdiobus_register_phy()")
Signed-off-by: Joe Hattori <joe@pf.is.s.u-tokyo.ac.jp>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241218035106.1436405-1-joe@pf.is.s.u-tokyo.ac.jp
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following series contains two fixes for Netfilter/IPVS:
1) Possible build failure in IPVS on systems with less than 512MB
memory due to incorrect use of clamp(), from David Laight.
2) Fix bogus lockdep nesting splat with ipset list:set type,
from Phil Sutter.
netfilter pull request 24-12-19
* tag 'nf-24-12-19' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: ipset: Fix for recursive locking warning
ipvs: Fix clamp() of ip_vs_conn_tab on small memory systems
====================
Link: https://patch.msgid.link/20241218234137.1687288-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Robert Hodaszi reports that locally terminated traffic towards
VLAN-unaware bridge ports is broken with ocelot-8021q. He is describing
the same symptoms as for commit 1f9fc48fd3 ("net: dsa: sja1105: fix
reception from VLAN-unaware bridges").
For context, the set merged as "VLAN fixes for Ocelot driver":
https://lore.kernel.org/netdev/20240815000707.2006121-1-vladimir.oltean@nxp.com/
was developed in a slightly different form earlier this year, in January.
Initially, the switch was unconditionally configured to set OCELOT_ES0_TAG
when using ocelot-8021q, regardless of port operating mode.
This led to the situation where VLAN-unaware bridge ports would always
push their PVID - see ocelot_vlan_unaware_pvid() - a negligible value
anyway - into RX packets. To strip this in software, we would have needed
DSA to know what private VID the switch chose for VLAN-unaware bridge
ports, and pushed into the packets. This was implemented downstream, and
a remnant of it remains in the form of a comment mentioning
ds->ops->get_private_vid(), as something which would maybe need to be
considered in the future.
However, for upstream, it was deemed inappropriate, because it would
mean introducing yet another behavior for stripping VLAN tags from
VLAN-unaware bridge ports, when one already existed (ds->untag_bridge_pvid).
The latter has been marked as obsolete along with an explanation why it
is logically broken, but still, it would have been confusing.
So, for upstream, felix_update_tag_8021q_rx_rule() was developed, which
essentially changed the state of affairs from "Felix with ocelot-8021q
delivers all packets as VLAN-tagged towards the CPU" into "Felix with
ocelot-8021q delivers all packets from VLAN-aware bridge ports towards
the CPU". This was done on the premise that in VLAN-unaware mode,
there's nothing useful in the VLAN tags, and we can avoid introducing
ds->ops->get_private_vid() in the DSA receive path if we configure the
switch to not push those VLAN tags into packets in the first place.
Unfortunately, and this is when the trainwreck started, the selftests
developed initially and posted with the series were not re-ran.
dsa_software_vlan_untag() was initially written given the assumption
that users of this feature would send _all_ traffic as VLAN-tagged.
It was only partially adapted to the new scheme, by removing
ds->ops->get_private_vid(), which also used to be necessary in
standalone ports mode.
Where the trainwreck became even worse is that I had a second opportunity
to think about this, when the dsa_software_vlan_untag() logic change
initially broke sja1105, in commit 1f9fc48fd3 ("net: dsa: sja1105: fix
reception from VLAN-unaware bridges"). I did not connect the dots that
it also breaks ocelot-8021q, for pretty much the same reason that not
all received packets will be VLAN-tagged.
To be compatible with the optimized Felix control path which runs
felix_update_tag_8021q_rx_rule() to only push VLAN tags when useful (in
VLAN-aware mode), we need to restore the old dsa_software_vlan_untag()
logic. The blamed commit introduced the assumption that
dsa_software_vlan_untag() will see only VLAN-tagged packets, assumption
which is false. What corrupts RX traffic is the fact that we call
skb_vlan_untag() on packets which are not VLAN-tagged in the first
place.
Fixes: 93e4649efa ("net: dsa: provide a software untagging function on RX for VLAN-aware bridges")
Reported-by: Robert Hodaszi <robert.hodaszi@digi.com>
Closes: https://lore.kernel.org/netdev/20241215163334.615427-1-robert.hodaszi@digi.com/
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241216135059.1258266-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Tony Nguyen says:
====================
idpf: trigger SW interrupt when exiting wb_on_itr mode
Joshua Hay says:
This patch series introduces SW triggered interrupt support for idpf,
then uses said interrupt to fix a race condition between completion
writebacks and re-enabling interrupts.
* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
idpf: trigger SW interrupt when exiting wb_on_itr mode
idpf: add support for SW triggered interrupts
====================
Link: https://patch.msgid.link/20241217225715.4005644-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix the way tcpdump is executed by:
- Using the right variable for the namespace. Currently the use of the
empty "ns" makes the command fail.
- Waiting until it starts to capture to ensure the interesting traffic
is caught on slow systems.
- Using line-buffered output to ensure logs are available when the test
is paused with "-p". Otherwise the last chunk of data might only be
written when tcpdump is killed.
Fixes: 74cc26f416 ("selftests: openvswitch: add interface support")
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://patch.msgid.link/20241217211652.483016-1-amorenoz@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Marc Kleine-Budde says:
====================
pull-request: can 2024-12-18
There are 2 patches by Matthias Schiffer for the m_can_pci driver that
handles the m_can cores found on the Intel Elkhart Lake processor.
They fix the initialization and the interrupt handling under high CAN
bus load.
* tag 'linux-can-fixes-for-6.13-20241218' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: m_can: fix missed interrupts with m_can_pci
can: m_can: set init flag earlier in probe
====================
Link: https://patch.msgid.link/20241218121722.2311963-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, io_uring_unreg_ringfd() (which cleans up registered rings) is
only called on exit, but __io_uring_free (which frees the tctx in which the
registered ring pointers are stored) is also called on execve (via
begin_new_exec -> io_uring_task_cancel -> __io_uring_cancel ->
io_uring_cancel_generic -> __io_uring_free).
This means: A process going through execve while having registered rings
will leak references to the rings' `struct file`.
Fix it by zapping registered rings on execve(). This is implemented by
moving the io_uring_unreg_ringfd() from io_uring_files_cancel() into its
callee __io_uring_cancel(), which is called from io_uring_task_cancel() on
execve.
This could probably be exploited *on 32-bit kernels* by leaking 2^32
references to the same ring, because the file refcount is stored in a
pointer-sized field and get_file() doesn't have protection against
refcount overflow, just a WARN_ONCE(); but on 64-bit it should have no
impact beyond a memory leak.
Cc: stable@vger.kernel.org
Fixes: e7a6c00dc7 ("io_uring: add support for registering ring file descriptors")
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20241218-uring-reg-ring-cleanup-v1-1-8f63e999045b@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
gcc runs into undefined behavior at the end of the three led_polarity_set()
callback functions if it were called with a zero 'modes' argument and it
just ends the function there without returning from it.
This gets flagged by 'objtool' as a function that continues on
to the next one:
drivers/net/phy/aquantia/aquantia_leds.o: warning: objtool: aqr_phy_led_polarity_set+0xf: can't find jump dest instruction at .text+0x5d9
drivers/net/phy/intel-xway.o: warning: objtool: xway_gphy_led_polarity_set() falls through to next function xway_gphy_config_init()
drivers/net/phy/mxl-gpy.o: warning: objtool: gpy_led_polarity_set() falls through to next function gpy_led_hw_control_get()
There is no point to micro-optimize the behavior here to save a single-digit
number of bytes in the kernel, so just change this to a "return -EINVAL"
as we do when any unexpected bits are set.
Fixes: 1758af47b9 ("net: phy: intel-xway: add support for PHY LEDs")
Fixes: 9d55e68b19 ("net: phy: aquantia: correctly describe LED polarity override")
Fixes: eb89c79c1b ("net: phy: mxl-gpy: correctly describe LED polarity")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20241217081056.238792-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With CONFIG_PROVE_LOCKING, when creating a set of type bitmap:ip, adding
it to a set of type list:set and populating it from iptables SET target
triggers a kernel warning:
| WARNING: possible recursive locking detected
| 6.12.0-rc7-01692-g5e9a28f41134-dirty #594 Not tainted
| --------------------------------------------
| ping/4018 is trying to acquire lock:
| ffff8881094a6848 (&set->lock){+.-.}-{2:2}, at: ip_set_add+0x28c/0x360 [ip_set]
|
| but task is already holding lock:
| ffff88811034c048 (&set->lock){+.-.}-{2:2}, at: ip_set_add+0x28c/0x360 [ip_set]
This is a false alarm: ipset does not allow nested list:set type, so the
loop in list_set_kadd() can never encounter the outer set itself. No
other set type supports embedded sets, so this is the only case to
consider.
To avoid the false report, create a distinct lock class for list:set
type ipset locks.
Fixes: f830837f0e ("netfilter: ipset: list:set set type support")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The 'max_avail' value is calculated from the system memory
size using order_base_2().
order_base_2(x) is defined as '(x) ? fn(x) : 0'.
The compiler generates two copies of the code that follows
and then expands clamp(max, min, PAGE_SHIFT - 12) (11 on 32bit).
This triggers a compile-time assert since min is 5.
In reality a system would have to have less than 512MB memory
for the bounds passed to clamp to be reversed.
Swap the order of the arguments to clamp() to avoid the warning.
Replace the clamp_val() on the line below with clamp().
clamp_val() is just 'an accident waiting to happen' and not needed here.
Detected by compile time checks added to clamp(), specifically:
minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp()
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Closes: https://lore.kernel.org/all/CA+G9fYsT34UkGFKxus63H6UVpYi5GRZkezT9MRLfAbM3f6ke0g@mail.gmail.com/
Fixes: 4f325e2627 ("ipvs: dynamically limit the connection hash table")
Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Reviewed-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: David Laight <david.laight@aculab.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull btrfs fixes from David Sterba:
- tree-checker catches invalid number of inline extent references
- zoned mode fixes:
- enhance zone append IO command so it also detects emulated writes
- handle bio splitting at sectorsize boundary
- when deleting a snapshot, fix a condition for visiting nodes in reloc
trees
* tag 'for-6.13-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: tree-checker: reject inline extent items with 0 ref count
btrfs: split bios to the fs sector size boundary
btrfs: use bio_is_zone_append() in the completion handler
btrfs: fix improper generation check in snapshot delete
Pull cxl fixes from Ira Weiny:
- prevent probe failure when non-critical RAS unmasking fails
- fix CXL 1.1 link status sysfs attribute
- fix 4 way (and greater) switch interleave region creation
* tag 'cxl-fixes-6.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl/region: Fix region creation for greater than x2 switches
cxl/pci: Check dport->regs.rcd_pcie_cap availability before accessing
cxl/pci: Fix potential bogus return value upon successful probing
Pull selinux fix from Paul Moore:
"One small SELinux patch to get rid improve our handling of unknown
extended permissions by safely ignoring them.
Not only does this make it easier to support newer SELinux policy
on older kernels in the future, it removes to BUG() calls from the
SELinux code."
* tag 'selinux-pr-20241217' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: ignore unknown extended permissions
The TP_printk() of a TRACE_EVENT() is a generic printf format that any
developer can create for their event. It may include pointers to strings
and such. A boot mapped buffer may contain data from a previous kernel
where the strings addresses are different.
One solution is to copy the event content and update the pointers by the
recorded delta, but a simpler solution (for now) is to just use the
print_fields() function to print these events. The print_fields() function
just iterates the fields and prints them according to what type they are,
and ignores the TP_printk() format from the event itself.
To understand the difference, when printing via TP_printk() the output
looks like this:
4582.696626: kmem_cache_alloc: call_site=getname_flags+0x47/0x1f0 ptr=00000000e70e10e0 bytes_req=4096 bytes_alloc=4096 gfp_flags=GFP_KERNEL node=-1 accounted=false
4582.696629: kmem_cache_alloc: call_site=alloc_empty_file+0x6b/0x110 ptr=0000000095808002 bytes_req=360 bytes_alloc=384 gfp_flags=GFP_KERNEL node=-1 accounted=false
4582.696630: kmem_cache_alloc: call_site=security_file_alloc+0x24/0x100 ptr=00000000576339c3 bytes_req=16 bytes_alloc=16 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false
4582.696653: kmem_cache_free: call_site=do_sys_openat2+0xa7/0xd0 ptr=00000000e70e10e0 name=names_cache
But when printing via print_fields() (echo 1 > /sys/kernel/tracing/options/fields)
the same event output looks like this:
4582.696626: kmem_cache_alloc: call_site=0xffffffff92d10d97 (-1831793257) ptr=0xffff9e0e8571e000 (-107689771147264) bytes_req=0x1000 (4096) bytes_alloc=0x1000 (4096) gfp_flags=0xcc0 (3264) node=0xffffffff (-1) accounted=(0)
4582.696629: kmem_cache_alloc: call_site=0xffffffff92d0250b (-1831852789) ptr=0xffff9e0e8577f800 (-107689770747904) bytes_req=0x168 (360) bytes_alloc=0x180 (384) gfp_flags=0xcc0 (3264) node=0xffffffff (-1) accounted=(0)
4582.696630: kmem_cache_alloc: call_site=0xffffffff92efca74 (-1829778828) ptr=0xffff9e0e8d35d3b0 (-107689640864848) bytes_req=0x10 (16) bytes_alloc=0x10 (16) gfp_flags=0xdc0 (3520) node=0xffffffff (-1) accounted=(0)
4582.696653: kmem_cache_free: call_site=0xffffffff92cfbea7 (-1831879001) ptr=0xffff9e0e8571e000 (-107689771147264) name=names_cache
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20241218141507.28389a1d@gandalf.local.home
Fixes: 07714b4bb3 ("tracing: Handle old buffer mappings for event strings and functions")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull tracing fixes from Steven Rostedt:
"Replace trace_check_vprintf() with test_event_printk() and
ignore_event()
The function test_event_printk() checks on boot up if the trace event
printf() formats dereference any pointers, and if they do, it then
looks at the arguments to make sure that the pointers they dereference
will exist in the event on the ring buffer. If they do not, it issues
a WARN_ON() as it is a likely bug.
But this isn't the case for the strings that can be dereferenced with
"%s", as some trace events (notably RCU and some IPI events) save a
pointer to a static string in the ring buffer. As the string it points
to lives as long as the kernel is running, it is not a bug to
reference it, as it is guaranteed to be there when the event is read.
But it is also possible (and a common bug) to point to some allocated
string that could be freed before the trace event is read and the
dereference is to bad memory. This case requires a run time check.
The previous way to handle this was with trace_check_vprintf() that
would process the printf format piece by piece and send what it didn't
care about to vsnprintf() to handle arguments that were not strings.
This kept it from having to reimplement vsnprintf(). But it relied on
va_list implementation and for architectures that copied the va_list
and did not pass it by reference, it wasn't even possible to do this
check and it would be skipped. As 64bit x86 passed va_list by
reference, most events were tested and this kept out bugs where
strings would have been dereferenced after being freed.
Instead of relying on the implementation of va_list, extend the boot
up test_event_printk() function to validate all the "%s" strings that
can be validated at boot, and for the few events that point to strings
outside the ring buffer, flag both the event and the field that is
dereferenced as "needs_test". Then before the event is printed, a call
to ignore_event() is made, and if the event has the flag set, it
iterates all its fields and for every field that is to be tested, it
will read the pointer directly from the event in the ring buffer and
make sure that it is valid. If the pointer is not valid, it will print
a WARN_ON(), print out to the trace that the event has unsafe memory
and ignore the print format.
With this new update, the trace_check_vprintf() can be safely removed
and now all events can be verified regardless of architecture"
* tag 'trace-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Check "%s" dereference via the field and not the TP_printk format
tracing: Add "%s" check in test_event_printk()
tracing: Add missing helper functions in event pointer dereference check
tracing: Fix test_event_printk() to process entire print argument
The VM pointer might already be outdated when that function is called.
Use the PASID instead to gather the information instead.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 57f812d171af4ba233d3ed7c94dfa5b8e92dcc04)
Cc: stable@vger.kernel.org
Since 2320c9e6a7 ("drm/sched: memset() 'job' in drm_sched_job_init()")
accessing job->base.sched can produce unexpected results as the initialisation
of (*job)->base.sched done in amdgpu_job_alloc is overwritten by the
memset.
This commit fixes an issue when a CS would fail validation and would
be rejected after job->num_ibs is incremented. In this case,
amdgpu_ib_free(ring->adev, ...) will be called, which would crash the
machine because the ring value is bogus.
To fix this, pass a NULL pointer to amdgpu_ib_free(): we can do this
because the device is actually not used in this function.
The next commit will remove the ring argument completely.
Fixes: 2320c9e6a7 ("drm/sched: memset() 'job' in drm_sched_job_init()")
Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 2ae520cb12831d264ceb97c61f72c59d33c0dbd7)
Pull hyperv fixes from Wei Liu:
- Various fixes to Hyper-V tools in the kernel tree (Dexuan Cui, Olaf
Hering, Vitaly Kuznetsov)
- Fix a bug in the Hyper-V TSC page based sched_clock() (Naman Jain)
- Two bug fixes in the Hyper-V utility functions (Michael Kelley)
- Convert open-coded timeouts to secs_to_jiffies() in Hyper-V drivers
(Easwar Hariharan)
* tag 'hyperv-fixes-signed-20241217' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
tools/hv: reduce resource usage in hv_kvp_daemon
tools/hv: add a .gitignore file
tools/hv: reduce resouce usage in hv_get_dns_info helper
hv/hv_kvp_daemon: Pass NIC name to hv_get_dns_info as well
Drivers: hv: util: Avoid accessing a ringbuffer not initialized yet
Drivers: hv: util: Don't force error code to ENODEV in util_probe()
tools/hv: terminate fcopy daemon if read from uio fails
drivers: hv: Convert open-coded timeouts to secs_to_jiffies()
tools: hv: change permissions of NetworkManager configuration file
x86/hyperv: Fix hv tsc page based sched_clock for hibernation
tools: hv: Fix a complier warning in the fcopy uio daemon
In 32-bit x86 builds CONFIG_STATIC_CALL_INLINE isn't set, leading to
static_call_initialized not being available.
Define it as "0" in that case.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 0ef8047b73 ("x86/static-call: provide a way to do very early static-call updates")
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The interrupt line of PCI devices is interpreted as edge-triggered,
however the interrupt signal of the m_can controller integrated in Intel
Elkhart Lake CPUs appears to be generated level-triggered.
Consider the following sequence of events:
- IR register is read, interrupt X is set
- A new interrupt Y is triggered in the m_can controller
- IR register is written to acknowledge interrupt X. Y remains set in IR
As at no point in this sequence no interrupt flag is set in IR, the
m_can interrupt line will never become deasserted, and no edge will ever
be observed to trigger another run of the ISR. This was observed to
result in the TX queue of the EHL m_can to get stuck under high load,
because frames were queued to the hardware in m_can_start_xmit(), but
m_can_finish_tx() was never run to account for their successful
transmission.
On an Elkhart Lake based board with the two CAN interfaces connected to
each other, the following script can reproduce the issue:
ip link set can0 up type can bitrate 1000000
ip link set can1 up type can bitrate 1000000
cangen can0 -g 2 -I 000 -L 8 &
cangen can0 -g 2 -I 001 -L 8 &
cangen can0 -g 2 -I 002 -L 8 &
cangen can0 -g 2 -I 003 -L 8 &
cangen can0 -g 2 -I 004 -L 8 &
cangen can0 -g 2 -I 005 -L 8 &
cangen can0 -g 2 -I 006 -L 8 &
cangen can0 -g 2 -I 007 -L 8 &
cangen can1 -g 2 -I 100 -L 8 &
cangen can1 -g 2 -I 101 -L 8 &
cangen can1 -g 2 -I 102 -L 8 &
cangen can1 -g 2 -I 103 -L 8 &
cangen can1 -g 2 -I 104 -L 8 &
cangen can1 -g 2 -I 105 -L 8 &
cangen can1 -g 2 -I 106 -L 8 &
cangen can1 -g 2 -I 107 -L 8 &
stress-ng --matrix 0 &
To fix the issue, repeatedly read and acknowledge interrupts at the
start of the ISR until no interrupt flags are set, so the next incoming
interrupt will also result in an edge on the interrupt line.
While we have received a report that even with this patch, the TX queue
can become stuck under certain (currently unknown) circumstances on the
Elkhart Lake, this patch completely fixes the issue with the above
reproducer, and it is unclear whether the remaining issue has a similar
cause at all.
Fixes: cab7ffc032 ("can: m_can: add PCI glue driver for Intel Elkhart Lake")
Signed-off-by: Matthias Schiffer <matthias.schiffer@ew.tq-group.com>
Reviewed-by: Markus Schneider-Pargmann <msp@baylibre.com>
Link: https://patch.msgid.link/fdf0439c51bcb3a46c21e9fb21c7f1d06363be84.1728288535.git.matthias.schiffer@ew.tq-group.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
While an m_can controller usually already has the init flag from a
hardware reset, no such reset happens on the integrated m_can_pci of the
Intel Elkhart Lake. If the CAN controller is found in an active state,
m_can_dev_setup() would fail because m_can_niso_supported() calls
m_can_cccr_update_bits(), which refuses to modify any other configuration
bits when CCCR_INIT is not set.
To avoid this issue, set CCCR_INIT before attempting to modify any other
configuration flags.
Fixes: cd5a46ce6f ("can: m_can: don't enable transceiver when probing")
Signed-off-by: Matthias Schiffer <matthias.schiffer@ew.tq-group.com>
Reviewed-by: Markus Schneider-Pargmann <msp@baylibre.com>
Link: https://patch.msgid.link/e247f331cb72829fcbdfda74f31a59cbad1a6006.1728288535.git.matthias.schiffer@ew.tq-group.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The Hexagon-specific constant extender optimization in LLVM may crash on
Linux kernel code [1], such as fs/bcache/btree_io.c after
commit 32ed4a620c ("bcachefs: Btree path tracepoints") in 6.12:
clang: llvm/lib/Target/Hexagon/HexagonConstExtenders.cpp:745: bool (anonymous namespace)::HexagonConstExtenders::ExtRoot::operator<(const HCE::ExtRoot &) const: Assertion `ThisB->getParent() == OtherB->getParent()' failed.
Stack dump:
0. Program arguments: clang --target=hexagon-linux-musl ... fs/bcachefs/btree_io.c
1. <eof> parser at end of file
2. Code generation
3. Running pass 'Function Pass Manager' on module 'fs/bcachefs/btree_io.c'.
4. Running pass 'Hexagon constant-extender optimization' on function '@__btree_node_lock_nopath'
Without assertions enabled, there is just a hang during compilation.
This has been resolved in LLVM main (20.0.0) [2] and backported to LLVM
19.1.0 but the kernel supports LLVM 13.0.1 and newer, so disable the
constant expander optimization using the '-mllvm' option when using a
toolchain that is not fixed.
Cc: stable@vger.kernel.org
Link: https://github.com/llvm/llvm-project/issues/99714 [1]
Link: 68df06a0b2 [2]
Link: 2ab8d93061 [3]
Reviewed-by: Brian Cain <bcain@quicinc.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a race condition between exiting wb_on_itr and completion write
backs. For example, we are in wb_on_itr mode and a Tx completion is
generated by HW, ready to be written back, as we are re-enabling
interrupts:
HW SW
| |
| | idpf_tx_splitq_clean_all
| | napi_complete_done
| |
| tx_completion_wb | idpf_vport_intr_update_itr_ena_irq
That tx_completion_wb happens before the vector is fully re-enabled.
Continuing with this example, it is a UDP stream and the
tx_completion_wb is the last one in the flow (there are no rx packets).
Because the HW generated the completion before the interrupt is fully
enabled, the HW will not fire the interrupt once the timer expires and
the write back will not happen. NAPI poll won't be called. We have
indicated we're back in interrupt mode but nothing else will trigger the
interrupt. Therefore, the completion goes unprocessed, triggering a Tx
timeout.
To mitigate this, fire a SW triggered interrupt upon exiting wb_on_itr.
This interrupt will catch the rogue completion and avoid the timeout.
Add logic to set the appropriate bits in the vector's dyn_ctl register.
Fixes: 9c4a27da0e ("idpf: enable WB_ON_ITR")
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
SW triggered interrupts are guaranteed to fire after their timer
expires, unlike Tx and Rx interrupts which will only fire after the
timer expires _and_ a descriptor write back is available to be processed
by the driver.
Add the necessary fields, defines, and initializations to enable a SW
triggered interrupt in the vector's dyn_ctl register.
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
[BUG]
There is a bug report in the mailing list where btrfs_run_delayed_refs()
failed to drop the ref count for logical 25870311358464 num_bytes
2113536.
The involved leaf dump looks like this:
item 166 key (25870311358464 168 2113536) itemoff 10091 itemsize 50
extent refs 1 gen 84178 flags 1
ref#0: shared data backref parent 32399126528000 count 0 <<<
ref#1: shared data backref parent 31808973717504 count 1
Notice the count number is 0.
[CAUSE]
There is no concrete evidence yet, but considering 0 -> 1 is also a
single bit flipped, it's possible that hardware memory bitflip is
involved, causing the on-disk extent tree to be corrupted.
[FIX]
To prevent us reading such corrupted extent item, or writing such
damaged extent item back to disk, enhance the handling of
BTRFS_EXTENT_DATA_REF_KEY and BTRFS_SHARED_DATA_REF_KEY keys for both
inlined and key items, to detect such 0 ref count and reject them.
CC: stable@vger.kernel.org # 5.4+
Link: https://lore.kernel.org/linux-btrfs/7c69dd49-c346-4806-86e7-e6f863a66f48@app.fastmail.com/
Reported-by: Frankie Fisher <frankie@terrorise.me.uk>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs like other file systems can't really deal with I/O not aligned to
it's internal block size (which strangely is called sector size in
btrfs, for historical reasons), but the block layer split helper doesn't
even know about that.
Round down the split boundary so that all I/Os are aligned.
Fixes: d5e4377d50 ("btrfs: split zone append bios in btrfs_submit_bio")
CC: stable@vger.kernel.org # 6.12
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Otherwise it won't catch bios turned into regular writes by the block
level zone write plugging. The additional test it adds is for emulated
zone append.
Fixes: 9b1ce7f0c6 ("block: Implement zone append emulation")
CC: stable@vger.kernel.org # 6.12
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have been using the following check
if (generation <= root->root_key.offset)
to make decisions about whether or not to visit a node during snapshot
delete. This is because for normal subvolumes this is set to 0, and for
snapshots it's set to the creation generation. The idea being that if
the generation of the node is less than or equal to our creation
generation then we don't need to visit that node, because it doesn't
belong to us, we can simply drop our reference and move on.
However reloc roots don't have their generation stored in
root->root_key.offset, instead that is the objectid of their
corresponding fs root. This means we can incorrectly not walk into
nodes that need to be dropped when deleting a reloc root.
There are a variety of consequences to making the wrong choice in two
distinct areas.
visit_node_for_delete()
1. False positive. We think we are newer than the block when we really
aren't. We don't visit the node and drop our reference to the node
and carry on. This would result in leaked space.
2. False negative. We do decide to walk down into a block that we
should have just dropped our reference to. However this means that
the child node will have refs > 1, so we will switch to
UPDATE_BACKREF, and then the subsequent walk_down_proc() will notice
that btrfs_header_owner(node) != root->root_key.objectid and it'll
break out of the loop, and then walk_up_proc() will drop our reference,
so this appears to be ok.
do_walk_down()
1. False positive. We are in UPDATE_BACKREF and incorrectly decide that
we are done and don't need to update the backref for our lower nodes.
This is another case that simply won't happen with relocation, as we
only have to do UPDATE_BACKREF if the node below us was shared and
didn't have FULL_BACKREF set, and since we don't own that node
because we're a reloc root we actually won't end up in this case.
2. False negative. Again this is tricky because as described above, we
simply wouldn't be here from relocation, because we don't own any of
the nodes because we never set btrfs_header_owner() to the reloc root
objectid, and we always use FULL_BACKREF, we never actually need to
set FULL_BACKREF on any children.
Having spent a lot of time stressing relocation/snapshot delete recently
I've not seen this pop in practice. But this is objectively incorrect,
so fix this to get the correct starting generation based on the root
we're dropping to keep me from thinking there's a problem here.
CC: stable@vger.kernel.org
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The 'select FB_CORE' statement moved from CONFIG_DRM to DRM_CLIENT_LIB,
but there are now configurations that have code calling into fb_core
as built-in even though the client_lib itself is a loadable module:
x86_64-linux-ld: drivers/gpu/drm/drm_fb_helper.o: in function `drm_fb_helper_set_suspend':
drm_fb_helper.c:(.text+0x2c6): undefined reference to `fb_set_suspend'
x86_64-linux-ld: drivers/gpu/drm/drm_fb_helper.o: in function `drm_fb_helper_resume_worker':
drm_fb_helper.c:(.text+0x2e1): undefined reference to `fb_set_suspend'
In addition to DRM_CLIENT_LIB, the 'select' needs to be at least in
DRM_KMS_HELPER and DRM_GEM_SHMEM_HELPER, so add it here.
This patch is the KMS_HELPER part of [1].
Fixes: dadd28d414 ("drm/client: Add client-lib module")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/series/141411/ # 1
Link: https://patchwork.freedesktop.org/patch/msgid/20241216074450.8590-4-tzimmermann@suse.de
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Pull ftrace fixes from Steven Rostedt:
- Always try to initialize the idle functions when graph tracer starts
A bug was found that when a CPU is offline when graph tracing starts
and then comes online, that CPU is not traced. The fix to that was to
move the initialization of the idle shadow stack over to the hot plug
online logic, which also handle onlined CPUs. The issue was that it
removed the initialization of the shadow stack when graph tracing
starts, but the callbacks to the hot plug logic do nothing if graph
tracing isn't currently running. Although that fix fixed the onlining
of a CPU during tracing, it broke the CPUs that were already online.
- Have microblaze not try to get the "true parent" in function tracing
If function tracing and graph tracing are both enabled at the same
time the parent of the functions traced by the function tracer may
sometimes be the graph tracing trampoline. The graph tracing hijacks
the return pointer of the function to trace it, but that can
interfere with the function tracing parent output.
This was fixed by using the ftrace_graph_ret_addr() function passing
in the kernel stack pointer using the ftrace_regs_get_stack_pointer()
function. But Al Viro reported that Microblaze does not implement the
kernel_stack_pointer(regs) helper function that
ftrace_regs_get_stack_pointer() uses and fails to compile when
function graph tracing is enabled.
It was first thought that this was a microblaze issue, but the real
cause is that this only works when an architecture implements
HAVE_DYNAMIC_FTRACE_WITH_ARGS, as a requirement for that config is to
have ftrace always pass a valid ftrace_regs to the callbacks. That
also means that the architecture supports
ftrace_regs_get_stack_pointer()
Microblaze does not set HAVE_DYNAMIC_FTRACE_WITH_ARGS nor does it
implement ftrace_regs_get_stack_pointer() which caused it to fail to
build. Only implement the "true parent" logic if an architecture has
that config set"
* tag 'ftrace-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Do not find "true_parent" if HAVE_DYNAMIC_FTRACE_WITH_ARGS is not set
fgraph: Still initialize idle shadow stacks when starting
Pull s390 fixes from Alexander Gordeev:
- Fix DirectMap accounting in /proc/meminfo file
- Fix strscpy() return code handling that led to "unsigned 'len' is
never less than zero" warning
- Fix the calculation determining whether to use three- or four-level
paging: account KMSAN modules metadata
* tag 's390-6.13-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/mm: Consider KMSAN modules metadata for paging levels
s390/ipl: Fix never less than zero warning
s390/mm: Fix DirectMap accounting
Do not select BACKLIGHT_CLASS_DEVICE from FB_BACKLIGHT. The latter
only controls backlight support within fbdev core code and data
structures.
Make fbdev drivers depend on BACKLIGHT_CLASS_DEVICE and let users
select it explicitly. Fixes warnings about recursive dependencies,
such as
error: recursive dependency detected!
symbol BACKLIGHT_CLASS_DEVICE is selected by FB_BACKLIGHT
symbol FB_BACKLIGHT is selected by FB_SH_MOBILE_LCDC
symbol FB_SH_MOBILE_LCDC depends on FB_DEVICE
symbol FB_DEVICE depends on FB_CORE
symbol FB_CORE is selected by DRM_GEM_DMA_HELPER
symbol DRM_GEM_DMA_HELPER is selected by DRM_PANEL_ILITEK_ILI9341
symbol DRM_PANEL_ILITEK_ILI9341 depends on BACKLIGHT_CLASS_DEVICE
BACKLIGHT_CLASS_DEVICE is user-selectable, so making drivers adapt to
it is the correct approach in any case. For most drivers, backlight
support is also configurable separately.
v3:
- Select BACKLIGHT_CLASS_DEVICE in PowerMac defconfigs (Christophe)
- Fix PMAC_BACKLIGHT module dependency corner cases (Christophe)
v2:
- s/BACKLIGHT_DEVICE_CLASS/BACKLIGHT_CLASS_DEVICE (Helge)
- Fix fbdev driver-dependency corner case (Arnd)
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20241216074450.8590-2-tzimmermann@suse.de
Pull erofs fixes from Gao Xiang:
"The first one fixes a syzbot UAF report caused by a commit introduced
in this cycle, but it also addresses a longstanding memory leak. The
second one resolves a PSI memstall mis-accounting issue.
The remaining patches switch file-backed mounts to use buffered I/Os
by default instead of direct I/Os, since the page cache of underlay
files is typically valid and maybe even dirty. This change also aligns
with the default policy of loopback devices. A mount option has been
added to try to use direct I/Os explicitly.
Summary:
- Fix (pcluster) memory leak and (sbi) UAF after umounting
- Fix a case of PSI memstall mis-accounting
- Use buffered I/Os by default for file-backed mounts"
* tag 'erofs-for-6.13-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: use buffered I/O for file-backed mounts by default
erofs: reference `struct erofs_device_info` for erofs_map_dev
erofs: use `struct erofs_device_info` for the primary device
erofs: add erofs_sb_free() helper
MAINTAINERS: erofs: update Yue Hu's email address
erofs: fix PSI memstall accounting
erofs: fix rare pcluster memory leak after unmounting
Pull hardening fix from Kees Cook:
"Silence a GCC value-range warning that is being ironically triggered
by bounds checking"
* tag 'hardening-v6.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
fortify: Hide run-time copy size from value range tracking
The TP_printk() portion of a trace event is executed at the time a event
is read from the trace. This can happen seconds, minutes, hours, days,
months, years possibly later since the event was recorded. If the print
format contains a dereference to a string via "%s", and that string was
allocated, there's a chance that string could be freed before it is read
by the trace file.
To protect against such bugs, there are two functions that verify the
event. The first one is test_event_printk(), which is called when the
event is created. It reads the TP_printk() format as well as its arguments
to make sure nothing may be dereferencing a pointer that was not copied
into the ring buffer along with the event. If it is, it will trigger a
WARN_ON().
For strings that use "%s", it is not so easy. The string may not reside in
the ring buffer but may still be valid. Strings that are static and part
of the kernel proper which will not be freed for the life of the running
system, are safe to dereference. But to know if it is a pointer to a
static string or to something on the heap can not be determined until the
event is triggered.
This brings us to the second function that tests for the bad dereferencing
of strings, trace_check_vprintf(). It would walk through the printf format
looking for "%s", and when it finds it, it would validate that the pointer
is safe to read. If not, it would produces a WARN_ON() as well and write
into the ring buffer "[UNSAFE-MEMORY]".
The problem with this is how it used va_list to have vsnprintf() handle
all the cases that it didn't need to check. Instead of re-implementing
vsnprintf(), it would make a copy of the format up to the %s part, and
call vsnprintf() with the current va_list ap variable, where the ap would
then be ready to point at the string in question.
For architectures that passed va_list by reference this was possible. For
architectures that passed it by copy it was not. A test_can_verify()
function was used to differentiate between the two, and if it wasn't
possible, it would disable it.
Even for architectures where this was feasible, it was a stretch to rely
on such a method that is undocumented, and could cause issues later on
with new optimizations of the compiler.
Instead, the first function test_event_printk() was updated to look at
"%s" as well. If the "%s" argument is a pointer outside the event in the
ring buffer, it would find the field type of the event that is the problem
and mark the structure with a new flag called "needs_test". The event
itself will be marked by TRACE_EVENT_FL_TEST_STR to let it be known that
this event has a field that needs to be verified before the event can be
printed using the printf format.
When the event fields are created from the field type structure, the
fields would copy the field type's "needs_test" value.
Finally, before being printed, a new function ignore_event() is called
which will check if the event has the TEST_STR flag set (if not, it
returns false). If the flag is set, it then iterates through the events
fields looking for the ones that have the "needs_test" flag set.
Then it uses the offset field from the field structure to find the pointer
in the ring buffer event. It runs the tests to make sure that pointer is
safe to print and if not, it triggers the WARN_ON() and also adds to the
trace output that the event in question has an unsafe memory access.
The ignore_event() makes the trace_check_vprintf() obsolete so it is
removed.
Link: https://lore.kernel.org/all/CAHk-=wh3uOnqnZPpR0PeLZZtyWbZLboZ7cHLCKRWsocvs9Y7hQ@mail.gmail.com/
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20241217024720.848621576@goodmis.org
Fixes: 5013f454a3 ("tracing: Add check of trace event print fmts for dereferencing pointers")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The test_event_printk() code makes sure that when a trace event is
registered, any dereferenced pointers in from the event's TP_printk() are
pointing to content in the ring buffer. But currently it does not handle
"%s", as there's cases where the string pointer saved in the ring buffer
points to a static string in the kernel that will never be freed. As that
is a valid case, the pointer needs to be checked at runtime.
Currently the runtime check is done via trace_check_vprintf(), but to not
have to replicate everything in vsnprintf() it does some logic with the
va_list that may not be reliable across architectures. In order to get rid
of that logic, more work in the test_event_printk() needs to be done. Some
of the strings can be validated at this time when it is obvious the string
is valid because the string will be saved in the ring buffer content.
Do all the validation of strings in the ring buffer at boot in
test_event_printk(), and make sure that the field of the strings that
point into the kernel are accessible. This will allow adding checks at
runtime that will validate the fields themselves and not rely on paring
the TP_printk() format at runtime.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20241217024720.685917008@goodmis.org
Fixes: 5013f454a3 ("tracing: Add check of trace event print fmts for dereferencing pointers")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The process_pointer() helper function looks to see if various trace event
macros are used. These macros are for storing data in the event. This
makes it safe to dereference as the dereference will then point into the
event on the ring buffer where the content of the data stays with the
event itself.
A few helper functions were missing. Those were:
__get_rel_dynamic_array()
__get_dynamic_array_len()
__get_rel_dynamic_array_len()
__get_rel_sockaddr()
Also add a helper function find_print_string() to not need to use a middle
man variable to test if the string exists.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20241217024720.521836792@goodmis.org
Fixes: 5013f454a3 ("tracing: Add check of trace event print fmts for dereferencing pointers")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The test_event_printk() analyzes print formats of trace events looking for
cases where it may dereference a pointer that is not in the ring buffer
which can possibly be a bug when the trace event is read from the ring
buffer and the content of that pointer no longer exists.
The function needs to accurately go from one print format argument to the
next. It handles quotes and parenthesis that may be included in an
argument. When it finds the start of the next argument, it uses a simple
"c = strstr(fmt + i, ',')" to find the end of that argument!
In order to include "%s" dereferencing, it needs to process the entire
content of the print format argument and not just the content of the first
',' it finds. As there may be content like:
({ const char *saved_ptr = trace_seq_buffer_ptr(p); static const char
*access_str[] = { "---", "--x", "w--", "w-x", "-u-", "-ux", "wu-", "wux"
}; union kvm_mmu_page_role role; role.word = REC->role;
trace_seq_printf(p, "sp gen %u gfn %llx l%u %u-byte q%u%s %s%s" " %snxe
%sad root %u %s%c", REC->mmu_valid_gen, REC->gfn, role.level,
role.has_4_byte_gpte ? 4 : 8, role.quadrant, role.direct ? " direct" : "",
access_str[role.access], role.invalid ? " invalid" : "", role.efer_nx ? ""
: "!", role.ad_disabled ? "!" : "", REC->root_count, REC->unsync ?
"unsync" : "sync", 0); saved_ptr; })
Which is an example of a full argument of an existing event. As the code
already handles finding the next print format argument, process the
argument at the end of it and not the start of it. This way it has both
the start of the argument as well as the end of it.
Add a helper function "process_pointer()" that will do the processing during
the loop as well as at the end. It also makes the code cleaner and easier
to read.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20241217024720.362271189@goodmis.org
Fixes: 5013f454a3 ("tracing: Add check of trace event print fmts for dereferencing pointers")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull xen fixes from Juergen Gross:
"Fix xen netfront crash (XSA-465) and avoid using the hypercall page
that doesn't do speculation mitigations (XSA-466)"
* tag 'xsa465+xsa466-6.13-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: remove hypercall page
x86/xen: use new hypercall functions instead of hypercall page
x86/xen: add central hypercall functions
x86/xen: don't do PV iret hypercall through hypercall page
x86/static-call: provide a way to do very early static-call updates
objtool/x86: allow syscall instruction
x86: make get_cpu_vendor() accessible from Xen code
xen/netfront: fix crash when removing device
Chase reports that their tester complaints about a locking context
mismatch:
=============================
[ BUG: Invalid wait context ]
6.13.0-rc1-gf137f14b7ccb-dirty #9 Not tainted
-----------------------------
syz.1.25198/182604 is trying to lock:
ffff88805e66a358 (&ctx->timeout_lock){-.-.}-{3:3}, at: spin_lock_irq
include/linux/spinlock.h:376 [inline]
ffff88805e66a358 (&ctx->timeout_lock){-.-.}-{3:3}, at:
io_match_task_safe io_uring/io_uring.c:218 [inline]
ffff88805e66a358 (&ctx->timeout_lock){-.-.}-{3:3}, at:
io_match_task_safe+0x187/0x250 io_uring/io_uring.c:204
other info that might help us debug this:
context-{5:5}
1 lock held by syz.1.25198/182604:
#0: ffff88802b7d48c0 (&acct->lock){+.+.}-{2:2}, at:
io_acct_cancel_pending_work+0x2d/0x6b0 io_uring/io-wq.c:1049
stack backtrace:
CPU: 0 UID: 0 PID: 182604 Comm: syz.1.25198 Not tainted
6.13.0-rc1-gf137f14b7ccb-dirty #9
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x82/0xd0 lib/dump_stack.c:120
print_lock_invalid_wait_context kernel/locking/lockdep.c:4826 [inline]
check_wait_context kernel/locking/lockdep.c:4898 [inline]
__lock_acquire+0x883/0x3c80 kernel/locking/lockdep.c:5176
lock_acquire.part.0+0x11b/0x370 kernel/locking/lockdep.c:5849
__raw_spin_lock_irq include/linux/spinlock_api_smp.h:119 [inline]
_raw_spin_lock_irq+0x36/0x50 kernel/locking/spinlock.c:170
spin_lock_irq include/linux/spinlock.h:376 [inline]
io_match_task_safe io_uring/io_uring.c:218 [inline]
io_match_task_safe+0x187/0x250 io_uring/io_uring.c:204
io_acct_cancel_pending_work+0xb8/0x6b0 io_uring/io-wq.c:1052
io_wq_cancel_pending_work io_uring/io-wq.c:1074 [inline]
io_wq_cancel_cb+0xb0/0x390 io_uring/io-wq.c:1112
io_uring_try_cancel_requests+0x15e/0xd70 io_uring/io_uring.c:3062
io_uring_cancel_generic+0x6ec/0x8c0 io_uring/io_uring.c:3140
io_uring_files_cancel include/linux/io_uring.h:20 [inline]
do_exit+0x494/0x27a0 kernel/exit.c:894
do_group_exit+0xb3/0x250 kernel/exit.c:1087
get_signal+0x1d77/0x1ef0 kernel/signal.c:3017
arch_do_signal_or_restart+0x79/0x5b0 arch/x86/kernel/signal.c:337
exit_to_user_mode_loop kernel/entry/common.c:111 [inline]
exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline]
__syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
syscall_exit_to_user_mode+0x150/0x2a0 kernel/entry/common.c:218
do_syscall_64+0xd8/0x250 arch/x86/entry/common.c:89
entry_SYSCALL_64_after_hwframe+0x77/0x7f
which is because io_uring has ctx->timeout_lock nesting inside the
io-wq acct lock, the latter of which is used from inside the scheduler
and hence is a raw spinlock, while the former is a "normal" spinlock
and can hence be sleeping on PREEMPT_RT.
Change ctx->timeout_lock to be a raw spinlock to solve this nesting
dependency on PREEMPT_RT=y.
Reported-by: chase xd <sl1589472800@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Parthiban Veerasooran says:
====================
Fixes on the OPEN Alliance TC6 10BASE-T1x MAC-PHY support generic lib
This patch series contain the below fixes.
- Infinite loop error when tx credits becomes 0.
- Race condition between tx skb reference pointers.
v2:
- Added mutex lock to protect tx skb reference handling.
v3:
- Added mutex protection in assigning new tx skb to waiting_tx_skb
pointer.
- Explained the possible scenario for the race condition with the time
diagram in the commit message.
v4:
- Replaced mutex with spin_lock_bh() variants as the start_xmit runs in
BH/softirq context which can't take sleeping locks.
====================
Link: https://patch.msgid.link/20241213123159.439739-1-parthiban.veerasooran@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
There are two skb pointers to manage tx skb's enqueued from n/w stack.
waiting_tx_skb pointer points to the tx skb which needs to be processed
and ongoing_tx_skb pointer points to the tx skb which is being processed.
SPI thread prepares the tx data chunks from the tx skb pointed by the
ongoing_tx_skb pointer. When the tx skb pointed by the ongoing_tx_skb is
processed, the tx skb pointed by the waiting_tx_skb is assigned to
ongoing_tx_skb and the waiting_tx_skb pointer is assigned with NULL.
Whenever there is a new tx skb from n/w stack, it will be assigned to
waiting_tx_skb pointer if it is NULL. Enqueuing and processing of a tx skb
handled in two different threads.
Consider a scenario where the SPI thread processed an ongoing_tx_skb and
it moves next tx skb from waiting_tx_skb pointer to ongoing_tx_skb pointer
without doing any NULL check. At this time, if the waiting_tx_skb pointer
is NULL then ongoing_tx_skb pointer is also assigned with NULL. After
that, if a new tx skb is assigned to waiting_tx_skb pointer by the n/w
stack and there is a chance to overwrite the tx skb pointer with NULL in
the SPI thread. Finally one of the tx skb will be left as unhandled,
resulting packet missing and memory leak.
- Consider the below scenario where the TXC reported from the previous
transfer is 10 and ongoing_tx_skb holds an tx ethernet frame which can be
transported in 20 TXCs and waiting_tx_skb is still NULL.
tx_credits = 10; /* 21 are filled in the previous transfer */
ongoing_tx_skb = 20;
waiting_tx_skb = NULL; /* Still NULL */
- So, (tc6->ongoing_tx_skb || tc6->waiting_tx_skb) becomes true.
- After oa_tc6_prepare_spi_tx_buf_for_tx_skbs()
ongoing_tx_skb = 10;
waiting_tx_skb = NULL; /* Still NULL */
- Perform SPI transfer.
- Process SPI rx buffer to get the TXC from footers.
- Now let's assume previously filled 21 TXCs are freed so we are good to
transport the next remaining 10 tx chunks from ongoing_tx_skb.
tx_credits = 21;
ongoing_tx_skb = 10;
waiting_tx_skb = NULL;
- So, (tc6->ongoing_tx_skb || tc6->waiting_tx_skb) becomes true again.
- In the oa_tc6_prepare_spi_tx_buf_for_tx_skbs()
ongoing_tx_skb = NULL;
waiting_tx_skb = NULL;
- Now the below bad case might happen,
Thread1 (oa_tc6_start_xmit) Thread2 (oa_tc6_spi_thread_handler)
--------------------------- -----------------------------------
- if waiting_tx_skb is NULL
- if ongoing_tx_skb is NULL
- ongoing_tx_skb = waiting_tx_skb
- waiting_tx_skb = skb
- waiting_tx_skb = NULL
...
- ongoing_tx_skb = NULL
- if waiting_tx_skb is NULL
- waiting_tx_skb = skb
To overcome the above issue, protect the moving of tx skb reference from
waiting_tx_skb pointer to ongoing_tx_skb pointer and assigning new tx skb
to waiting_tx_skb pointer, so that the other thread can't access the
waiting_tx_skb pointer until the current thread completes moving the tx
skb reference safely.
Fixes: 53fbde8ab2 ("net: ethernet: oa_tc6: implement transmit path to transfer tx ethernet frames")
Signed-off-by: Parthiban Veerasooran <parthiban.veerasooran@microchip.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
SPI thread wakes up to perform SPI transfer whenever there is an TX skb
from n/w stack or interrupt from MAC-PHY. Ethernet frame from TX skb is
transferred based on the availability tx credits in the MAC-PHY which is
reported from the previous SPI transfer. Sometimes there is a possibility
that TX skb is available to transmit but there is no tx credits from
MAC-PHY. In this case, there will not be any SPI transfer but the thread
will be running in an endless loop until tx credits available again.
So checking the availability of tx credits along with TX skb will prevent
the above infinite loop. When the tx credits available again that will be
notified through interrupt which will trigger the SPI transfer to get the
available tx credits.
Fixes: 53fbde8ab2 ("net: ethernet: oa_tc6: implement transmit path to transfer tx ethernet frames")
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Parthiban Veerasooran <parthiban.veerasooran@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Active busyness of an engine is calculated using gt timestamp and the
context switch in time. While capturing the gt timestamp, it's possible
that the context switches out. This race could result in an active
busyness value that is greater than the actual context runtime value by a
small amount. This leads to a negative delta and throws off busyness
calculations for the user.
If a subsequent count is smaller than the previous one, just return the
previous one, since we expect the busyness to catch up.
Fixes: 77cdd054dd ("drm/i915/pmu: Connect engine busyness stats from GuC to pmu")
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241127174006.190128-3-umesh.nerlige.ramappa@intel.com
(cherry picked from commit cf907f6d294217985e9dafd9985dce874e04ca37)
Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net>
On GT reset, we store total busyness counts for all engines and
re-register the utilization buffer with GuC. At that time we should
reset the buffer, so that we don't get spurious busyness counts on
subsequent queries.
To repro this issue, run igt@perf_pmu@busy-hang followed by
igt@perf_pmu@most-busy-idle-check-all for a couple iterations.
Fixes: 77cdd054dd ("drm/i915/pmu: Connect engine busyness stats from GuC to pmu")
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241127174006.190128-2-umesh.nerlige.ramappa@intel.com
(cherry picked from commit abd318237fa6556c1e5225529af145ef15d5ff0d)
Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net>
The hypercall page is no longer needed. It can be removed, as from the
Xen perspective it is optional.
But, from Linux's perspective, it removes naked RET instructions that
escape the speculative protections that Call Depth Tracking and/or
Untrain Ret are trying to achieve.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Call the Xen hypervisor via the new xen_hypercall_func static-call
instead of the hypercall page.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Add generic hypercall functions usable for all normal (i.e. not iret)
hypercalls. Depending on the guest type and the processor vendor
different functions need to be used due to the to be used instruction
for entering the hypervisor:
- PV guests need to use syscall
- HVM/PVH guests on Intel need to use vmcall
- HVM/PVH guests on AMD and Hygon need to use vmmcall
As PVH guests need to issue hypercalls very early during boot, there
is a 4th hypercall function needed for HVM/PVH which can be used on
Intel and AMD processors. It will check the vendor type and then set
the Intel or AMD specific function to use via static_call().
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
There is a check for NULL at the start of create_txqs() and
create_rxqs() which tess if "nic_dev->txqs" is non-NULL. The
intention is that if the device is already open and the queues
are already created then we don't create them a second time.
However, the bug is that if we have an error in the create_txqs()
then the pointer doesn't get set back to NULL. The NULL check
at the start of the function will say that it's already open when
it's not and the device can't be used.
Set ->txqs back to NULL on cleanup on error.
Fixes: c3e79baf1b ("net-next/hinic: Add logical Txq and Rxq")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/0cc98faf-a0ed-4565-a55b-0fa2734bc205@stanley.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Small follow-up to align this to an equivalent behavior as the bond driver.
The change in 3625920b62 ("teaming: fix vlan_features computing") removed
the netdevice vlan_features when there is no team port attached, yet it
leaves the full set of enc_features intact.
Instead, leave the default features as pre 3625920b62, and recompute once
we do have ports attached. Also, similarly as in bonding case, call the
netdev_base_features() helper on the enc_features.
Fixes: 3625920b62 ("teaming: fix vlan_features computing")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://patch.msgid.link/20241213123657.401868-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski says:
====================
netdev: fix repeated netlink messages in queue dumps
Fix dump continuation for queues and queue stats in the netdev family.
Because we used post-increment when saving id of dumped queue next
skb would re-dump the already dumped queue.
====================
Link: https://patch.msgid.link/20241213152244.3080955-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This test already catches a netlink bug fixed by this series,
but only when running on HW with many queues. Make sure the
netdevsim instance created has a lot of queues, and constrain
the size of the recv_buffer used by netlink.
While at it test both rx and tx queues.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20241213152244.3080955-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
recv_size parameter allows constraining the buffer size for dumps.
It's useful in testing kernel handling of dump continuation,
IOW testing dumps which span multiple skbs.
Let the tests set this parameter when initializing the YNL family.
Keep the normal default, we don't want tests to unintentionally
behave very differently than normal code.
Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/20241213152244.3080955-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The context is supposed to record the next queue to dump,
not last dumped. If the dump doesn't fit we will restart
from the already-dumped queue, duplicating the message.
Before this fix and with the selftest improvements later
in this series we see:
# ./run_kselftest.sh -t drivers/net:stats.py
timeout set to 45
selftests: drivers/net: stats.py
KTAP version 1
1..5
ok 1 stats.check_pause
ok 2 stats.check_fec
ok 3 stats.pkt_byte_sum
# Check| At /root/ksft-net-drv/drivers/net/./stats.py, line 125, in qstat_by_ifindex:
# Check| ksft_eq(len(queues[qtype]), len(set(queues[qtype])),
# Check failed 45 != 44 repeated queue keys
# Check| At /root/ksft-net-drv/drivers/net/./stats.py, line 127, in qstat_by_ifindex:
# Check| ksft_eq(len(queues[qtype]), max(queues[qtype]) + 1,
# Check failed 45 != 44 missing queue keys
# Check| At /root/ksft-net-drv/drivers/net/./stats.py, line 125, in qstat_by_ifindex:
# Check| ksft_eq(len(queues[qtype]), len(set(queues[qtype])),
# Check failed 45 != 44 repeated queue keys
# Check| At /root/ksft-net-drv/drivers/net/./stats.py, line 127, in qstat_by_ifindex:
# Check| ksft_eq(len(queues[qtype]), max(queues[qtype]) + 1,
# Check failed 45 != 44 missing queue keys
# Check| At /root/ksft-net-drv/drivers/net/./stats.py, line 125, in qstat_by_ifindex:
# Check| ksft_eq(len(queues[qtype]), len(set(queues[qtype])),
# Check failed 103 != 100 repeated queue keys
# Check| At /root/ksft-net-drv/drivers/net/./stats.py, line 127, in qstat_by_ifindex:
# Check| ksft_eq(len(queues[qtype]), max(queues[qtype]) + 1,
# Check failed 103 != 100 missing queue keys
# Check| At /root/ksft-net-drv/drivers/net/./stats.py, line 125, in qstat_by_ifindex:
# Check| ksft_eq(len(queues[qtype]), len(set(queues[qtype])),
# Check failed 102 != 100 repeated queue keys
# Check| At /root/ksft-net-drv/drivers/net/./stats.py, line 127, in qstat_by_ifindex:
# Check| ksft_eq(len(queues[qtype]), max(queues[qtype]) + 1,
# Check failed 102 != 100 missing queue keys
not ok 4 stats.qstat_by_ifindex
ok 5 stats.check_down
# Totals: pass:4 fail:1 xfail:0 xpass:0 skip:0 error:0
With the fix:
# ./ksft-net-drv/run_kselftest.sh -t drivers/net:stats.py
timeout set to 45
selftests: drivers/net: stats.py
KTAP version 1
1..5
ok 1 stats.check_pause
ok 2 stats.check_fec
ok 3 stats.pkt_byte_sum
ok 4 stats.qstat_by_ifindex
ok 5 stats.check_down
# Totals: pass:5 fail:0 xfail:0 xpass:0 skip:0 error:0
Fixes: ab63a2387c ("netdev: add per-queue statistics")
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20241213152244.3080955-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The context is supposed to record the next queue to dump,
not last dumped. If the dump doesn't fit we will restart
from the already-dumped queue, duplicating the message.
Before this fix and with the selftest improvements later
in this series we see:
# ./run_kselftest.sh -t drivers/net:queues.py
timeout set to 45
selftests: drivers/net: queues.py
KTAP version 1
1..2
# Check| At /root/ksft-net-drv/drivers/net/./queues.py, line 32, in get_queues:
# Check| ksft_eq(queues, expected)
# Check failed 102 != 100
# Check| At /root/ksft-net-drv/drivers/net/./queues.py, line 32, in get_queues:
# Check| ksft_eq(queues, expected)
# Check failed 101 != 100
not ok 1 queues.get_queues
ok 2 queues.addremove_queues
# Totals: pass:1 fail:1 xfail:0 xpass:0 skip:0 error:0
not ok 1 selftests: drivers/net: queues.py # exit=1
With the fix:
# ./ksft-net-drv/run_kselftest.sh -t drivers/net:queues.py
timeout set to 45
selftests: drivers/net: queues.py
KTAP version 1
1..2
ok 1 queues.get_queues
ok 2 queues.addremove_queues
# Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0
Fixes: 6b6171db7f ("netdev-genl: Add netlink framework functions for queue")
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/20241213152244.3080955-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
GCC performs value range tracking for variables as a way to provide better
diagnostics. One place this is regularly seen is with warnings associated
with bounds-checking, e.g. -Wstringop-overflow, -Wstringop-overread,
-Warray-bounds, etc. In order to keep the signal-to-noise ratio high,
warnings aren't emitted when a value range spans the entire value range
representable by a given variable. For example:
unsigned int len;
char dst[8];
...
memcpy(dst, src, len);
If len's value is unknown, it has the full "unsigned int" range of [0,
UINT_MAX], and GCC's compile-time bounds checks against memcpy() will
be ignored. However, when a code path has been able to narrow the range:
if (len > 16)
return;
memcpy(dst, src, len);
Then the range will be updated for the execution path. Above, len is
now [0, 16] when reading memcpy(), so depending on other optimizations,
we might see a -Wstringop-overflow warning like:
error: '__builtin_memcpy' writing between 9 and 16 bytes into region of size 8 [-Werror=stringop-overflow]
When building with CONFIG_FORTIFY_SOURCE, the fortified run-time bounds
checking can appear to narrow value ranges of lengths for memcpy(),
depending on how the compiler constructs the execution paths during
optimization passes, due to the checks against the field sizes. For
example:
if (p_size_field != SIZE_MAX &&
p_size != p_size_field && p_size_field < size)
As intentionally designed, these checks only affect the kernel warnings
emitted at run-time and do not block the potentially overflowing memcpy(),
so GCC thinks it needs to produce a warning about the resulting value
range that might be reaching the memcpy().
We have seen this manifest a few times now, with the most recent being
with cpumasks:
In function ‘bitmap_copy’,
inlined from ‘cpumask_copy’ at ./include/linux/cpumask.h:839:2,
inlined from ‘__padata_set_cpumasks’ at kernel/padata.c:730:2:
./include/linux/fortify-string.h:114:33: error: ‘__builtin_memcpy’ reading between 257 and 536870904 bytes from a region of size 256 [-Werror=stringop-overread]
114 | #define __underlying_memcpy __builtin_memcpy
| ^
./include/linux/fortify-string.h:633:9: note: in expansion of macro ‘__underlying_memcpy’
633 | __underlying_##op(p, q, __fortify_size); \
| ^~~~~~~~~~~~~
./include/linux/fortify-string.h:678:26: note: in expansion of macro ‘__fortify_memcpy_chk’
678 | #define memcpy(p, q, s) __fortify_memcpy_chk(p, q, s, \
| ^~~~~~~~~~~~~~~~~~~~
./include/linux/bitmap.h:259:17: note: in expansion of macro ‘memcpy’
259 | memcpy(dst, src, len);
| ^~~~~~
kernel/padata.c: In function ‘__padata_set_cpumasks’:
kernel/padata.c:713:48: note: source object ‘pcpumask’ of size [0, 256]
713 | cpumask_var_t pcpumask,
| ~~~~~~~~~~~~~~^~~~~~~~
This warning is _not_ emitted when CONFIG_FORTIFY_SOURCE is disabled,
and with the recent -fdiagnostics-details we can confirm the origin of
the warning is due to FORTIFY's bounds checking:
../include/linux/bitmap.h:259:17: note: in expansion of macro 'memcpy'
259 | memcpy(dst, src, len);
| ^~~~~~
'__padata_set_cpumasks': events 1-2
../include/linux/fortify-string.h:613:36:
612 | if (p_size_field != SIZE_MAX &&
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~
613 | p_size != p_size_field && p_size_field < size)
| ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
| |
| (1) when the condition is evaluated to false
| (2) when the condition is evaluated to true
'__padata_set_cpumasks': event 3
114 | #define __underlying_memcpy __builtin_memcpy
| ^
| |
| (3) out of array bounds here
Note that the cpumask warning started appearing since bitmap functions
were recently marked __always_inline in commit ed8cd2b3bd ("bitmap:
Switch from inline to __always_inline"), which allowed GCC to gain
visibility into the variables as they passed through the FORTIFY
implementation.
In order to silence these false positives but keep otherwise deterministic
compile-time warnings intact, hide the length variable from GCC with
OPTIMIZE_HIDE_VAR() before calling the builtin memcpy.
Additionally add a comment about why all the macro args have copies with
const storage.
Reported-by: "Thomas Weißschuh" <linux@weissschuh.net>
Closes: https://lore.kernel.org/all/db7190c8-d17f-4a0d-bc2f-5903c79f36c2@t-8ch.de/
Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Closes: https://lore.kernel.org/all/20241112124127.1666300-1-nilay@linux.ibm.com/
Tested-by: Nilay Shroff <nilay@linux.ibm.com>
Acked-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Kees Cook <kees@kernel.org>
The values returned by the driver after processing the contents of the
Temperature Result and the Temperature Limit Registers do not correspond to
the TMP512/TMP513 specifications. A raw register value is converted to a
signed integer value by a sign extension in accordance with the algorithm
provided in the specification, but due to the off-by-one error in the sign
bit index, the result is incorrect.
According to the TMP512 and TMP513 datasheets, the Temperature Result (08h
to 0Bh) and Limit (11h to 14h) Registers are 13-bit two's complement
integer values, shifted left by 3 bits. The value is scaled by 0.0625
degrees Celsius per bit. E.g., if regval = 1 1110 0111 0000 000, the
output should be -25 degrees, but the driver will return +487 degrees.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 59dfa75e5d ("hwmon: Add driver for Texas Instruments TMP512/513 sensor chips.")
Signed-off-by: Murad Masimov <m.masimov@maxima.ru>
Link: https://lore.kernel.org/r/20241216173648.526-4-m.masimov@maxima.ru
[groeck: fixed description line length]
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
The value returned by the driver after processing the contents of the
Current Register does not correspond to the TMP512/TMP513 specifications.
A raw register value is converted to a signed integer value by a sign
extension in accordance with the algorithm provided in the specification,
but due to the off-by-one error in the sign bit index, the result is
incorrect. Moreover, negative values will be reported as large positive
due to missing sign extension from u32 to long.
According to the TMP512 and TMP513 datasheets, the Current Register (07h)
is a 16-bit two's complement integer value. E.g., if regval = 1000 0011
0000 0000, then the value must be (-32000 * lsb), but the driver will
return (33536 * lsb).
Fix off-by-one bug, and also cast data->curr_lsb_ua (which is of type u32)
to long to prevent incorrect cast for negative values.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 59dfa75e5d ("hwmon: Add driver for Texas Instruments TMP512/513 sensor chips.")
Signed-off-by: Murad Masimov <m.masimov@maxima.ru>
Link: https://lore.kernel.org/r/20241216173648.526-3-m.masimov@maxima.ru
[groeck: Fixed description line length]
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
The values returned by the driver after processing the contents of the
Shunt Voltage Register and the Shunt Limit Registers do not correspond to
the TMP512/TMP513 specifications. A raw register value is converted to a
signed integer value by a sign extension in accordance with the algorithm
provided in the specification, but due to the off-by-one error in the sign
bit index, the result is incorrect. Moreover, the PGA shift calculated with
the tmp51x_get_pga_shift function is relevant only to the Shunt Voltage
Register, but is also applied to the Shunt Limit Registers.
According to the TMP512 and TMP513 datasheets, the Shunt Voltage Register
(04h) is 13 to 16 bit two's complement integer value, depending on the PGA
setting. The Shunt Positive (0Ch) and Negative (0Dh) Limit Registers are
16-bit two's complement integer values. Below are some examples:
* Shunt Voltage Register
If PGA = 8, and regval = 1000 0011 0000 0000, then the decimal value must
be -32000, but the value calculated by the driver will be 33536.
* Shunt Limit Register
If regval = 1000 0011 0000 0000, then the decimal value must be -32000, but
the value calculated by the driver will be 768, if PGA = 1.
Fix sign bit index, and also correct misleading comment describing the
tmp51x_get_pga_shift function.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: 59dfa75e5d ("hwmon: Add driver for Texas Instruments TMP512/513 sensor chips.")
Signed-off-by: Murad Masimov <m.masimov@maxima.ru>
Link: https://lore.kernel.org/r/20241216173648.526-2-m.masimov@maxima.ru
[groeck: Fixed description and multi-line alignments]
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
When function tracing and function graph tracing are both enabled (in
different instances) the "parent" of some of the function tracing events
is "return_to_handler" which is the trampoline used by function graph
tracing. To fix this, ftrace_get_true_parent_ip() was introduced that
returns the "true" parent ip instead of the trampoline.
To do this, the ftrace_regs_get_stack_pointer() is used, which uses
kernel_stack_pointer(). The problem is that microblaze does not implement
kerenl_stack_pointer() so when function graph tracing is enabled, the
build fails. But microblaze also does not enabled HAVE_DYNAMIC_FTRACE_WITH_ARGS.
That option has to be enabled by the architecture to reliably get the
values from the fregs parameter passed in. When that config is not set,
the architecture can also pass in NULL, which is not tested for in that
function and could cause the kernel to crash.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Jeff Xie <jeff.xie@linux.dev>
Link: https://lore.kernel.org/20241216164633.6df18e87@gandalf.local.home
Fixes: 60b1f578b5 ("ftrace: Get the true parent ip for function tracer")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
A bug was discovered where the idle shadow stacks were not initialized
for offline CPUs when starting function graph tracer, and when they came
online they were not traced due to the missing shadow stack. To fix
this, the idle task shadow stack initialization was moved to using the
CPU hotplug callbacks. But it removed the initialization when the
function graph was enabled. The problem here is that the hotplug
callbacks are called when the CPUs come online, but the idle shadow
stack initialization only happens if function graph is currently
active. This caused the online CPUs to not get their shadow stack
initialized.
The idle shadow stack initialization still needs to be done when the
function graph is registered, as they will not be allocated if function
graph is not registered.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241211135335.094ba282@batman.local.home
Fixes: 2c02f7375e ("fgraph: Use CPU hotplug mechanism to initialize idle shadow stacks")
Reported-by: Linus Walleij <linus.walleij@linaro.org>
Tested-by: Linus Walleij <linus.walleij@linaro.org>
Closes: https://lore.kernel.org/all/CACRpkdaTBrHwRbbrphVy-=SeDz6MSsXhTKypOtLrTQ+DgGAOcQ@mail.gmail.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Pull SoC fixes from Arnd Bergmann:
"Three small fixes for the soc tree:
- devicetee fix for the Arm Juno reference machine, to allow more
interesting PCI configurations
- build fix for SCMI firmware on the NXP i.MX platform
- fix for a race condition in Arm FF-A firmware"
* tag 'soc-fixes-6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
arm64: dts: fvp: Update PCIe bus-range property
firmware: arm_ffa: Fix the race around setting ffa_dev->properties
firmware: arm_scmi: Fix i.MX build dependency
Pull x86 platform driver fixes from Ilpo Järvinen:
- alienware-wmi:
- Add support for Alienware m16 R1 AMD
- Do not setup legacy LED control with X and G Series
- intel/ifs: Clearwater Forest support
- intel/vsec: Panther Lake support
- p2sb: Do not hide the device if BIOS left it unhidden
- touchscreen_dmi: Add SARY Tab 3 tablet information
* tag 'platform-drivers-x86-v6.13-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86/intel/vsec: Add support for Panther Lake
platform/x86/intel/ifs: Add Clearwater Forest to CPU support list
platform/x86: touchscreen_dmi: Add info for SARY Tab 3 tablet
p2sb: Do not scan and remove the P2SB device when it is unhidden
p2sb: Move P2SB hide and unhide code to p2sb_scan_and_cache()
p2sb: Introduce the global flag p2sb_hidden_by_bios
p2sb: Factor out p2sb_read_from_cache()
alienware-wmi: Adds support to Alienware m16 R1 AMD
alienware-wmi: Fix X Series and G Series quirks
Johan writes:
USB-serial device ids for 6.13-rc3
Here are some new modem device ids.
All have been in linux-next with no reported issues.
* tag 'usb-serial-6.13-rc3' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial:
USB: serial: option: add Telit FE910C04 rmnet compositions
USB: serial: option: add MediaTek T7XX compositions
USB: serial: option: add Netprisma LCUK54 modules for WWAN Ready
USB: serial: option: add MeiG Smart SLM770A
USB: serial: option: add TCL IK512 MBIM & ECM
For many use cases (e.g. container images are just fetched from remote),
performance will be impacted if underlay page cache is up-to-date but
direct i/o flushes dirty pages first.
Instead, let's use buffered I/O by default to keep in sync with loop
devices and add a (re)mount option to explicitly give a try to use
direct I/O if supported by the underlying files.
The container startup time is improved as below:
[workload] docker.io/library/workpress:latest
unpack 1st run non-1st runs
EROFS snapshotter buffered I/O file 4.586404265s 0.308s 0.198s
EROFS snapshotter direct I/O file 4.581742849s 2.238s 0.222s
EROFS snapshotter loop 4.596023152s 0.346s 0.201s
Overlayfs snapshotter 5.382851037s 0.206s 0.214s
Fixes: fb17675026 ("erofs: add file-backed mount support")
Cc: Derek McGowan <derek@mcg.dev>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241212134336.2059899-1-hsiangkao@linux.alibaba.com
When USB-C monitor is connected directly to Intel Barlow Ridge host, it
goes into "redrive" mode that basically routes the DisplayPort signals
directly from the GPU to the USB-C monitor without any tunneling needed.
However, the host router must be powered on for this to work. Aaron
reported that there are a couple of cases where this will not work with
the current code:
- Booting with USB-C monitor plugged in.
- Plugging in USB-C monitor when the host router is in sleep state
(runtime suspended).
- Plugging in USB-C device while the system is in system sleep state.
In all these cases once the host router is runtime suspended the picture
on the connected USB-C display disappears too. This is certainly not
what the user expected.
For this reason improve the redrive mode handling to keep the host
router from runtime suspending when detect that any of the above cases
is happening.
Fixes: a75e0684ef ("thunderbolt: Keep the domain powered when USB4 port is in redrive mode")
Reported-by: Aaron Rainbolt <arainbolt@kfocus.org>
Closes: https://lore.kernel.org/linux-usb/20241009220118.70bfedd0@kf-ir16/
Cc: stable@vger.kernel.org
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
If client send parallel smb2 negotiate request on same connection,
ksmbd_conn can be racy. smb2 negotiate handling that are not
performance-related can be serialized with conn lock.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Since commit 0a77d947f5 ("ksmbd: check outstanding simultaneous SMB
operations"), ksmbd enforces a maximum number of simultaneous operations
for a connection. The problem is that reaching the limit causes ksmbd to
close the socket, and the client has no indication that it should have
slowed down.
This behaviour can be reproduced by setting "smb2 max credits = 128" (or
lower), and transferring a large file (25GB).
smbclient fails as below:
$ smbclient //192.168.1.254/testshare -U user%pass
smb: \> put file.bin
cli_push returned NT_STATUS_USER_SESSION_DELETED
putting file file.bin as \file.bin smb2cli_req_compound_submit:
Insufficient credits. 0 available, 1 needed
NT_STATUS_INTERNAL_ERROR closing remote file \file.bin
smb: \> smb2cli_req_compound_submit: Insufficient credits. 0 available,
1 needed
Windows clients fail with 0x8007003b (with smaller files even).
Fix this by delaying reading from the socket until there's room to
allocate a request. This effectively applies backpressure on the client,
so the transfer completes, albeit at a slower rate.
Fixes: 0a77d947f5 ("ksmbd: check outstanding simultaneous SMB operations")
Signed-off-by: Marios Makassikis <mmakassikis@freebox.fr>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
This changes the semantics of req_running to count all in-flight
requests on a given connection, rather than the number of elements
in the conn->request list. The latter is used only in smb2_cancel,
and the counter is not used
Signed-off-by: Marios Makassikis <mmakassikis@freebox.fr>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
When evaluating extended permissions, ignore unknown permissions instead
of calling BUG(). This commit ensures that future permissions can be
added without interfering with older kernels.
Cc: stable@vger.kernel.org
Fixes: fa1aa143ac ("selinux: extended permissions for ioctls")
Signed-off-by: Thiébaud Weksteen <tweek@google.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Pull EFI fixes from Ard Biesheuvel:
- Limit EFI zboot to GZIP and ZSTD before it comes in wider use
- Fix inconsistent error when looking up a non-existent file in
efivarfs with a name that does not adhere to the NAME-GUID format
- Drop some unused code
* tag 'efi-fixes-for-v6.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi/esrt: remove esre_attribute::store()
efivarfs: Fix error on non-existent file
efi/zboot: Limit compression options to GZIP and ZSTD
Pull i2c fixes from Wolfram Sang:
"i2c host fixes: PNX used the wrong unit for timeouts, Nomadik was
missing a sentinel, and RIIC was missing rounding up"
* tag 'i2c-for-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: riic: Always round-up when calculating bus period
i2c: nomadik: Add missing sentinel to match table
i2c: pnx: Fix timeout in wait functions
The existing linked list based implementation of how ts tags are
assigned and managed is unsafe against concurrency and corner cases:
- element addition in tx processing can race against element removal
in ts queue completion,
- element removal in ts queue completion can race against element
removal in device close,
- if a large number of frames gets added to tx queue without ts queue
completions in between, elements with duplicate tag values can get
added.
Use a different implementation, based on per-port used tags bitmaps and
saved skb arrays.
Safety for addition in tx processing vs removal in ts completion is
provided by:
tag = find_first_zero_bit(...);
smp_mb();
<write rdev->ts_skb[tag]>
set_bit(...);
vs
<read rdev->ts_skb[tag]>
smp_mb();
clear_bit(...);
Safety for removal in ts completion vs removal in device close is
provided by using atomic read-and-clear for rdev->ts_skb[tag]:
ts_skb = xchg(&rdev->ts_skb[tag], NULL);
if (ts_skb)
<handle it>
Fixes: 33f5d733b5 ("net: renesas: rswitch: Improve TX timestamp accuracy")
Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
Link: https://patch.msgid.link/20241212062558.436455-1-nikita.yoush@cogentembedded.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The calculation determining whether to use three- or four-level paging
didn't account for KMSAN modules metadata. Include this metadata in the
virtual memory size calculation to ensure correct paging mode selection
and avoiding potentially unnecessary physical memory size limitations.
Fixes: 65ca73f9fb ("s390/mm: define KMSAN metadata for vmalloc and modules")
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
There are some FW error handling paths that can cause us to
try to destroy the workqueue more than once, so let's be sure
we're checking for that.
The case where this popped up was in an AER event where the
handlers got called in such a way that ionic_reset_prepare()
and thus ionic_dev_teardown() got called twice in a row.
The second time through the workqueue was already destroyed,
and destroy_workqueue() choked on the bad wq pointer.
We didn't hit this in AER handler testing before because at
that time we weren't using a private workqueue. Later we
replaced the use of the system workqueue with our own private
workqueue but hadn't rerun the AER handler testing since then.
Fixes: 9e25450da7 ("ionic: add private workqueue per-device")
Signed-off-by: Shannon Nelson <shannon.nelson@amd.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20241212213157.12212-3-shannon.nelson@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the correct attribute space for sub-message key lookup in nested
attributes when adding attributes. This fixes rt_link where the "kind"
key and "data" sub-message are nested attributes in "linkinfo".
For example:
./tools/net/ynl/cli.py \
--create \
--spec Documentation/netlink/specs/rt_link.yaml \
--do newlink \
--json '{"link": 99,
"linkinfo": { "kind": "vlan", "data": {"id": 4 } }
}'
Signed-off-by: Donald Hunter <donald.hunter@gmail.com>
Fixes: ab463c4342 ("tools/net/ynl: Add support for encoding sub-messages")
Link: https://patch.msgid.link/20241213130711.40267-1-donald.hunter@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Packets injected by the CPU should have a SRC_PORT field equal to the
CPU port module index in the Analyzer block (ocelot->num_phys_ports).
The blamed commit copied the ocelot_ifh_set_basic() call incorrectly
from ocelot_xmit_common() in net/dsa/tag_ocelot.c. Instead of calling
with "x", it calls with BIT_ULL(x), but the field is not a port mask,
but rather a single port index.
[ side note: this is the technical debt of code duplication :( ]
The error used to be silent and doesn't appear to have other
user-visible manifestations, but with new changes in the packing
library, it now fails loudly as follows:
------------[ cut here ]------------
Cannot store 0x40 inside bits 46-43 - will truncate
sja1105 spi2.0: xmit timed out
WARNING: CPU: 1 PID: 102 at lib/packing.c:98 __pack+0x90/0x198
sja1105 spi2.0: timed out polling for tstamp
CPU: 1 UID: 0 PID: 102 Comm: felix_xmit
Tainted: G W N 6.13.0-rc1-00372-gf706b85d972d-dirty #2605
Call trace:
__pack+0x90/0x198 (P)
__pack+0x90/0x198 (L)
packing+0x78/0x98
ocelot_ifh_set_basic+0x260/0x368
ocelot_port_inject_frame+0xa8/0x250
felix_port_deferred_xmit+0x14c/0x258
kthread_worker_fn+0x134/0x350
kthread+0x114/0x138
The code path pertains to the ocelot switchdev driver and to the felix
secondary DSA tag protocol, ocelot-8021q. Here seen with ocelot-8021q.
The messenger (packing) is not really to blame, so fix the original
commit instead.
Fixes: e1b9e80236 ("net: mscc: ocelot: fix QoS class for injected packets with "ocelot-8021q"")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241212165546.879567-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull EDAC fix from Borislav Petkov:
- Make sure amd64_edac loads successfully on certain Zen4 memory
configurations
* tag 'edac_urgent_for_v6.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/amd64: Simplify ECC check on unified memory controllers
Pull irq fixes from Borislav Petkov:
- Disable the secure programming interface of the GIC500 chip in the
RK3399 SoC to fix interrupt priority assignment and even make a dead
machine boot again when the gic-v3 driver enables pseudo NMIs
- Correct the declaration of a percpu variable to fix several sparse
warnings
* tag 'irq_urgent_for_v6.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3: Work around insecure GIC integrations
irqchip/gic: Correct declaration of *percpu_base pointer in union gic_base
Pull scheduler fixes from Borislav Petkov:
- Prevent incorrect dequeueing of the deadline dlserver helper task and
fix its time accounting
- Properly track the CFS runqueue runnable stats
- Check the total number of all queued tasks in a sched fair's runqueue
hierarchy before deciding to stop the tick
- Fix the scheduling of the task that got woken last (NEXT_BUDDY) by
preventing those from being delayed
* tag 'sched_urgent_for_v6.13_rc3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/dlserver: Fix dlserver time accounting
sched/dlserver: Fix dlserver double enqueue
sched/eevdf: More PELT vs DELAYED_DEQUEUE
sched/fair: Fix sched_can_stop_tick() for fair tasks
sched/fair: Fix NEXT_BUDDY
Pull kvm fixes from Paolo Bonzini:
"ARM64:
- Fix confusion with implicitly-shifted MDCR_EL2 masks breaking
SPE/TRBE initialization
- Align nested page table walker with the intended memory attribute
combining rules of the architecture
- Prevent userspace from constraining the advertised ASID width,
avoiding horrors of guest TLBIs not matching the intended context
in hardware
- Don't leak references on LPIs when insertion into the translation
cache fails
RISC-V:
- Replace csr_write() with csr_set() for HVIEN PMU overflow bit
x86:
- Cache CPUID.0xD XSTATE offsets+sizes during module init
On Intel's Emerald Rapids CPUID costs hundreds of cycles and there
are a lot of leaves under 0xD. Getting rid of the CPUIDs during
nested VM-Enter and VM-Exit is planned for the next release, for
now just cache them: even on Skylake that is 40% faster"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: Cache CPUID.0xD XSTATE offsets+sizes during module init
RISC-V: KVM: Fix csr_write -> csr_set for HVIEN PMU overflow bit
KVM: arm64: vgic-its: Add error handling in vgic_its_cache_translation
KVM: arm64: Do not allow ID_AA64MMFR0_EL1.ASIDbits to be overridden
KVM: arm64: Fix S1/S2 combination when FWB==1 and S2 has Device memory type
arm64: Fix usage of new shifted MDCR_EL2 values
Guangguan Wang says:
====================
net: several fixes for smc
v1 -> v2:
rewrite patch #2 suggested by Paolo.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When receiving clc msg, the field length in smc_clc_msg_hdr indicates the
length of msg should be received from network and the value should not be
fully trusted as it is from the network. Once the value of length exceeds
the value of buflen in function smc_clc_wait_msg it may run into deadloop
when trying to drain the remaining data exceeding buflen.
This patch checks the return value of sock_recvmsg when draining data in
case of deadloop in draining.
Fixes: fb4f79264c ("net/smc: tolerate future SMCD versions")
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When receiving proposal msg in server, the field smcd_v2_ext_offset in
proposal msg is from the remote client and can not be fully trusted.
Once the value of smcd_v2_ext_offset exceed the max value, there has
the chance to access wrong address, and crash may happen.
This patch checks the value of smcd_v2_ext_offset before using it.
Fixes: 5c21c4ccaf ("net/smc: determine accepted ISM devices")
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When receiving proposal msg in server, the fields v2_ext_offset/
eid_cnt/ism_gid_cnt in proposal msg are from the remote client
and can not be fully trusted. Especially the field v2_ext_offset,
once exceed the max value, there has the chance to access wrong
address, and crash may happen.
This patch checks the fields v2_ext_offset/eid_cnt/ism_gid_cnt
before using them.
Fixes: 8c3dca341a ("net/smc: build and send V2 CLC proposal")
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When receiving proposal msg in server, the field iparea_offset
and the field ipv6_prefixes_cnt in proposal msg are from the
remote client and can not be fully trusted. Especially the
field iparea_offset, once exceed the max value, there has the
chance to access wrong address, and crash may happen.
This patch checks iparea_offset and ipv6_prefixes_cnt before using them.
Fixes: e7b7a64a84 ("smc: support variable CLC proposal messages")
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When application sending data more than sndbuf_space, there have chances
application will sleep in epoll_wait, and will never be wakeup again. This
is caused by a race between smc_poll and smc_cdc_tx_handler.
application tasklet
smc_tx_sendmsg(len > sndbuf_space) |
epoll_wait for EPOLL_OUT,timeout=0 |
smc_poll |
if (!smc->conn.sndbuf_space) |
| smc_cdc_tx_handler
| atomic_add sndbuf_space
| smc_tx_sndbuf_nonfull
| if (!test_bit SOCK_NOSPACE)
| do not sk_write_space;
set_bit SOCK_NOSPACE; |
return mask=0; |
Application will sleep in epoll_wait as smc_poll returns 0. And
smc_cdc_tx_handler will not call sk_write_space because the SOCK_NOSPACE
has not be set. If there is no inflight cdc msg, sk_write_space will not be
called any more, and application will sleep in epoll_wait forever.
So check sndbuf_space again after NOSPACE flag is set to break the race.
Fixes: 8dce2786a2 ("net/smc: smc_poll improvements")
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
link down work may be scheduled before lgr freed but execute
after lgr freed, which may result in crash. So it is need to
hold a reference before shedule link down work, and put the
reference after work executed or canceled.
The relevant crash call stack as follows:
list_del corruption. prev->next should be ffffb638c9c0fe20,
but was 0000000000000000
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:51!
invalid opcode: 0000 [#1] SMP NOPTI
CPU: 6 PID: 978112 Comm: kworker/6:119 Kdump: loaded Tainted: G #1
Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 2221b89 04/01/2014
Workqueue: events smc_link_down_work [smc]
RIP: 0010:__list_del_entry_valid.cold+0x31/0x47
RSP: 0018:ffffb638c9c0fdd8 EFLAGS: 00010086
RAX: 0000000000000054 RBX: ffff942fb75e5128 RCX: 0000000000000000
RDX: ffff943520930aa0 RSI: ffff94352091fc80 RDI: ffff94352091fc80
RBP: 0000000000000000 R08: 0000000000000000 R09: ffffb638c9c0fc38
R10: ffffb638c9c0fc30 R11: ffffffffa015eb28 R12: 0000000000000002
R13: ffffb638c9c0fe20 R14: 0000000000000001 R15: ffff942f9cd051c0
FS: 0000000000000000(0000) GS:ffff943520900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f4f25214000 CR3: 000000025fbae004 CR4: 00000000007706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
rwsem_down_write_slowpath+0x17e/0x470
smc_link_down_work+0x3c/0x60 [smc]
process_one_work+0x1ac/0x350
worker_thread+0x49/0x2f0
? rescuer_thread+0x360/0x360
kthread+0x118/0x140
? __kthread_bind_mask+0x60/0x60
ret_from_fork+0x1f/0x30
Fixes: 541afa10c1 ("net/smc: add smcr_port_err() and smcr_link_down() processing")
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull SCSI fix from James Bottomley:
"Single one-line fix in the ufs driver"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: core: Update compl_time_stamp_local_clock after completing a cqe
Pull bpf fixes from Daniel Borkmann:
- Fix a bug in the BPF verifier to track changes to packet data
property for global functions (Eduard Zingerman)
- Fix a theoretical BPF prog_array use-after-free in RCU handling of
__uprobe_perf_func (Jann Horn)
- Fix BPF tracing to have an explicit list of tracepoints and their
arguments which need to be annotated as PTR_MAYBE_NULL (Kumar
Kartikeya Dwivedi)
- Fix a logic bug in the bpf_remove_insns code where a potential error
would have been wrongly propagated (Anton Protopopov)
- Avoid deadlock scenarios caused by nested kprobe and fentry BPF
programs (Priya Bala Govindasamy)
- Fix a bug in BPF verifier which was missing a size check for
BTF-based context access (Kumar Kartikeya Dwivedi)
- Fix a crash found by syzbot through an invalid BPF prog_array access
in perf_event_detach_bpf_prog (Jiri Olsa)
- Fix several BPF sockmap bugs including a race causing a refcount
imbalance upon element replace (Michal Luczaj)
- Fix a use-after-free from mismatching BPF program/attachment RCU
flavors (Jann Horn)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (23 commits)
bpf: Avoid deadlock caused by nested kprobe and fentry bpf programs
selftests/bpf: Add tests for raw_tp NULL args
bpf: Augment raw_tp arguments with PTR_MAYBE_NULL
bpf: Revert "bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"
selftests/bpf: Add test for narrow ctx load for pointer args
bpf: Check size for BTF-based ctx access of pointer members
selftests/bpf: extend changes_pkt_data with cases w/o subprograms
bpf: fix null dereference when computing changes_pkt_data of prog w/o subprogs
bpf: Fix theoretical prog_array UAF in __uprobe_perf_func()
bpf: fix potential error return
selftests/bpf: validate that tail call invalidates packet pointers
bpf: consider that tail calls invalidate packet pointers
selftests/bpf: freplace tests for tracking of changes_packet_data
bpf: check changes_pkt_data property for extension programs
selftests/bpf: test for changing packet data from global functions
bpf: track changes_pkt_data property for global functions
bpf: refactor bpf_helper_changes_pkt_data to use helper number
bpf: add find_containing_subprog() utility function
bpf,perf: Fix invalid prog_array access in perf_event_detach_bpf_prog
bpf: Fix UAF via mismatching bpf_prog/attachment RCU flavors
...
Pull USB driver fixes from Greg KH:
"Here are some small USB driver fixes for some reported issues.
Included in here are:
- typec driver bugfixes
- u_serial gadget driver bugfix for much reported and discussed issue
- dwc2 bugfixes
- midi gadget driver bugfix
- ehci-hcd driver bugfix
- other small bugfixes
All of these have been in linux-next for over a week with no reported
issues"
* tag 'usb-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: typec: ucsi: Fix connector status writing past buffer size
usb: typec: ucsi: Fix completion notifications
usb: dwc2: Fix HCD port connection race
usb: dwc2: hcd: Fix GetPortStatus & SetPortFeature
usb: dwc2: Fix HCD resume
usb: gadget: u_serial: Fix the issue that gs_start_io crashed due to accessing null pointer
usb: misc: onboard_usb_dev: skip suspend/resume sequence for USB5744 SMBus support
usb: dwc3: xilinx: make sure pipe clock is deselected in usb2 only mode
usb: core: hcd: only check primary hcd skip_phy_initialization
usb: gadget: midi2: Fix interpretation of is_midi1 bits
usb: dwc3: imx8mp: fix software node kernel dump
usb: typec: anx7411: fix OF node reference leaks in anx7411_typec_switch_probe()
usb: typec: anx7411: fix fwnode_handle reference leak
usb: host: max3421-hcd: Correctly abort a USB request.
dt-bindings: phy: imx8mq-usb: correct reference to usb-switch.yaml
usb: ehci-hcd: fix call balance of clocks handling routines
Pull serial driver fixes from Greg KH:
"Here are two small serial driver fixes for 6.13-rc3. They are:
- ioport build fallout fix for the 8250 port driver that should
resolve Guenter's runtime problems
- sh-sci driver bugfix for a reported problem
Both of these have been in linux-next for a while with no reported
issues"
* tag 'tty-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
tty: serial: Work around warning backtrace in serial8250_set_defaults
serial: sh-sci: Check if TX data was written to device in .tx_empty()
Pull staging driver fixes from Greg KH:
"Here are some small staging gpib driver build and bugfixes for issues
that have been much-reported (should finally fix Guenter's build
issues). There are more of these coming in later -rc releases, but for
now this should fix the majority of the reported problems.
All of these have been in linux-next for a while with no reported
issues"
* tag 'staging-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: gpib: Fix i386 build issue
staging: gpib: Fix faulty workaround for assignment in if
staging: gpib: Workaround for ppc build failure
staging: gpib: Make GPIB_NI_PCI_ISA depend on HAS_IOPORT
Pull crypto fixes from Herbert Xu:
"Fix a regression in rsassa-pkcs1 as well as a buffer overrun in
hisilicon/debugfs"
* tag 'v6.13-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: hisilicon/debugfs - fix the struct pointer incorrectly offset problem
crypto: rsassa-pkcs1 - Copy source data for SG list
Pull rust fixes from Miguel Ojeda:
"Toolchain and infrastructure:
- Set bindgen's Rust target version to prevent issues when
pairing older rustc releases with newer bindgen releases,
such as bindgen >= 0.71.0 and rustc < 1.82 due to
unsafe_extern_blocks.
drm/panic:
- Remove spurious empty line detected by a new Clippy warning"
* tag 'rust-fixes-6.13' of https://github.com/Rust-for-Linux/linux:
rust: kbuild: set `bindgen`'s Rust target version
drm/panic: remove spurious empty line to clean warning
Pull iommu fixes from Joerg Roedel:
- Per-domain device-list locking fixes for the AMD IOMMU driver
- Fix incorrect use of smp_processor_id() in the NVidia-specific part
of the ARM-SMMU-v3 driver
- Intel IOMMU driver fixes:
- Remove cache tags before disabling ATS
- Avoid draining PRQ in sva mm release path
- Fix qi_batch NULL pointer with nested parent domain
* tag 'iommu-fixes-v6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
iommu/vt-d: Avoid draining PRQ in sva mm release path
iommu/vt-d: Fix qi_batch NULL pointer with nested parent domain
iommu/vt-d: Remove cache tags before disabling ATS
iommu/amd: Add lockdep asserts for domain->dev_list
iommu/amd: Put list_add/del(dev_data) back under the domain->lock
iommu/tegra241-cmdqv: do not use smp_processor_id in preemptible context
Pull ata fix from Damien Le Moal:
- Fix an OF node reference leak in the sata_highbank driver
* tag 'ata-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: sata_highbank: fix OF node reference leak in highbank_initialize_phys()
i2c-host-fixes for v6.13-rc3
- Replaced jiffies with msec for timeout calculations.
- Added a sentinel to the 'of_device_id' array in Nomadik.
- Rounded up bus period calculation in RIIC.
Pull smb client fixes from Steve French:
- fix rmmod leak
- two minor cleanups
- fix for unlink/rename with pending i/o
* tag '6.13-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: destroy cfid_put_wq on module exit
cifs: Use str_yes_no() helper in cifs_ses_add_channel()
cifs: Fix rmdir failure due to ongoing I/O on deleted file
smb3: fix compiler warning in reparse code
Pull spi fixes from Mark Brown:
"A few fairly small fixes for v6.13, the most substatial one being
disabling STIG mode for Cadence QSPI controllers on Altera SoCFPGA
platforms since it doesn't work"
* tag 'spi-fix-v6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: spi-cadence-qspi: Disable STIG mode for Altera SoCFPGA.
spi: rockchip: Fix PM runtime count on no-op cs
spi: aspeed: Fix an error handling path in aspeed_spi_[read|write]_user()
Pull regulator fixes from Mark Brown:
"A couple of additional changes, one ensuring we give AXP717 enough
time to stabilise after changing voltages which fixes serious
stability issues on some platforms and another documenting the DT
support required for the Qualcomm WCN6750"
* tag 'regulator-fix-v6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: axp20x: AXP717: set ramp_delay
regulator: dt-bindings: qcom,qca6390-pmu: document wcn6750-pmu
Pull drm fixes from Dave Airlie:
"This is the weekly fixes pull for drm. Just has i915, xe and amdgpu
changes in it. Nothing too major in here:
i915:
- Don't use indexed register writes needlessly [dsb]
- Stop using non-posted DSB writes for legacy LUT [color]
- Fix NULL pointer dereference in capture_engine
- Fix memory leak by correcting cache object name in error handler
xe:
- Fix a KUNIT test error message (Mirsad Todorovac)
- Fix an invalidation fence PM ref leak (Daniele)
- Fix a register pool UAF (Lucas)
amdgpu:
- ISP hw init fix
- SR-IOV fixes
- Fix contiguous VRAM mapping for UVD on older GPUs
- Fix some regressions due to drm scheduler changes
- Workload profile fixes
- Cleaner shader fix
amdkfd:
- Fix DMA map direction for migration
- Fix a potential null pointer dereference
- Cacheline size fixes
- Runtime PM fix"
* tag 'drm-fixes-2024-12-14' of https://gitlab.freedesktop.org/drm/kernel:
drm/xe/reg_sr: Remove register pool
drm/xe: Call invalidation_fence_fini for PT inval fences in error state
drm/xe: fix the ERR_PTR() returned on failure to allocate tiny pt
drm/amdkfd: pause autosuspend when creating pdd
drm/amdgpu: fix when the cleaner shader is emitted
drm/amdgpu: Fix ISP HW init issue
drm/amdkfd: hard-code MALL cacheline size for gfx11, gfx12
drm/amdkfd: hard-code cacheline size for gfx11
drm/amdkfd: Dereference null return value
drm/i915: Fix memory leak by correcting cache object name in error handler
drm/i915: Fix NULL pointer dereference in capture_engine
drm/i915/color: Stop using non-posted DSB writes for legacy LUT
drm/i915/dsb: Don't use indexed register writes needlessly
drm/amdkfd: Correct the migration DMA map direction
drm/amd/pm: Set SMU v13.0.7 default workload type
drm/amd/pm: Initialize power profile mode
amdgpu/uvd: get ring reference from rq scheduler
drm/amdgpu: fix UVD contiguous CS mapping problem
drm/amdgpu: use sjt mec fw on gfx943 for sriov
Revert "drm/amdgpu: Fix ISP hw init issue"
Pull power management documentation fix from Rafael Wysocki:
"Fix a runtime PM documentation mistake that may mislead someone into
making a coding mistake (Paul Barker)"
* tag 'pm-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
Documentation: PM: Clarify pm_runtime_resume_and_get() return value
Pull ACPI fixes from Rafael Wysocki:
"These fix two coding mistakes, one in the ACPI resources handling code
and one in ACPICA:
- Relocate the addr->info.mem.caching check in acpi_decode_space() to
only execute it if the resource is of the correct type (Ilpo
Järvinen)
- Don't release a context_mutex that was never acquired in
acpi_remove_address_space_handler() (Daniil Tatianin)"
* tag 'acpi-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPICA: events/evxfregn: don't release the ContextMutex that was never acquired
ACPI: resource: Fix memory resource type union access
Kumar Kartikeya Dwivedi says:
====================
Explicit raw_tp NULL arguments
This set reverts the raw_tp masking changes introduced in commit
cb4158ce8e ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL") and
replaces it wwith an explicit list of tracepoints and their arguments
which need to be annotated as PTR_MAYBE_NULL. More context on the
fallout caused by the masking fix and subsequent discussions can be
found in [0].
To remedy this, we implement a solution of explicitly defined tracepoint
and define which args need to be marked NULL or scalar (for IS_ERR
case). The commit logs describes the details of this approach in detail.
We will follow up this solution an approach Eduard is working on to
perform automated analysis of NULL-ness of tracepoint arguments. The
current PoC is available here:
- LLVM branch with the analysis:
https://github.com/eddyz87/llvm-project/tree/nullness-for-tracepoint-params
- Python script for merging of analysis results:
https://gist.github.com/eddyz87/e47c164466a60e8d49e6911cff146f47
The idea is to infer a tri-state verdict for each tracepoint parameter:
definitely not null, can be null, unknown (in which case no assumptions
should be made).
Using this information, the verifier in most cases will be able to
precisely determine the state of the tracepoint parameter without any
human effort. At that point, the table maintained manually in this set
can be dropped and replace with this automated analysis tool's result.
This will be kept up to date with each kernel release.
[0]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com
Changelog:
----------
v2 -> v3:
v2: https://lore.kernel.org/bpf/20241213175127.2084759-1-memxor@gmail.com
* Address Eduard's nits, add Reviewed-by
v1 -> v2:
v1: https://lore.kernel.org/bpf/20241211020156.18966-1-memxor@gmail.com
* Address comments from Jiri
* Mark module tracepoints args NULL by default
* Add more sunrpc tracepoints
* Unify scalar or null handling
* Address comments from Alexei
* Use bitmask approach suggested in review
* Unify scalar or null handling
* Drop most tests that rely on CONFIG options
* Drop scripts to generate tests
====================
Link: https://patch.msgid.link/20241213221929.3495062-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add tests to ensure that arguments are correctly marked based on their
specified positions, and whether they get marked correctly as maybe
null. For modules, all tracepoint parameters should be marked
PTR_MAYBE_NULL by default.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241213221929.3495062-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Arguments to a raw tracepoint are tagged as trusted, which carries the
semantics that the pointer will be non-NULL. However, in certain cases,
a raw tracepoint argument may end up being NULL. More context about this
issue is available in [0].
Thus, there is a discrepancy between the reality, that raw_tp arguments can
actually be NULL, and the verifier's knowledge, that they are never NULL,
causing explicit NULL check branch to be dead code eliminated.
A previous attempt [1], i.e. the second fixed commit, was made to
simulate symbolic execution as if in most accesses, the argument is a
non-NULL raw_tp, except for conditional jumps. This tried to suppress
branch prediction while preserving compatibility, but surfaced issues
with production programs that were difficult to solve without increasing
verifier complexity. A more complete discussion of issues and fixes is
available at [2].
Fix this by maintaining an explicit list of tracepoints where the
arguments are known to be NULL, and mark the positional arguments as
PTR_MAYBE_NULL. Additionally, capture the tracepoints where arguments
are known to be ERR_PTR, and mark these arguments as scalar values to
prevent potential dereference.
Each hex digit is used to encode NULL-ness (0x1) or ERR_PTR-ness (0x2),
shifted by the zero-indexed argument number x 4. This can be represented
as follows:
1st arg: 0x1
2nd arg: 0x10
3rd arg: 0x100
... and so on (likewise for ERR_PTR case).
In the future, an automated pass will be used to produce such a list, or
insert __nullable annotations automatically for tracepoints. Each
compilation unit will be analyzed and results will be collated to find
whether a tracepoint pointer is definitely not null, maybe null, or an
unknown state where verifier conservatively marks it PTR_MAYBE_NULL.
A proof of concept of this tool from Eduard is available at [3].
Note that in case we don't find a specification in the raw_tp_null_args
array and the tracepoint belongs to a kernel module, we will
conservatively mark the arguments as PTR_MAYBE_NULL. This is because
unlike for in-tree modules, out-of-tree module tracepoints may pass NULL
freely to the tracepoint. We don't protect against such tracepoints
passing ERR_PTR (which is uncommon anyway), lest we mark all such
arguments as SCALAR_VALUE.
While we are it, let's adjust the test raw_tp_null to not perform
dereference of the skb->mark, as that won't be allowed anymore, and make
it more robust by using inline assembly to test the dead code
elimination behavior, which should still stay the same.
[0]: https://lore.kernel.org/bpf/ZrCZS6nisraEqehw@jlelli-thinkpadt14gen4.remote.csb
[1]: https://lore.kernel.org/all/20241104171959.2938862-1-memxor@gmail.com
[2]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com
[3]: https://github.com/eddyz87/llvm-project/tree/nullness-for-tracepoint-params
Reported-by: Juri Lelli <juri.lelli@redhat.com> # original bug
Reported-by: Manu Bretelle <chantra@meta.com> # bugs in masking fix
Fixes: 3f00c52393 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs")
Fixes: cb4158ce8e ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL")
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Co-developed-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241213221929.3495062-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This patch reverts commit
cb4158ce8e ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL"). The
patch was well-intended and meant to be as a stop-gap fixing branch
prediction when the pointer may actually be NULL at runtime. Eventually,
it was supposed to be replaced by an automated script or compiler pass
detecting possibly NULL arguments and marking them accordingly.
However, it caused two main issues observed for production programs and
failed to preserve backwards compatibility. First, programs relied on
the verifier not exploring == NULL branch when pointer is not NULL, thus
they started failing with a 'dereference of scalar' error. Next,
allowing raw_tp arguments to be modified surfaced the warning in the
verifier that warns against reg->off when PTR_MAYBE_NULL is set.
More information, context, and discusson on both problems is available
in [0]. Overall, this approach had several shortcomings, and the fixes
would further complicate the verifier's logic, and the entire masking
scheme would have to be removed eventually anyway.
Hence, revert the patch in preparation of a better fix avoiding these
issues to replace this commit.
[0]: https://lore.kernel.org/bpf/20241206161053.809580-1-memxor@gmail.com
Reported-by: Manu Bretelle <chantra@meta.com>
Fixes: cb4158ce8e ("bpf: Mark raw_tp arguments with PTR_MAYBE_NULL")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241213221929.3495062-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull block fixes from Jens Axboe:
- Series from Damien fixing issues with the zoned write plugging
- Fix for a potential UAF in block cgroups
- Fix deadlock around queue freezing and the sysfs lock
- Various little cleanups and fixes
* tag 'block-6.13-20241213' of git://git.kernel.dk/linux:
block: Fix potential deadlock while freezing queue and acquiring sysfs_lock
block: Fix queue_iostats_passthrough_show()
blk-mq: Clean up blk_mq_requeue_work()
mq-deadline: Remove a local variable
blk-iocost: Avoid using clamp() on inuse in __propagate_weights()
block: Make bio_iov_bvec_set() accept pointer to const iov_iter
block: get wp_offset by bdev_offset_from_zone_start
blk-cgroup: Fix UAF in blkcg_unpin_online()
MAINTAINERS: update Coly Li's email address
block: Prevent potential deadlocks in zone write plug error recovery
dm: Fix dm-zoned-reclaim zone write pointer alignment
block: Ignore REQ_NOWAIT for zone reset and zone finish operations
block: Use a zone write plug BIO work for REQ_NOWAIT BIOs
Pull io_uring fix from Jens Axboe:
"A single fix for a regression introduced in the 6.13 merge window"
* tag 'io_uring-6.13-20241213' of git://git.kernel.dk/linux:
io_uring/rsrc: don't put/free empty buffers
Pull libnvdimm fix from Ira Weiny:
- sysbot fix for out of bounds access
* tag 'libnvdimm-fixes-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
acpi: nfit: vmalloc-out-of-bounds Read in acpi_nfit_ctl
ARC GCC compiler is packaged starting from Fedora 39i and the GCC
variant of cross compile tools has arc-linux-gnu- prefix and not
arc-linux-. This is causing that CROSS_COMPILE variable is left unset.
This change allows builds without need to supply CROSS_COMPILE argument
if distro package is used.
Before this change:
$ make -j 128 ARCH=arc W=1 drivers/infiniband/hw/mlx4/
gcc: warning: ‘-mcpu=’ is deprecated; use ‘-mtune=’ or ‘-march=’ instead
gcc: error: unrecognized command-line option ‘-mmedium-calls’
gcc: error: unrecognized command-line option ‘-mlock’
gcc: error: unrecognized command-line option ‘-munaligned-access’
[1] https://packages.fedoraproject.org/pkgs/cross-gcc/gcc-arc-linux-gnu/index.html
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Vineet Gupta <vgupta@kernel.org>
Pull xfs fixes from Carlos Maiolino:
- Fixes for scrub subsystem
- Fix quota crashes
- Fix sb_spino_align checons on large fsblock sizes
- Fix discarded superblock updates
- Fix stuck unmount due to a locked inode
* tag 'xfs-fixes-6.13-rc3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (28 commits)
xfs: port xfs_ioc_start_commit to multigrain timestamps
xfs: return from xfs_symlink_verify early on V4 filesystems
xfs: fix zero byte checking in the superblock scrubber
xfs: check pre-metadir fields correctly
xfs: don't crash on corrupt /quotas dirent
xfs: don't move nondir/nonreg temporary repair files to the metadir namespace
xfs: fix sb_spino_align checks for large fsblock sizes
xfs: convert quotacheck to attach dquot buffers
xfs: attach dquot buffer to dquot log item buffer
xfs: clean up log item accesses in xfs_qm_dqflush{,_done}
xfs: separate dquot buffer reads from xfs_dqflush
xfs: don't lose solo dquot update transactions
xfs: don't lose solo superblock counter update transactions
xfs: avoid nested calls to __xfs_trans_commit
xfs: only run precommits once per transaction object
xfs: unlock inodes when erroring out of xfs_trans_alloc_dir
xfs: fix scrub tracepoints when inode-rooted btrees are involved
xfs: update btree keys correctly when _insrec splits an inode root block
xfs: fix error bailout in xfs_rtginode_create
xfs: fix null bno_hint handling in xfs_rtallocate_rtg
...
Pull arm64 fixes from Catalin Marinas:
- arm64 stacktrace: address some fallout from the recent changes to
unwinding across exception boundaries
- Ensure the arm64 signal delivery failure is recoverable - only
override the return registers after all the user accesses took place
- Fix the arm64 kselftest access to SVCR - only when SME is detected
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
kselftest/arm64: abi: fix SVCR detection
arm64: signal: Ensure signal delivery failure is recoverable
arm64: stacktrace: Don't WARN when unwinding other tasks
arm64: stacktrace: Skip reporting LR at exception boundaries
Pull RISC-V fixes from Palmer Dabbelt:
- avoid taking a mutex while resolving jump_labels in the mutex
implementation
- avoid trying to resolve the early boot DT pointer via the linear map
- avoid trying to IPI kfence TLB flushes, as kfence might flush with
IRQs disabled
- avoid calling PMD destructors on PMDs that were never constructed
* tag 'riscv-for-linus-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: mm: Do not call pmd dtor on vmemmap page table teardown
riscv: Fix IPIs usage in kfence_protect_page()
riscv: Fix wrong usage of __pa() on a fixmap address
riscv: Fixup boot failure when CONFIG_DEBUG_RT_MUTEXES=y
Merge an ACPICA fix for 6.13-rc3:
- Don't release a context_mutex that was never acquired in
acpi_remove_address_space_handler() (Daniil Tatianin).
* acpica:
ACPICA: events/evxfregn: don't release the ContextMutex that was never acquired
Snapshot the output of CPUID.0xD.[1..n] during kvm.ko initiliaization to
avoid the overead of CPUID during runtime. The offset, size, and metadata
for CPUID.0xD.[1..n] sub-leaves does not depend on XCR0 or XSS values, i.e.
is constant for a given CPU, and thus can be cached during module load.
On Intel's Emerald Rapids, CPUID is *wildly* expensive, to the point where
recomputing XSAVE offsets and sizes results in a 4x increase in latency of
nested VM-Enter and VM-Exit (nested transitions can trigger
xstate_required_size() multiple times per transition), relative to using
cached values. The issue is easily visible by running `perf top` while
triggering nested transitions: kvm_update_cpuid_runtime() shows up at a
whopping 50%.
As measured via RDTSC from L2 (using KVM-Unit-Test's CPUID VM-Exit test
and a slightly modified L1 KVM to handle CPUID in the fastpath), a nested
roundtrip to emulate CPUID on Skylake (SKX), Icelake (ICX), and Emerald
Rapids (EMR) takes:
SKX 11650
ICX 22350
EMR 28850
Using cached values, the latency drops to:
SKX 6850
ICX 9000
EMR 7900
The underlying issue is that CPUID itself is slow on ICX, and comically
slow on EMR. The problem is exacerbated on CPUs which support XSAVES
and/or XSAVEC, as KVM invokes xstate_required_size() twice on each
runtime CPUID update, and because there are more supported XSAVE features
(CPUID for supported XSAVE feature sub-leafs is significantly slower).
SKX:
CPUID.0xD.2 = 348 cycles
CPUID.0xD.3 = 400 cycles
CPUID.0xD.4 = 276 cycles
CPUID.0xD.5 = 236 cycles
<other sub-leaves are similar>
EMR:
CPUID.0xD.2 = 1138 cycles
CPUID.0xD.3 = 1362 cycles
CPUID.0xD.4 = 1068 cycles
CPUID.0xD.5 = 910 cycles
CPUID.0xD.6 = 914 cycles
CPUID.0xD.7 = 1350 cycles
CPUID.0xD.8 = 734 cycles
CPUID.0xD.9 = 766 cycles
CPUID.0xD.10 = 732 cycles
CPUID.0xD.11 = 718 cycles
CPUID.0xD.12 = 734 cycles
CPUID.0xD.13 = 1700 cycles
CPUID.0xD.14 = 1126 cycles
CPUID.0xD.15 = 898 cycles
CPUID.0xD.16 = 716 cycles
CPUID.0xD.17 = 748 cycles
CPUID.0xD.18 = 776 cycles
Note, updating runtime CPUID information multiple times per nested
transition is itself a flaw, especially since CPUID is a mandotory
intercept on both Intel and AMD. E.g. KVM doesn't need to ensure emulated
CPUID state is up-to-date while running L2. That flaw will be fixed in a
future patch, as deferring runtime CPUID updates is more subtle than it
appears at first glance, the benefits aren't super critical to have once
the XSAVE issue is resolved, and caching CPUID output is desirable even if
KVM's updates are deferred.
Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20241211013302.1347853-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull gpio fixes from Bartosz Golaszewski:
- fix several low-level issues in gpio-graniterapids
- fix an initialization order issue that manifests itself with
__counted_by() checks in gpio-ljca
- don't default to y for CONFIG_GPIO_MVEBU with COMPILE_TEST enabled
- move the DEFAULT_SYMBOL_NAMESPACE define before the export.h include
in gpio-idio-16
* tag 'gpio-fixes-for-v6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: idio-16: Actually make use of the GPIO_IDIO_16 symbol namespace
gpio: graniterapids: Fix GPIO Ack functionality
gpio: graniterapids: Check if GPIO line can be used for IRQs
gpio: graniterapids: Determine if GPIO pad can be used by driver
gpio: graniterapids: Fix invalid RXEVCFG register bitmask
gpio: graniterapids: Fix invalid GPI_IS register offset
gpio: graniterapids: Fix incorrect BAR assignment
gpio: graniterapids: Fix vGPIO driver crash
gpio: ljca: Initialize num before accessing item in ljca_gpio_config
gpio: GPIO_MVEBU should not default to y when compile-testing
Pull sound fixes from Takashi Iwai:
"A collection of fixes; all look small, device-specific and boring"
* tag 'sound-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ASoC: Intel: sof_sdw: Add space for a terminator into DAIs array
ASoC: fsl_spdif: change IFACE_PCM to IFACE_MIXER
ASoC: fsl_xcvr: change IFACE_PCM to IFACE_MIXER
ASoC: tas2781: Fix calibration issue in stress test
ASoC: audio-graph-card: Call of_node_put() on correct node
ASoC: amd: yc: Fix the wrong return value
ALSA: control: Avoid WARN() for symlink errors
sound: usb: format: don't warn that raw DSD is unsupported
sound: usb: enable DSD output for ddHiFi TC44C
ALSA: hda/realtek: Add new alc2xx-fixup-headset-mic model
ALSA: hda/ca0132: Use standard HD-audio quirk matching helpers
ALSA: usb-audio: Add implicit feedback quirk for Yamaha THR5
ALSA: hda/realtek - Add support for ASUS Zen AIO 27 Z272SD_A272SD audio
ALSA: hda/realtek: Fix headset mic on Acer Nitro 5
ALSA: hda: cs35l56: Remove calls to cs35l56_force_sync_asp1_registers_from_cache()
Pull documentation fix from Jonathan Corbet:
"A single fix for a docs-build regression caused by the
EXPORT_SYMBOL_NS() mass change"
* tag 'docs-6.13-fix' of git://git.lwn.net/linux:
scripts/kernel-doc: Get -export option working again
Pull slab fix from Vlastimil Babka:
- Fix for memcg unreclaimable slab stats drift when post-charging large
kmalloc allocations (Shakeel Butt)
* tag 'slab-for-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
memcg: slub: fix SUnreclaim for post charged objects
It appears that the relatively popular RK3399 SoC has been put together
using a large amount of illicit substances, as experiments reveal that its
integration of GIC500 exposes the *secure* programming interface to
non-secure.
This has some pretty bad effects on the way priorities are handled, and
results in a dead machine if booting with pseudo-NMI enabled
(irqchip.gicv3_pseudo_nmi=1) if the kernel contains 18fdb6348c ("arm64:
irqchip/gic-v3: Select priorities at boot time"), which relies on the
priorities being programmed using the NS view.
Let's restore some sanity by going one step further and disable security
altogether in this case. This is not any worse, and puts us in a mode where
priorities actually make some sense.
Huge thanks to Mark Kettenis who initially identified this issue on
OpenBSD, and to Chen-Yu Tsai who reported the problem in Linux.
Fixes: 18fdb6348c ("arm64: irqchip/gic-v3: Select priorities at boot time")
Reported-by: Mark Kettenis <mark.kettenis@xs4all.nl>
Reported-by: Chen-Yu Tsai <wens@csie.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Chen-Yu Tsai <wens@csie.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20241213141037.3995049-1-maz@kernel.org
percpu_base is used in various percpu functions that expect variable in
__percpu address space. Correct the declaration of percpu_base to
void __iomem * __percpu *percpu_base;
to declare the variable as __percpu pointer.
The patch fixes several sparse warnings:
irq-gic.c:1172:44: warning: incorrect type in assignment (different address spaces)
irq-gic.c:1172:44: expected void [noderef] __percpu *[noderef] __iomem *percpu_base
irq-gic.c:1172:44: got void [noderef] __iomem *[noderef] __percpu *
...
irq-gic.c:1231:43: warning: incorrect type in argument 1 (different address spaces)
irq-gic.c:1231:43: expected void [noderef] __percpu *__pdata
irq-gic.c:1231:43: got void [noderef] __percpu *[noderef] __iomem *percpu_base
There were no changes in the resulting object files.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/all/20241213145809.2918-2-ubizjak@gmail.com
Instead of returning a generic NULL on error from
drm_dp_tunnel_mgr_create(), use error pointers with informative codes
to align the function with stub that is executed when
CONFIG_DRM_DISPLAY_DP_TUNNEL is unset. This will also trigger IS_ERR()
in current caller (intel_dp_tunnerl_mgr_init()) instead of bypassing it
via NULL pointer.
v2: use error codes inside drm_dp_tunnel_mgr_create() instead of handling
on caller's side (Michal, Imre)
v3: fixup commit message and add "CC"/"Fixes" lines (Andi),
mention aligning function code with stub
Fixes: 91888b5b1a ("drm/i915/dp: Add support for DP tunnel BW allocation")
Cc: Imre Deak <imre.deak@intel.com>
Cc: <stable@vger.kernel.org> # v6.9+
Signed-off-by: Krzysztof Karas <krzysztof.karas@intel.com>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
Signed-off-by: Imre Deak <imre.deak@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/7q4fpnmmztmchczjewgm6igy55qt6jsm7tfd4fl4ucfq6yg2oy@q4lxtsu6445c
Make queue_iostats_passthrough_show() report 0/1 in sysfs instead of 0/4.
This patch fixes the following sparse warning:
block/blk-sysfs.c:266:31: warning: incorrect type in argument 1 (different base types)
block/blk-sysfs.c:266:31: expected unsigned long var
block/blk-sysfs.c:266:31: got restricted blk_flags_t
Cc: Keith Busch <kbusch@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Fixes: 110234da18 ("block: enable passthrough command statistics")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20241212212941.1268662-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When a PASID is used for SVA by a device, it's possible that the PASID
entry is cleared before the device flushes all ongoing DMA requests and
removes the SVA domain. This can occur when an exception happens and the
process terminates before the device driver stops DMA and calls the
iommu driver to unbind the PASID.
There's no need to drain the PRQ in the mm release path. Instead, the PRQ
will be drained in the SVA unbind path.
Unfortunately, commit c43e1ccdeb ("iommu/vt-d: Drain PRQs when domain
removed from RID") changed this behavior by unconditionally draining the
PRQ in intel_pasid_tear_down_entry(). This can lead to a potential
sleeping-in-atomic-context issue.
Smatch static checker warning:
drivers/iommu/intel/prq.c:95 intel_iommu_drain_pasid_prq()
warn: sleeping in atomic context
To avoid this issue, prevent draining the PRQ in the SVA mm release path
and restore the previous behavior.
Fixes: c43e1ccdeb ("iommu/vt-d: Drain PRQs when domain removed from RID")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/linux-iommu/c5187676-2fa2-4e29-94e0-4a279dc88b49@stanley.mountain/
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241212021529.1104745-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The qi_batch is allocated when assigning cache tag for a domain. While
for nested parent domain, it is missed. Hence, when trying to map pages
to the nested parent, NULL dereference occurred. Also, there is potential
memleak since there is no lock around domain->qi_batch allocation.
To solve it, add a helper for qi_batch allocation, and call it in both
the __cache_tag_assign_domain() and __cache_tag_assign_parent_domain().
BUG: kernel NULL pointer dereference, address: 0000000000000200
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 8104795067 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 223 UID: 0 PID: 4357 Comm: qemu-system-x86 Not tainted 6.13.0-rc1-00028-g4b50c3c3b998-dirty #2632
Call Trace:
? __die+0x24/0x70
? page_fault_oops+0x80/0x150
? do_user_addr_fault+0x63/0x7b0
? exc_page_fault+0x7c/0x220
? asm_exc_page_fault+0x26/0x30
? cache_tag_flush_range_np+0x13c/0x260
intel_iommu_iotlb_sync_map+0x1a/0x30
iommu_map+0x61/0xf0
batch_to_domain+0x188/0x250
iopt_area_fill_domains+0x125/0x320
? rcu_is_watching+0x11/0x50
iopt_map_pages+0x63/0x100
iopt_map_common.isra.0+0xa7/0x190
iopt_map_user_pages+0x6a/0x80
iommufd_ioas_map+0xcd/0x1d0
iommufd_fops_ioctl+0x118/0x1c0
__x64_sys_ioctl+0x93/0xc0
do_syscall_64+0x71/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fixes: 705c1cdf1e ("iommu/vt-d: Introduce batched cache invalidation")
Cc: stable@vger.kernel.org
Co-developed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241210130322.17175-1-yi.l.liu@intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
The current implementation removes cache tags after disabling ATS,
leading to potential memory leaks and kernel crashes. Specifically,
CACHE_TAG_DEVTLB type cache tags may still remain in the list even
after the domain is freed, causing a use-after-free condition.
This issue really shows up when multiple VFs from different PFs
passed through to a single user-space process via vfio-pci. In such
cases, the kernel may crash with kernel messages like:
BUG: kernel NULL pointer dereference, address: 0000000000000014
PGD 19036a067 P4D 1940a3067 PUD 136c9b067 PMD 0
Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 74 UID: 0 PID: 3183 Comm: testCli Not tainted 6.11.9 #2
RIP: 0010:cache_tag_flush_range+0x9b/0x250
Call Trace:
<TASK>
? __die+0x1f/0x60
? page_fault_oops+0x163/0x590
? exc_page_fault+0x72/0x190
? asm_exc_page_fault+0x22/0x30
? cache_tag_flush_range+0x9b/0x250
? cache_tag_flush_range+0x5d/0x250
intel_iommu_tlb_sync+0x29/0x40
intel_iommu_unmap_pages+0xfe/0x160
__iommu_unmap+0xd8/0x1a0
vfio_unmap_unpin+0x182/0x340 [vfio_iommu_type1]
vfio_remove_dma+0x2a/0xb0 [vfio_iommu_type1]
vfio_iommu_type1_ioctl+0xafa/0x18e0 [vfio_iommu_type1]
Move cache_tag_unassign_domain() before iommu_disable_pci_caps() to fix
it.
Fixes: 3b1d9e2b2d ("iommu/vt-d: Add cache tag assignment interface")
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20241129020506.576413-1-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Commit eaf62ce156 ("arm64/signal: Set up and restore the GCS
context for signal handlers") introduced a potential failure point
at the end of setup_return(). This is unfortunate as it is too late
to deliver a SIGSEGV: if that SIGSEGV is handled, the subsequent
sigreturn will end up returning to the original handler, which is
not the intention (since we failed to deliver that signal).
Make sure this does not happen by calling gcs_signal_entry()
at the very beginning of setup_return(), and add a comment just
after to discourage error cases being introduced from that point
onwards.
While at it, also take care of copy_siginfo_to_user(): since it may
fail, we shouldn't be calling it after setup_return() either. Call
it before setup_return() instead, and move the setting of X1/X2
inside setup_return() where it belongs (after the "point of no
failure").
Background: the first part of setup_rt_frame(), including
setup_sigframe(), has no impact on the execution of the interrupted
thread. The signal frame is written to the stack, but the stack
pointer remains unchanged. Failure at this stage can be recovered by
a SIGSEGV handler, and sigreturn will restore the original context,
at the point where the original signal occurred. On the other hand,
once setup_return() has updated registers including SP, the thread's
control flow has been modified and we must deliver the original
signal.
Fixes: eaf62ce156 ("arm64/signal: Set up and restore the GCS context for signal handlers")
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20241210160940.2031997-1-kevin.brodsky@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Arm FF-A fix for v6.13
A single fix to address a possible race around setting ffa_dev->properties
in ffa_device_register() by updating ffa_device_register() to take all
the partition information received from the firmware and updating the
struct ffa_device accordingly before registering the device to the
bus/driver model in the kernel.
* tag 'ffa-fix-6.13' of https://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
firmware: arm_ffa: Fix the race around setting ffa_dev->properties
Link: https://lore.kernel.org/r/20241210101113.3232602-1-sudeep.holla@arm.com
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
dlserver time is accounted when:
- dlserver is active and the dlserver proxies the cfs task.
- dlserver is active but deferred and cfs task runs after being picked
through the normal fair class pick.
dl_server_update is called in two places to make sure that both the
above times are accounted for. But it doesn't check if dlserver is
active or not. Now that we have this dl_server_active flag, we can
consolidate dl_server_update into one place and all we need to check is
whether dlserver is active or not. When dlserver is active there is only
two possible conditions:
- dlserver is deferred.
- cfs task is running on behalf of dlserver.
Fixes: a110a81c52 ("sched/deadline: Deferrable dl server")
Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # ROCK 5B
Link: https://lore.kernel.org/r/20241213032244.877029-2-vineeth@bitbyteword.org
dlserver can get dequeued during a dlserver pick_task due to the delayed
deueue feature and this can lead to issues with dlserver logic as it
still thinks that dlserver is on the runqueue. The dlserver throttling
and replenish logic gets confused and can lead to double enqueue of
dlserver.
Double enqueue of dlserver could happend due to couple of reasons:
Case 1
------
Delayed dequeue feature[1] can cause dlserver being stopped during a
pick initiated by dlserver:
__pick_next_task
pick_task_dl -> server_pick_task
pick_task_fair
pick_next_entity (if (sched_delayed))
dequeue_entities
dl_server_stop
server_pick_task goes ahead with update_curr_dl_se without knowing that
dlserver is dequeued and this confuses the logic and may lead to
unintended enqueue while the server is stopped.
Case 2
------
A race condition between a task dequeue on one cpu and same task's enqueue
on this cpu by a remote cpu while the lock is released causing dlserver
double enqueue.
One cpu would be in the schedule() and releasing RQ-lock:
current->state = TASK_INTERRUPTIBLE();
schedule();
deactivate_task()
dl_stop_server();
pick_next_task()
pick_next_task_fair()
sched_balance_newidle()
rq_unlock(this_rq)
at which point another CPU can take our RQ-lock and do:
try_to_wake_up()
ttwu_queue()
rq_lock()
...
activate_task()
dl_server_start() --> first enqueue
wakeup_preempt() := check_preempt_wakeup_fair()
update_curr()
update_curr_task()
if (current->dl_server)
dl_server_update()
enqueue_dl_entity() --> second enqueue
This bug was not apparent as the enqueue in dl_server_start doesn't
usually happen because of the defer logic. But as a side effect of the
first case(dequeue during dlserver pick), dl_throttled and dl_yield will
be set and this causes the time accounting of dlserver to messup and
then leading to a enqueue in dl_server_start.
Have an explicit flag representing the status of dlserver to avoid the
confusion. This is set in dl_server_start and reset in dlserver_stop.
Fixes: 63ba8422f8 ("sched/deadline: Introduce deadline servers")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # ROCK 5B
Link: https://lkml.kernel.org/r/20241213032244.877029-1-vineeth@bitbyteword.org
Instead of jumping to the Xen hypercall page for doing the iret
hypercall, directly code the required sequence in xen-asm.S.
This is done in preparation of no longer using hypercall page at all,
as it has shown to cause problems with speculation mitigations.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Add static_call_update_early() for updating static-call targets in
very early boot.
This will be needed for support of Xen guest type specific hypercall
functions.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
The syscall instruction is used in Xen PV mode for doing hypercalls.
Allow syscall to be used in the kernel in case it is tagged with an
unwind hint for objtool.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Co-developed-by: Peter Zijlstra <peterz@infradead.org>
In order to be able to differentiate between AMD and Intel based
systems for very early hypercalls without having to rely on the Xen
hypercall page, make get_cpu_vendor() non-static.
Refactor early_cpu_init() for the same reason by splitting out the
loop initializing cpu_devs() into an externally callable function.
This is part of XSA-466 / CVE-2024-53241.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
When removing a netfront device directly after a suspend/resume cycle
it might happen that the queues have not been setup again, causing a
crash during the attempt to stop the queues another time.
Fix that by checking the queues are existing before trying to stop
them.
This is XSA-465 / CVE-2024-53240.
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Fixes: d50b7914fa ("xen-netfront: Fix NULL sring after live migration")
Signed-off-by: Juergen Gross <jgross@suse.com>
xfs: bug fixes for 6.13 [01/12]
Bug fixes for 6.13.
This has been running on the djcloud for months with no problems. Enjoy!
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Take advantage of the multigrain timestamp APIs to ensure that nobody
can sneak in and write things to a file between starting a file update
operation and committing the results. This should have been part of the
multigrain timestamp merge, but I forgot to fling it at jlayton when he
resubmitted the patchset due to developer bandwidth problems.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: 4e40eff0b5 ("fs: add infrastructure for multigrain timestamps")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
V4 symlink blocks didn't have headers, so return early if this is a V4
filesystem.
Cc: <stable@vger.kernel.org> # v5.1
Fixes: 39708c20ab ("xfs: miscellaneous verifier magic value fixups")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The logic to check that the region past the end of the superblock is all
zeroes is wrong -- we don't want to check only the bytes past the end of
the maximally sized ondisk superblock structure as currently defined in
xfs_format.h; we want to check the bytes beyond the end of the ondisk as
defined by the feature bits.
Port the superblock size logic from xfs_repair and then put it to use in
xfs_scrub.
Cc: <stable@vger.kernel.org> # v4.15
Fixes: 21fb4cb198 ("xfs: scrub the secondary superblocks")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The checks that were added to the superblock scrubber for metadata
directories aren't quite right -- the old inode pointers are now defined
to be zeroes until someone else reuses them. Also consolidate the new
metadir field checks to one place; they were inexplicably scattered
around.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: 28d756d4d5 ("xfs: update sb field checks when metadir is turned on")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If the /quotas dirent points to an inode but the inode isn't loadable
(and hence mkdir returns -EEXIST), don't crash, just bail out.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: e80fbe1ad8 ("xfs: use metadir for quota inodes")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Only directories or regular files are allowed in the metadata directory
tree. Don't move the repair tempfile to the metadir namespace if this
is not true; this will cause the inode verifiers to trip.
xrep_tempfile_adjust_directory_tree opportunistically moves sc->tempip
from the regular directory tree to the metadata directory tree if sc->ip
is part of the metadata directory tree. However, the scrub setup
functions grab sc->ip and create sc->tempip before we actually get
around to checking if the file mode is the right type for the scrubber.
IOWs, you can invoke the symlink scrubber with the file handle of a
subdirectory in the metadir. xrep_setup_symlink will create a temporary
symlink file, xrep_tempfile_adjust_directory_tree will foolishly try to
set the METADATA flag on the temp symlink, which trips the inode
verifier in the inode item precommit, which shuts down the filesystem
when expensive checks are turned on. If they're /not/ turned on, then
xchk_symlink will return ENOENT when it sees that it's been passed a
symlink, but the invalid inode could still get flushed to disk. We
don't want that.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: 9dc31acb01 ("xfs: move repair temporary files to the metadata directory tree")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
For a sparse inodes filesystem, mkfs.xfs computes the values of
sb_spino_align and sb_inoalignmt with the following code:
int cluster_size = XFS_INODE_BIG_CLUSTER_SIZE;
if (cfg->sb_feat.crcs_enabled)
cluster_size *= cfg->inodesize / XFS_DINODE_MIN_SIZE;
sbp->sb_spino_align = cluster_size >> cfg->blocklog;
sbp->sb_inoalignmt = XFS_INODES_PER_CHUNK *
cfg->inodesize >> cfg->blocklog;
On a V5 filesystem with 64k fsblocks and 512 byte inodes, this results
in cluster_size = 8192 * (512 / 256) = 16384. As a result,
sb_spino_align and sb_inoalignmt are both set to zero. Unfortunately,
this trips the new sb_spino_align check that was just added to
xfs_validate_sb_common, and the mkfs fails:
# mkfs.xfs -f -b size=64k, /dev/sda
meta-data=/dev/sda isize=512 agcount=4, agsize=81136 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
= exchange=0 metadir=0
data = bsize=65536 blocks=324544, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=65536 ascii-ci=0, ftype=1, parent=0
log =internal log bsize=65536 blocks=5006, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=65536 blocks=0, rtextents=0
= rgcount=0 rgsize=0 extents
Discarding blocks...Sparse inode alignment (0) is invalid.
Metadata corruption detected at 0x560ac5a80bbe, xfs_sb block 0x0/0x200
libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1
mkfs.xfs: Releasing dirty buffer to free list!
found dirty buffer (bulk) on free list!
Sparse inode alignment (0) is invalid.
Metadata corruption detected at 0x560ac5a80bbe, xfs_sb block 0x0/0x200
libxfs_bwrite: write verifier failed on xfs_sb bno 0x0/0x1
mkfs.xfs: writing AG headers failed, err=22
Prior to commit 59e43f5479 this all worked fine, even if "sparse"
inodes are somewhat meaningless when everything fits in a single
fsblock. Adjust the checks to handle existing filesystems.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: 59e43f5479 ("xfs: sb_spino_align is not verified")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Now that we've converted the dquot logging machinery to attach the dquot
buffer to the li_buf pointer so that the AIL dqflush doesn't have to
allocate or read buffers in a reclaim path, do the same for the
quotacheck code so that the reclaim shrinker dqflush call doesn't have
to do that either.
Cc: <stable@vger.kernel.org> # v6.12
Fixes: 903edea6c5 ("mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Ever since 6.12-rc1, I've observed a pile of warnings from the kernel
when running fstests with quotas enabled:
WARNING: CPU: 1 PID: 458580 at mm/page_alloc.c:4221 __alloc_pages_noprof+0xc9c/0xf18
CPU: 1 UID: 0 PID: 458580 Comm: xfsaild/sda3 Tainted: G W 6.12.0-rc6-djwa #rc6 6ee3e0e531f6457e2d26aa008a3b65ff184b377c
<snip>
Call trace:
__alloc_pages_noprof+0xc9c/0xf18
alloc_pages_mpol_noprof+0x94/0x240
alloc_pages_noprof+0x68/0xf8
new_slab+0x3e0/0x568
___slab_alloc+0x5a0/0xb88
__slab_alloc.constprop.0+0x7c/0xf8
__kmalloc_noprof+0x404/0x4d0
xfs_buf_get_map+0x594/0xde0 [xfs 384cb02810558b4c490343c164e9407332118f88]
xfs_buf_read_map+0x64/0x2e0 [xfs 384cb02810558b4c490343c164e9407332118f88]
xfs_trans_read_buf_map+0x1dc/0x518 [xfs 384cb02810558b4c490343c164e9407332118f88]
xfs_qm_dqflush+0xac/0x468 [xfs 384cb02810558b4c490343c164e9407332118f88]
xfs_qm_dquot_logitem_push+0xe4/0x148 [xfs 384cb02810558b4c490343c164e9407332118f88]
xfsaild+0x3f4/0xde8 [xfs 384cb02810558b4c490343c164e9407332118f88]
kthread+0x110/0x128
ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---
This corresponds to the line:
WARN_ON_ONCE(current->flags & PF_MEMALLOC);
within the NOFAIL checks. What's happening here is that the XFS AIL is
trying to write a disk quota update back into the filesystem, but for
that it needs to read the ondisk buffer for the dquot. The buffer is
not in memory anymore, probably because it was evicted. Regardless, the
buffer cache tries to allocate a new buffer, but those allocations are
NOFAIL. The AIL thread has marked itself PF_MEMALLOC (aka noreclaim)
since commit 43ff2122e6 ("xfs: on-stack delayed write buffer lists")
presumably because reclaim can push on XFS to push on the AIL.
An easy way to fix this probably would have been to drop the NOFAIL flag
from the xfs_buf allocation and open code a retry loop, but then there's
still the problem that for bs>ps filesystems, the buffer itself could
require up to 64k worth of pages.
Inode items had similar behavior (multi-page cluster buffers that we
don't want to allocate in the AIL) which we solved by making transaction
precommit attach the inode cluster buffers to the dirty log item. Let's
solve the dquot problem in the same way.
So: Make a real precommit handler to read the dquot buffer and attach it
to the log item; pass it to dqflush in the push method; and have the
iodone function detach the buffer once we've flushed everything. Add a
state flag to the log item to track when a thread has entered the
precommit -> push mechanism to skip the detaching if it turns out that
the dquot is very busy, as we don't hold the dquot lock between log item
commit and AIL push).
Reading and attaching the dquot buffer in the precommit hook is inspired
by the work done for inode cluster buffers some time ago.
Cc: <stable@vger.kernel.org> # v6.12
Fixes: 903edea6c5 ("mm: warn about illegal __GFP_NOFAIL usage in a more appropriate location and manner")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Clean up these functions a little bit before we move on to the real
modifications, and make the variable naming consistent for dquot log
items.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The first step towards holding the dquot buffer in the li_buf instead of
reading it in the AIL is to separate the part that reads the buffer from
the actual flush code. There should be no functional changes.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Quota counter updates are tracked via incore objects which hang off the
xfs_trans object. These changes are then turned into dirty log items in
xfs_trans_apply_dquot_deltas just prior to commiting the log items to
the CIL.
However, updating the incore deltas do not cause XFS_TRANS_DIRTY to be
set on the transaction. In other words, a pure quota counter update
will be silently discarded if there are no other dirty log items
attached to the transaction.
This is currently not the case anywhere in the filesystem because quota
updates always dirty at least one other metadata item, but a subsequent
bug fix will add dquot log item precommits, so we actually need a dirty
dquot log item prior to xfs_trans_run_precommits. Also let's not leave
a logic bomb.
Cc: <stable@vger.kernel.org> # v2.6.35
Fixes: 0924378a68 ("xfs: split out iclog writing from xfs_trans_commit()")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Superblock counter updates are tracked via per-transaction counters in
the xfs_trans object. These changes are then turned into dirty log
items in xfs_trans_apply_sb_deltas just prior to commiting the log items
to the CIL.
However, updating the per-transaction counter deltas do not cause
XFS_TRANS_DIRTY to be set on the transaction. In other words, a pure sb
counter update will be silently discarded if there are no other dirty
log items attached to the transaction.
This is currently not the case anywhere in the filesystem because sb
counter updates always dirty at least one other metadata item, but let's
not leave a logic bomb.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Currently, __xfs_trans_commit calls xfs_defer_finish_noroll, which calls
__xfs_trans_commit again on the same transaction. In other words,
there's a nested function call (albeit with slightly different
arguments) that has caused minor amounts of confusion in the past.
There's no reason to keep this around, since there's only one place
where we actually want the xfs_defer_finish_noroll, and that is in the
top level xfs_trans_commit call.
This also reduces stack usage a little bit.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Committing a transaction tx0 with a defer ops chain of (A, B, C)
creates a chain of transactions that looks like this:
tx0 -> txA -> txB -> txC
Prior to commit cb04211748, __xfs_trans_commit would run precommits
on tx0, then call xfs_defer_finish_noroll to convert A-C to tx[A-C].
Unfortunately, after the finish_noroll loop we forgot to run precommits
on txC. That was fixed by adding the second precommit call.
Unfortunately, none of us remembered that xfs_defer_finish_noroll
calls __xfs_trans_commit a second time to commit tx0 before finishing
work A in txA and committing that. In other words, we run precommits
twice on tx0:
xfs_trans_commit(tx0)
__xfs_trans_commit(tx0, false)
xfs_trans_run_precommits(tx0)
xfs_defer_finish_noroll(tx0)
xfs_trans_roll(tx0)
txA = xfs_trans_dup(tx0)
__xfs_trans_commit(tx0, true)
xfs_trans_run_precommits(tx0)
This currently isn't an issue because the inode item precommit is
idempotent; the iunlink item precommit deletes itself so it can't be
called again; and the buffer/dquot item precommits only check the incore
objects for corruption. However, it doesn't make sense to run
precommits twice.
Fix this situation by only running precommits after finish_noroll.
Cc: <stable@vger.kernel.org> # v6.4
Fixes: cb04211748 ("xfs: defered work could create precommits")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Debugging a filesystem patch with generic/475 caused the system to hang
after observing the following sequences in dmesg:
XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x491520 len 32 error 5
XFS (dm-0): metadata I/O error in "xfs_btree_read_buf_block+0xba/0x160 [xfs]" at daddr 0x3445608 len 8 error 5
XFS (dm-0): metadata I/O error in "xfs_imap_to_bp+0x61/0xe0 [xfs]" at daddr 0x138e1c0 len 32 error 5
XFS (dm-0): log I/O error -5
XFS (dm-0): Metadata I/O Error (0x1) detected at xfs_trans_read_buf_map+0x1ea/0x4b0 [xfs] (fs/xfs/xfs_trans_buf.c:311). Shutting down filesystem.
XFS (dm-0): Please unmount the filesystem and rectify the problem(s)
XFS (dm-0): Internal error dqp->q_ino.reserved < dqp->q_ino.count at line 869 of file fs/xfs/xfs_trans_dquot.c. Caller xfs_trans_dqresv+0x236/0x440 [xfs]
XFS (dm-0): Corruption detected. Unmount and run xfs_repair
XFS (dm-0): Unmounting Filesystem be6bcbcc-9921-4deb-8d16-7cc94e335fa7
The system is stuck in unmount trying to lock a couple of inodes so that
they can be purged. The dquot corruption notice above is a clue to what
happened -- a link() call tried to set up a transaction to link a child
into a directory. Quota reservation for the transaction failed after IO
errors shut down the filesystem, but then we forgot to unlock the inodes
on our way out. Fix that.
Cc: <stable@vger.kernel.org> # v6.10
Fixes: bd5562111d ("xfs: Hold inode locks in xfs_trans_alloc_dir")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fix a minor mistakes in the scrub tracepoints that can manifest when
inode-rooted btrees are enabled. The existing code worked fine for bmap
btrees, but we should tighten the code up to be less sloppy.
Cc: <stable@vger.kernel.org> # v5.7
Fixes: 92219c292a ("xfs: convert btree cursor inode-private member names")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
In commit 2c813ad66a, I partially fixed a bug wherein xfs_btree_insrec
would erroneously try to update the parent's key for a block that had
been split if we decided to insert the new record into the new block.
The solution was to detect this situation and update the in-core key
value that we pass up to the caller so that the caller will (eventually)
add the new block to the parent level of the tree with the correct key.
However, I missed a subtlety about the way inode-rooted btrees work. If
the full block was a maximally sized inode root block, we'll solve that
fullness by moving the root block's records to a new block, resizing the
root block, and updating the root to point to the new block. We don't
pass a pointer to the new block to the caller because that work has
already been done. The new record will /always/ land in the new block,
so in this case we need to use xfs_btree_update_keys to update the keys.
This bug can theoretically manifest itself in the very rare case that we
split a bmbt root block and the new record lands in the very first slot
of the new block, though I've never managed to trigger it in practice.
However, it is very easy to reproduce by running generic/522 with the
realtime rmapbt patchset if rtinherit=1.
Cc: <stable@vger.kernel.org> # v4.8
Fixes: 2c813ad66a ("xfs: support btrees with overlapping intervals for keys")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
smatch reported that we screwed up the error cleanup in this function.
Fix it.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: ae897e0bed ("xfs: support creating per-RTG files in growfs")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
xfs_bmap_rtalloc initializes the bno_hint variable to NULLRTBLOCK (aka
NULLFSBLOCK). If the allocation request is for a file range that's
adjacent to an existing mapping, it will then change bno_hint to the
blkno hint in the bmalloca structure.
In other words, bno_hint is either a rt block number, or it's all 1s.
Unfortunately, commit ec12f97f1b didn't take the NULLRTBLOCK state
into account, which means that it tries to translate that into a
realtime extent number. We then end up with an obnoxiously high rtx
number and pointlessly feed that to the near allocator. This often
fails and falls back to the by-size allocator. Seeing as we had no
locality hint anyway, this is a waste of time.
Fix the code to detect a lack of bno_hint correctly. This was detected
by running xfs/009 with metadir enabled and a 28k rt extent size.
Cc: <stable@vger.kernel.org> # v6.12
Fixes: ec12f97f1b ("xfs: make the rtalloc start hint a xfs_rtblock_t")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Once in a long while, xfs/566 and xfs/801 report directory corruption in
one of the metadata subdirectories while it's forcibly rebuilding all
filesystem metadata. I observed the following sequence of events:
1. Initiate a repair of the parent pointers for the /quota/user file.
This is the secret file containing user quota data.
2. The pptr repair thread creates a temporary file and begins staging
parent pointers in the ondisk metadata in preparation for an
exchange-range to commit the new pptr data.
3. At the same time, initiate a repair of the /quota directory itself.
4. The dir repair thread finds the temporary file from (2), scans it for
parent pointers, and stages a dirent in its own temporary dir in
preparation to commit the fixed directory.
5. The parent pointer repair completes and frees the temporary file.
6. The dir repair commits the new directory and scans it again. It
finds the dirent that points to the old temporary file in (2) and
marks the directory corrupt.
Oops! Repair code must never scan the temporary files that other repair
functions create to stage new metadata. They're not supposed to do
that, but the predicate function xrep_is_tempfile is incorrect because
it assumes that any XFS_DIFLAG2_METADATA file cannot ever be a temporary
file, but xrep_tempfile_adjust_directory_tree creates exactly that.
Fix this by setting the IRECOVERY flag on temporary metadata directory
inodes and using that to correct the predicate. Repair code is supposed
to erase all the data in temporary files before releasing them, so it's
ok if a thread scans the temporary file after we drop IRECOVERY.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: bb6cdd5529 ("xfs: hide metadata inodes from everyone because they are special")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
If we need to reset a symlink target to the "durr it's busted" string,
then we clear the zapped flag as well. However, this should be using
the provided helper so that we don't set the zapped state on an
otherwise ok symlink.
Cc: <stable@vger.kernel.org> # v6.10
Fixes: 2651923d8d ("xfs: online repair of symbolic links")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
In commit d9041681dd we introduced some XFS_SICK_*ZAPPED flags so
that the inode record repair code could clean up a damaged inode record
enough to iget the inode but still be able to remember that the higher
level repair code needs to be called. As part of that, we introduced a
xchk_mark_healthy_if_clean helper that is supposed to cause the ZAPPED
state to be removed if that higher level metadata actually checks out.
This was done by setting additional bits in sick_mask hoping that
xchk_update_health will clear all those bits after a healthy scrub.
Unfortunately, that's not quite what sick_mask means -- bits in that
mask are indeed cleared if the metadata is healthy, but they're set if
the metadata is NOT healthy. fsck is only intended to set the ZAPPED
bits explicitly.
If something else sets the CORRUPT/XCORRUPT state after the
xchk_mark_healthy_if_clean call, we end up marking the metadata zapped.
This can happen if the following sequence happens:
1. Scrub runs, discovers that the metadata is fine but could be
optimized and calls xchk_mark_healthy_if_clean on a ZAPPED flag.
That causes the ZAPPED flag to be set in sick_mask because the
metadata is not CORRUPT or XCORRUPT.
2. Repair runs to optimize the metadata.
3. Some other metadata used for cross-referencing in (1) becomes
corrupt.
4. Post-repair scrub runs, but this time it sets CORRUPT or XCORRUPT due
to the events in (3).
5. Now the xchk_health_update sets the ZAPPED flag on the metadata we
just repaired. This is not the correct state.
Fix this by moving the "if healthy" mask to a separate field, and only
ever using it to clear the sick state.
Cc: <stable@vger.kernel.org> # v6.8
Fixes: d9041681dd ("xfs: set inode sick state flags when we zap either ondisk fork")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Way back when we first implemented FICLONE for XFS, life was simple --
either the the entire remapping completed, or something happened and we
had to return an errno explaining what happened. Neither of those
ioctls support returning partial results, so it's all or nothing.
Then things got complicated when copy_file_range came along, because it
actually can return the number of bytes copied, so commit 3f68c1f562
tried to make it so that we could return a partial result if the
REMAP_FILE_CAN_SHORTEN flag is set. This is also how FIDEDUPERANGE can
indicate that the kernel performed a partial deduplication.
Unfortunately, the logic is wrong if an error stops the remapping and
CAN_SHORTEN is not set. Because those callers cannot return partial
results, it is an error for ->remap_file_range to return a positive
quantity that is less than the @len passed in. Implementations really
should be returning a negative errno in this case, because that's what
btrfs (which introduced FICLONE{,RANGE}) did.
Therefore, ->remap_range implementations cannot silently drop an errno
that they might have when the number of bytes remapped is less than the
number of bytes requested and CAN_SHORTEN is not set.
Found by running generic/562 on a 64k fsblock filesystem and wondering
why it reported corrupt files.
Cc: <stable@vger.kernel.org> # v4.20
Fixes: 3fc9f5e409 ("xfs: remove xfs_reflink_remap_range")
Really-Fixes: 3f68c1f562 ("xfs: support returning partial reflink results")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
With the nrext64 feature enabled, it's possible for a data fork to have
2^48 extent mappings. Even with a 64k fsblock size, that maps out to
a bmbt containing more than 2^32 blocks. Therefore, this predicate must
return a u64 count to avoid an integer wraparound that will cause scrub
to do the wrong thing.
It's unlikely that any such filesystem currently exists, because the
incore bmbt would consume more than 64GB of kernel memory on its own,
and so far nobody except me has driven a filesystem that far, judging
from the lack of complaints.
Cc: <stable@vger.kernel.org> # v5.19
Fixes: df9ad5cc7a ("xfs: Introduce macros to represent new maximum extent counts for data/attr forks")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
In the same vein as the previous patch, there's no point in the metapath
scrub setup function doing a lookup on the quota metadir just so it can
validate that lookups work correctly. Instead, retain the quota
directory inode in memory for the lifetime of the mount so that we can
check this meaningfully.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: 128a055291 ("xfs: scrub quota file metapaths")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Don't waste time in xchk_setup_metapath_dqinode doing a second lookup of
the quota inodes, just grab them from the quotainfo structure. The
whole point of this scrubber is to make sure that the dirents exist, so
it's completely silly to do lookups.
Cc: <stable@vger.kernel.org> # v6.13-rc1
Fixes: 128a055291 ("xfs: scrub quota file metapaths")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
In commit ca6448aed4, we created an "end_daddr" variable to fix
fsmap reporting when the end of the range requested falls in the middle
of an unknown (aka free on the rmapbt) region. Unfortunately, I didn't
notice that the the code sets end_daddr to the last sector of the device
but then uses that quantity to compute the length of the synthesized
mapping.
Zizhi Wo later observed that when end_daddr isn't set, we still don't
report the last fsblock on a device because in that case (aka when
info->last is true), the info->high mapping that we pass to
xfs_getfsmap_group_helper has a startblock that points to the last
fsblock. This is also wrong because the code uses startblock to
compute the length of the synthesized mapping.
Fix the second problem by setting end_daddr unconditionally, and fix the
first problem by setting start_daddr to one past the end of the range to
query.
Cc: <stable@vger.kernel.org> # v6.11
Fixes: ca6448aed4 ("xfs: Fix missing interval for missing_owner in xfs fsmap")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reported-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Pull smb server fixes from Steve French:
- fix ctime setting in setattr
- fix reference count on user session to avoid potential race with
session expire
- fix query dir issue
* tag 'v6.13-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: set ATTR_CTIME flags when setting mtime
ksmbd: fix racy issue from session lookup and expire
ksmbd: retry iterate_dir in smb2_query_dir
Pull perf tools fixes from Namhyung Kim:
"A set of random fixes for this cycle.
perf record:
- Fix build-id event size calculation in perf record
- Fix perf record -C/--cpu option on hybrid systems
- Fix perf mem record with precise-ip on SapphireRapids
perf test:
- Refresh hwmon directory before reading the test files
- Make sure system_tsc_freq event is tested on x86 only
Others:
- Usual header file sync
- Fix undefined behavior in perf ftrace profile
- Properly initialize a return variable in perf probe"
* tag 'perf-tools-fixes-for-v6.13-2024-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools: (21 commits)
perf probe: Fix uninitialized variable
libperf: evlist: Fix --cpu argument on hybrid platform
perf test expr: Fix system_tsc_freq for only x86
perf test hwmon_pmu: Fix event file location
perf hwmon_pmu: Use openat rather than dup to refresh directory
perf ftrace: Fix undefined behavior in cmp_profile_data()
perf tools: Fix precise_ip fallback logic
perf tools: Fix build error on generated/fs_at_flags_array.c
tools headers: Sync uapi/linux/prctl.h with the kernel sources
tools headers: Sync uapi/linux/mount.h with the kernel sources
tools headers: Sync uapi/linux/fcntl.h with the kernel sources
tools headers: Sync uapi/asm-generic/mman.h with the kernel sources
tools headers: Sync *xattrat syscall changes with the kernel sources
tools headers: Sync arm64 kvm header with the kernel sources
tools headers: Sync x86 kvm and cpufeature headers with the kernel
tools headers: Sync uapi/linux/kvm.h with the kernel sources
tools headers: Sync uapi/linux/perf_event.h with the kernel sources
tools headers: Sync uapi/drm/drm.h with the kernel sources
perf machine: Initialize machine->env to address a segfault
perf test: Don't signal all processes on system when interrupting tests
...
Pull OpenRISC fixes from Stafford Horne:
- Fix from Masahiro Yamada to fix 6.13 OpenRISC boot issues after
vmlinux.lds.h symbol ordering was changed
- Code formatting fixups from Geert
* tag 'for-linus' of https://github.com/openrisc/linux:
openrisc: Fix misalignments in head.S
openrisc: place exception table at the head of vmlinux
Kumar Kartikeya Dwivedi says:
====================
Add missing size check for BTF-based ctx access
This set fixes a issue reported for tracing and struct ops programs
using btf_ctx_access for ctx checks, where loading a pointer argument
from the ctx doesn't enforce a BPF_DW access size check. The original
report is at link [0]. Also add a regression test along with the fix.
[0]: https://lore.kernel.org/bpf/51338.1732985814@localhost
====================
Link: https://patch.msgid.link/20241212092050.3204165-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Robert Morris reported the following program type which passes the
verifier in [0]:
SEC("struct_ops/bpf_cubic_init")
void BPF_PROG(bpf_cubic_init, struct sock *sk)
{
asm volatile("r2 = *(u16*)(r1 + 0)"); // verifier should demand u64
asm volatile("*(u32 *)(r2 +1504) = 0"); // 1280 in some configs
}
The second line may or may not work, but the first instruction shouldn't
pass, as it's a narrow load into the context structure of the struct ops
callback. The code falls back to btf_ctx_access to ensure correctness
and obtaining the types of pointers. Ensure that the size of the access
is correctly checked to be 8 bytes, otherwise the verifier thinks the
narrow load obtained a trusted BTF pointer and will permit loads/stores
as it sees fit.
Perform the check on size after we've verified that the load is for a
pointer field, as for scalar values narrow loads are fine. Access to
structs passed as arguments to a BPF program are also treated as
scalars, therefore no adjustment is needed in their case.
Existing verifier selftests are broken by this change, but because they
were incorrect. Verifier tests for d_path were performing narrow load
into context to obtain path pointer, had this program actually run it
would cause a crash. The same holds for verifier_btf_ctx_access tests.
[0]: https://lore.kernel.org/bpf/51338.1732985814@localhost
Fixes: 9e15db6613 ("bpf: Implement accurate raw_tp context access via BTF")
Reported-by: Robert Morris <rtm@mit.edu>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241212092050.3204165-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf_prog_aux->func field might be NULL if program does not have
subprograms except for main sub-program. The fixed commit does
bpf_prog_aux->func access unconditionally, which might lead to null
pointer dereference.
The bug could be triggered by replacing the following BPF program:
SEC("tc")
int main_changes(struct __sk_buff *sk)
{
bpf_skb_pull_data(sk, 0);
return 0;
}
With the following BPF program:
SEC("freplace")
long changes_pkt_data(struct __sk_buff *sk)
{
return bpf_skb_pull_data(sk, 0);
}
bpf_prog_aux instance itself represents the main sub-program,
use this property to fix the bug.
Fixes: 81f6d0530b ("bpf: check changes_pkt_data property for extension programs")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202412111822.qGw6tOyB-lkp@intel.com/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241212070711.427443-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth, netfilter and wireless.
Current release - fix to a fix:
- rtnetlink: fix error code in rtnl_newlink()
- tipc: fix NULL deref in cleanup_bearer()
Current release - regressions:
- ip: fix warning about invalid return from in ip_route_input_rcu()
Current release - new code bugs:
- udp: fix L4 hash after reconnect
- eth: lan969x: fix cyclic dependency between modules
- eth: bnxt_en: fix potential crash when dumping FW log coredump
Previous releases - regressions:
- wifi: mac80211:
- fix a queue stall in certain cases of channel switch
- wake the queues in case of failure in resume
- splice: do not checksum AF_UNIX sockets
- virtio_net: fix BUG()s in BQL support due to incorrect accounting
of purged packets during interface stop
- eth:
- stmmac: fix TSO DMA API mis-usage causing oops
- bnxt_en: fixes for HW GRO: GSO type on 5750X chips and oops
due to incorrect aggregation ID mask on 5760X chips
Previous releases - always broken:
- Bluetooth: improve setsockopt() handling of malformed user input
- eth: ocelot: fix PTP timestamping in presence of packet loss
- ptp: kvm: x86: avoid "fail to initialize ptp_kvm" when simply not
supported"
* tag 'net-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
net: dsa: tag_ocelot_8021q: fix broken reception
net: dsa: microchip: KSZ9896 register regmap alignment to 32 bit boundaries
net: renesas: rswitch: fix initial MPIC register setting
Bluetooth: btmtk: avoid UAF in btmtk_process_coredump
Bluetooth: iso: Fix circular lock in iso_conn_big_sync
Bluetooth: iso: Fix circular lock in iso_listen_bis
Bluetooth: SCO: Add support for 16 bits transparent voice setting
Bluetooth: iso: Fix recursive locking warning
Bluetooth: iso: Always release hdev at the end of iso_listen_bis
Bluetooth: hci_event: Fix using rcu_read_(un)lock while iterating
Bluetooth: hci_core: Fix sleeping function called from invalid context
team: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
team: Fix initial vlan_feature set in __team_compute_features
bonding: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
bonding: Fix initial {vlan,mpls}_feature set in bond_compute_features
net, team, bonding: Add netdev_base_features helper
net/sched: netem: account for backlog updates from child qdisc
net: dsa: felix: fix stuck CPU-injected packets with short taprio windows
splice: do not checksum AF_UNIX sockets
net: usb: qmi_wwan: add Telit FE910C04 compositions
...
After a recent change to clamp() and its variants [1] that increases the
coverage of the check that high is greater than low because it can be
done through inlining, certain build configurations (such as s390
defconfig) fail to build with clang with:
block/blk-iocost.c:1101:11: error: call to '__compiletime_assert_557' declared with 'error' attribute: clamp() low limit 1 greater than high limit active
1101 | inuse = clamp_t(u32, inuse, 1, active);
| ^
include/linux/minmax.h:218:36: note: expanded from macro 'clamp_t'
218 | #define clamp_t(type, val, lo, hi) __careful_clamp(type, val, lo, hi)
| ^
include/linux/minmax.h:195:2: note: expanded from macro '__careful_clamp'
195 | __clamp_once(type, val, lo, hi, __UNIQUE_ID(v_), __UNIQUE_ID(l_), __UNIQUE_ID(h_))
| ^
include/linux/minmax.h:188:2: note: expanded from macro '__clamp_once'
188 | BUILD_BUG_ON_MSG(statically_true(ulo > uhi), \
| ^
__propagate_weights() is called with an active value of zero in
ioc_check_iocgs(), which results in the high value being less than the
low value, which is undefined because the value returned depends on the
order of the comparisons.
The purpose of this expression is to ensure inuse is not more than
active and at least 1. This could be written more simply with a ternary
expression that uses min(inuse, active) as the condition so that the
value of that condition can be used if it is not zero and one if it is.
Do this conversion to resolve the error and add a comment to deter
people from turning this back into clamp().
Fixes: 7caa47151a ("blkcg: implement blk-iocost")
Link: https://lore.kernel.org/r/34d53778977747f19cce2abb287bb3e6@AcuMS.aculab.com/ [1]
Suggested-by: David Laight <david.laight@aculab.com>
Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
Closes: https://lore.kernel.org/llvm/CA+G9fYsD7mw13wredcZn0L-KBA3yeoVSTuxnss-AEWMN3ha0cA@mail.gmail.com/
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202412120322.3GfVe3vF-lkp@intel.com/
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There may still exist some pcluster with valid reference counts
during unmounting. Instead of introducing another synchronization
primitive, just try again as unmounting is relatively rare. This
approach is similar to z_erofs_cache_invalidate_folio().
It was also reported by syzbot as a UAF due to commit f5ad9f9a60
("erofs: free pclusters if no cached folio is attached"):
BUG: KASAN: slab-use-after-free in do_raw_spin_trylock+0x72/0x1f0 kernel/locking/spinlock_debug.c:123
..
queued_spin_trylock include/asm-generic/qspinlock.h:92 [inline]
do_raw_spin_trylock+0x72/0x1f0 kernel/locking/spinlock_debug.c:123
__raw_spin_trylock include/linux/spinlock_api_smp.h:89 [inline]
_raw_spin_trylock+0x20/0x80 kernel/locking/spinlock.c:138
spin_trylock include/linux/spinlock.h:361 [inline]
z_erofs_put_pcluster fs/erofs/zdata.c:959 [inline]
z_erofs_decompress_pcluster fs/erofs/zdata.c:1403 [inline]
z_erofs_decompress_queue+0x3798/0x3ef0 fs/erofs/zdata.c:1425
z_erofs_decompressqueue_work+0x99/0xe0 fs/erofs/zdata.c:1437
process_one_work kernel/workqueue.c:3229 [inline]
process_scheduled_works+0xa68/0x1840 kernel/workqueue.c:3310
worker_thread+0x870/0xd30 kernel/workqueue.c:3391
kthread+0x2f2/0x390 kernel/kthread.c:389
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
</TASK>
However, it seems a long outstanding memory leak. Fix it now.
Fixes: f5ad9f9a60 ("erofs: free pclusters if no cached folio is attached")
Reported-by: syzbot+7ff87b095e7ca0c5ac39@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/674c1235.050a0220.ad585.0032.GAE@google.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241203072821.1885740-1-hsiangkao@linux.alibaba.com
The arm64 stacktrace code has a few error conditions where a
WARN_ON_ONCE() is triggered before the stacktrace is terminated and an
error is returned to the caller. The conditions shouldn't be triggered
when unwinding the current task, but it is possible to trigger these
when unwinding another task which is not blocked, as the stack of that
task is concurrently modified. Kent reports that these warnings can be
triggered while running filesystem tests on bcachefs, which calls the
stacktrace code directly.
To produce a meaningful stacktrace of another task, the task in question
should be blocked, but the stacktrace code is expected to be robust to
cases where it is not blocked. Note that this is purely about not
unuduly scaring the user and/or crashing the kernel; stacktraces in such
cases are meaningless and may leak kernel secrets from the stack of the
task being unwound.
Ideally we'd pin the task in a blocked state during the unwind, as we do
for /proc/${PID}/wchan since commit:
42a20f86dc ("sched: Add wrapper for get_wchan() to keep task blocked")
... but a bunch of places don't do that, notably /proc/${PID}/stack,
where we don't pin the task in a blocked state, but do restrict the
output to privileged users since commit:
f8a00cef17 ("proc: restrict kernel stack dumps to root")
... and so it's possible to trigger these warnings accidentally, e.g. by
reading /proc/*/stack (as root):
| for n in $(seq 1 10); do
| while true; do cat /proc/*/stack > /dev/null 2>&1; done &
| done
| ------------[ cut here ]------------
| WARNING: CPU: 3 PID: 166 at arch/arm64/kernel/stacktrace.c:207 arch_stack_walk+0x1c8/0x370
| Modules linked in:
| CPU: 3 UID: 0 PID: 166 Comm: cat Not tainted 6.13.0-rc2-00003-g3dafa7a7925d #2
| Hardware name: linux,dummy-virt (DT)
| pstate: 81400005 (Nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
| pc : arch_stack_walk+0x1c8/0x370
| lr : arch_stack_walk+0x1b0/0x370
| sp : ffff800080773890
| x29: ffff800080773930 x28: fff0000005c44500 x27: fff00000058fa038
| x26: 000000007ffff000 x25: 0000000000000000 x24: 0000000000000000
| x23: ffffa35a8d9600ec x22: 0000000000000000 x21: fff00000043a33c0
| x20: ffff800080773970 x19: ffffa35a8d960168 x18: 0000000000000000
| x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
| x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
| x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
| x8 : ffff8000807738e0 x7 : ffff8000806e3800 x6 : ffff8000806e3818
| x5 : ffff800080773920 x4 : ffff8000806e4000 x3 : ffff8000807738e0
| x2 : 0000000000000018 x1 : ffff8000806e3800 x0 : 0000000000000000
| Call trace:
| arch_stack_walk+0x1c8/0x370 (P)
| stack_trace_save_tsk+0x8c/0x108
| proc_pid_stack+0xb0/0x134
| proc_single_show+0x60/0x120
| seq_read_iter+0x104/0x438
| seq_read+0xf8/0x140
| vfs_read+0xc4/0x31c
| ksys_read+0x70/0x108
| __arm64_sys_read+0x1c/0x28
| invoke_syscall+0x48/0x104
| el0_svc_common.constprop.0+0x40/0xe0
| do_el0_svc+0x1c/0x28
| el0_svc+0x30/0xcc
| el0t_64_sync_handler+0x10c/0x138
| el0t_64_sync+0x198/0x19c
| ---[ end trace 0000000000000000 ]---
Fix this by only warning when unwinding the current task. When unwinding
another task the error conditions will be handled by returning an error
without producing a warning.
The two warnings in kunwind_next_frame_record_meta() were added recently
as part of commit:
c2c6b27b5a ("arm64: stacktrace: unwind exception boundaries")
The warning when recovering the fgraph return address has changed form
many times, but was originally introduced back in commit:
9f416319f4 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing")
Fixes: c2c6b27b5a ("arm64: stacktrace: unwind exception boundaries")
Fixes: 9f416319f4 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241211140704.2498712-3-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Aishwarya reports that warnings are sometimes seen when running the
ftrace kselftests, e.g.
| WARNING: CPU: 5 PID: 2066 at arch/arm64/kernel/stacktrace.c:141 arch_stack_walk+0x4a0/0x4c0
| Modules linked in:
| CPU: 5 UID: 0 PID: 2066 Comm: ftracetest Not tainted 6.13.0-rc2 #2
| Hardware name: linux,dummy-virt (DT)
| pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : arch_stack_walk+0x4a0/0x4c0
| lr : arch_stack_walk+0x248/0x4c0
| sp : ffff800083643d20
| x29: ffff800083643dd0 x28: ffff00007b891400 x27: ffff00007b891928
| x26: 0000000000000001 x25: 00000000000000c0 x24: ffff800082f39d80
| x23: ffff80008003ee8c x22: ffff80008004baa8 x21: ffff8000800533e0
| x20: ffff800083643e10 x19: ffff80008003eec8 x18: 0000000000000000
| x17: 0000000000000000 x16: ffff800083640000 x15: 0000000000000000
| x14: 02a37a802bbb8a92 x13: 00000000000001a9 x12: 0000000000000001
| x11: ffff800082ffad60 x10: ffff800083643d20 x9 : ffff80008003eed0
| x8 : ffff80008004baa8 x7 : ffff800086f2be80 x6 : ffff0000057cf000
| x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff800086f2b690
| x2 : ffff80008004baa8 x1 : ffff80008004baa8 x0 : ffff80008004baa8
| Call trace:
| arch_stack_walk+0x4a0/0x4c0 (P)
| arch_stack_walk+0x248/0x4c0 (L)
| profile_pc+0x44/0x80
| profile_tick+0x50/0x80 (F)
| tick_nohz_handler+0xcc/0x160 (F)
| __hrtimer_run_queues+0x2ac/0x340 (F)
| hrtimer_interrupt+0xf4/0x268 (F)
| arch_timer_handler_virt+0x34/0x60 (F)
| handle_percpu_devid_irq+0x88/0x220 (F)
| generic_handle_domain_irq+0x34/0x60 (F)
| gic_handle_irq+0x54/0x140 (F)
| call_on_irq_stack+0x24/0x58 (F)
| do_interrupt_handler+0x88/0x98
| el1_interrupt+0x34/0x68 (F)
| el1h_64_irq_handler+0x18/0x28
| el1h_64_irq+0x6c/0x70
| queued_spin_lock_slowpath+0x78/0x460 (P)
The warning in question is:
WARN_ON_ONCE(state->common.pc == orig_pc))
... in kunwind_recover_return_address(), which is triggered when
return_to_handler() is encountered in the trace, but
ftrace_graph_ret_addr() cannot find a corresponding original return
address on the fgraph return stack.
This happens because the stacktrace code encounters an exception
boundary where the LR was not live at the time of the exception, but the
LR happens to contain return_to_handler(); either because the task
recently returned there, or due to unfortunate usage of the LR at a
scratch register. In such cases attempts to recover the return address
via ftrace_graph_ret_addr() may fail, triggering the WARN_ON_ONCE()
above and aborting the unwind (hence the stacktrace terminating after
reporting the PC at the time of the exception).
Handling unreliable LR values in these cases is likely to require some
larger rework, so for the moment avoid this problem by restoring the old
behaviour of skipping the LR at exception boundaries, which the
stacktrace code did prior to commit:
c2c6b27b5a ("arm64: stacktrace: unwind exception boundaries")
This commit is effectively a partial revert, keeping the structures and
logic to explicitly identify exception boundaries while still skipping
reporting of the LR. The logic to explicitly identify exception
boundaries is still useful for general robustness and as a building
block for future support for RELIABLE_STACKTRACE.
Fixes: c2c6b27b5a ("arm64: stacktrace: unwind exception boundaries")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: Aishwarya TCV <aishwarya.tcv@arm.com>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241211140704.2498712-2-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
That pool implementation doesn't really work: if the krealloc happens to
move the memory and return another address, the entries in the xarray
become invalid, leading to use-after-free later:
BUG: KASAN: slab-use-after-free in xe_reg_sr_apply_mmio+0x570/0x760 [xe]
Read of size 4 at addr ffff8881244b2590 by task modprobe/2753
Allocated by task 2753:
kasan_save_stack+0x39/0x70
kasan_save_track+0x14/0x40
kasan_save_alloc_info+0x37/0x60
__kasan_kmalloc+0xc3/0xd0
__kmalloc_node_track_caller_noprof+0x200/0x6d0
krealloc_noprof+0x229/0x380
Simplify the code to fix the bug. A better pooling strategy may be added
back later if needed.
Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241209232739.147417-2-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit e5283bd4dfecbd3335f43b62a68e24dae23f59e4)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- SCO: Fix transparent voice setting
- ISO: Locking fixes
- hci_core: Fix sleeping function called from invalid context
- hci_event: Fix using rcu_read_(un)lock while iterating
- btmtk: avoid UAF in btmtk_process_coredump
* tag 'for-net-2024-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: btmtk: avoid UAF in btmtk_process_coredump
Bluetooth: iso: Fix circular lock in iso_conn_big_sync
Bluetooth: iso: Fix circular lock in iso_listen_bis
Bluetooth: SCO: Add support for 16 bits transparent voice setting
Bluetooth: iso: Fix recursive locking warning
Bluetooth: iso: Always release hdev at the end of iso_listen_bis
Bluetooth: hci_event: Fix using rcu_read_(un)lock while iterating
Bluetooth: hci_core: Fix sleeping function called from invalid context
Bluetooth: Improve setsockopt() handling of malformed user input
====================
Link: https://patch.msgid.link/20241212142806.2046274-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The blamed commit changed the dsa_8021q_rcv() calling convention to
accept pre-populated source_port and switch_id arguments. If those are
not available, as in the case of tag_ocelot_8021q, the arguments must be
pre-initialized with -1.
Due to the bug of passing uninitialized arguments in tag_ocelot_8021q,
dsa_8021q_rcv() does not detect that it needs to populate the
source_port and switch_id, and this makes dsa_conduit_find_user() fail,
which leads to packet loss on reception.
Fixes: dcfe767378 ("net: dsa: tag_sja1105: absorb logic for not overwriting precise info into dsa_8021q_rcv()")
Signed-off-by: Robert Hodaszi <robert.hodaszi@digi.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241211144741.1415758-1-robert.hodaszi@digi.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 8d7ae22ae9 ("net: dsa: microchip: KSZ9477 register regmap
alignment to 32 bit boundaries") fixed an issue whereby regmap_reg_range
did not allow writes as 32 bit words to KSZ9477 PHY registers, this fix
for KSZ9896 is adapted from there as the same errata is present in
KSZ9896C as "Module 5: Certain PHY registers must be written as pairs
instead of singly" the explanation below is likewise taken from this
commit.
The commit provided code
to apply "Module 6: Certain PHY registers must be written as pairs instead
of singly" errata for KSZ9477 as this chip for certain PHY registers
(0xN120 to 0xN13F, N=1,2,3,4,5) must be accessed as 32 bit words instead
of 16 or 8 bit access.
Otherwise, adjacent registers (no matter if reserved or not) are
overwritten with 0x0.
Without this patch some registers (e.g. 0x113c or 0x1134) required for 32
bit access are out of valid regmap ranges.
As a result, following error is observed and KSZ9896 is not properly
configured:
ksz-switch spi1.0: can't rmw 32bit reg 0x113c: -EIO
ksz-switch spi1.0: can't rmw 32bit reg 0x1134: -EIO
ksz-switch spi1.0 lan1 (uninitialized): failed to connect to PHY: -EIO
ksz-switch spi1.0 lan1 (uninitialized): error -5 setting up PHY for tree 0, switch 0, port 0
The solution is to modify regmap_reg_range to allow accesses with 4 bytes
boundaries.
Fixes: 5c844d57aa ("net: dsa: microchip: fix writes to phy registers >= 0x10")
Signed-off-by: Jesse Van Gavere <jesse.vangavere@scioteq.com>
Link: https://patch.msgid.link/20241211092932.26881-1-jesse.vangavere@scioteq.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
hci_devcd_append may lead to the release of the skb, so it cannot be
accessed once it is called.
==================================================================
BUG: KASAN: slab-use-after-free in btmtk_process_coredump+0x2a7/0x2d0 [btmtk]
Read of size 4 at addr ffff888033cfabb0 by task kworker/0:3/82
CPU: 0 PID: 82 Comm: kworker/0:3 Tainted: G U 6.6.40-lockdep-03464-g1d8b4eb3060e #1 b0b3c1cc0c842735643fb411799d97921d1f688c
Hardware name: Google Yaviks_Ufs/Yaviks_Ufs, BIOS Google_Yaviks_Ufs.15217.552.0 05/07/2024
Workqueue: events btusb_rx_work [btusb]
Call Trace:
<TASK>
dump_stack_lvl+0xfd/0x150
print_report+0x131/0x780
kasan_report+0x177/0x1c0
btmtk_process_coredump+0x2a7/0x2d0 [btmtk 03edd567dd71a65958807c95a65db31d433e1d01]
btusb_recv_acl_mtk+0x11c/0x1a0 [btusb 675430d1e87c4f24d0c1f80efe600757a0f32bec]
btusb_rx_work+0x9e/0xe0 [btusb 675430d1e87c4f24d0c1f80efe600757a0f32bec]
worker_thread+0xe44/0x2cc0
kthread+0x2ff/0x3a0
ret_from_fork+0x51/0x80
ret_from_fork_asm+0x1b/0x30
</TASK>
Allocated by task 82:
stack_trace_save+0xdc/0x190
kasan_set_track+0x4e/0x80
__kasan_slab_alloc+0x4e/0x60
kmem_cache_alloc+0x19f/0x360
skb_clone+0x132/0xf70
btusb_recv_acl_mtk+0x104/0x1a0 [btusb]
btusb_rx_work+0x9e/0xe0 [btusb]
worker_thread+0xe44/0x2cc0
kthread+0x2ff/0x3a0
ret_from_fork+0x51/0x80
ret_from_fork_asm+0x1b/0x30
Freed by task 1733:
stack_trace_save+0xdc/0x190
kasan_set_track+0x4e/0x80
kasan_save_free_info+0x28/0xb0
____kasan_slab_free+0xfd/0x170
kmem_cache_free+0x183/0x3f0
hci_devcd_rx+0x91a/0x2060 [bluetooth]
worker_thread+0xe44/0x2cc0
kthread+0x2ff/0x3a0
ret_from_fork+0x51/0x80
ret_from_fork_asm+0x1b/0x30
The buggy address belongs to the object at ffff888033cfab40
which belongs to the cache skbuff_head_cache of size 232
The buggy address is located 112 bytes inside of
freed 232-byte region [ffff888033cfab40, ffff888033cfac28)
The buggy address belongs to the physical page:
page:00000000a174ba93 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x33cfa
head:00000000a174ba93 order:1 entire_mapcount:0 nr_pages_mapped:0 pincount:0
anon flags: 0x4000000000000840(slab|head|zone=1)
page_type: 0xffffffff()
raw: 4000000000000840 ffff888100848a00 0000000000000000 0000000000000001
raw: 0000000000000000 0000000080190019 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888033cfaa80: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc
ffff888033cfab00: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
>ffff888033cfab80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888033cfac00: fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc
ffff888033cfac80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Check if we need to call hci_devcd_complete before calling
hci_devcd_append. That requires that we check data->cd_info.cnt >=
MTK_COREDUMP_NUM instead of data->cd_info.cnt > MTK_COREDUMP_NUM, as we
increment data->cd_info.cnt only once the call to hci_devcd_append
succeeds.
Fixes: 0b70151328 ("Bluetooth: btusb: mediatek: add MediaTek devcoredump support")
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The voice setting is used by sco_connect() or sco_conn_defer_accept()
after being set by sco_sock_setsockopt().
The PCM part of the voice setting is used for offload mode through PCM
chipset port.
This commits add support for mSBC 16 bits offloading, i.e. audio data
not transported over HCI.
The BCM4349B1 supports 16 bits transparent data on its I2S port.
If BT_VOICE_TRANSPARENT is used when accepting a SCO connection, this
gives only garbage audio while using BT_VOICE_TRANSPARENT_16BIT gives
correct audio.
This has been tested with connection to iPhone 14 and Samsung S24.
Fixes: ad10b1a487 ("Bluetooth: Add Bluetooth socket voice option")
Signed-off-by: Frédéric Danis <frederic.danis@collabora.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This updates iso_sock_accept to use nested locking for the parent
socket, to avoid lockdep warnings caused because the parent and
child sockets are locked by the same thread:
[ 41.585683] ============================================
[ 41.585688] WARNING: possible recursive locking detected
[ 41.585694] 6.12.0-rc6+ #22 Not tainted
[ 41.585701] --------------------------------------------
[ 41.585705] iso-tester/3139 is trying to acquire lock:
[ 41.585711] ffff988b29530a58 (sk_lock-AF_BLUETOOTH)
at: bt_accept_dequeue+0xe3/0x280 [bluetooth]
[ 41.585905]
but task is already holding lock:
[ 41.585909] ffff988b29533a58 (sk_lock-AF_BLUETOOTH)
at: iso_sock_accept+0x61/0x2d0 [bluetooth]
[ 41.586064]
other info that might help us debug this:
[ 41.586069] Possible unsafe locking scenario:
[ 41.586072] CPU0
[ 41.586076] ----
[ 41.586079] lock(sk_lock-AF_BLUETOOTH);
[ 41.586086] lock(sk_lock-AF_BLUETOOTH);
[ 41.586093]
*** DEADLOCK ***
[ 41.586097] May be due to missing lock nesting notation
[ 41.586101] 1 lock held by iso-tester/3139:
[ 41.586107] #0: ffff988b29533a58 (sk_lock-AF_BLUETOOTH)
at: iso_sock_accept+0x61/0x2d0 [bluetooth]
Fixes: ccf74f2390 ("Bluetooth: Add BTPROTO_ISO socket type")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Since hci_get_route holds the device before returning, the hdev
should be released with hci_dev_put at the end of iso_listen_bis
even if the function returns with an error.
Fixes: 02171da6e8 ("Bluetooth: ISO: Add hcon for listening bis sk")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The usage of rcu_read_(un)lock while inside list_for_each_entry_rcu is
not safe since for the most part entries fetched this way shall be
treated as rcu_dereference:
Note that the value returned by rcu_dereference() is valid
only within the enclosing RCU read-side critical section [1]_.
For example, the following is **not** legal::
rcu_read_lock();
p = rcu_dereference(head.next);
rcu_read_unlock();
x = p->address; /* BUG!!! */
rcu_read_lock();
y = p->data; /* BUG!!! */
rcu_read_unlock();
Fixes: a0bfde167b ("Bluetooth: ISO: Add support for connecting multiple BISes")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The arguments for __dma_buf_debugfs_list_del do not match for both the
CONFIG_DEBUG_FS case and the !CONFIG_DEBUG_FS case. The !CONFIG_DEBUG_FS
case should take a struct dma_buf *, but it's currently struct file *.
This can lead to the build error:
error: passing argument 1 of ‘__dma_buf_debugfs_list_del’ from
incompatible pointer type [-Werror=incompatible-pointer-types]
dma-buf.c:63:53: note: expected ‘struct file *’ but argument is of
type ‘struct dma_buf *’
63 | static void __dma_buf_debugfs_list_del(struct file *file)
Fixes: bfc7bc5393 ("dma-buf: Do not build debugfs related code when !CONFIG_DEBUG_FS")
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
Link: https://patchwork.freedesktop.org/patch/msgid/20241117170326.1971113-1-tjmercier@google.com
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Fix bogus test reports in rpath.sh selftest by adding permanent
neighbor entries, from Phil Sutter.
2) Lockdep reports possible ABBA deadlock in xt_IDLETIMER, fix it by
removing sysfs out of the mutex section, also from Phil Sutter.
3) It is illegal to release basechain via RCU callback, for several
reasons. Keep it simple and safe by calling synchronize_rcu() instead.
This is a partially reverting a botched recent attempt of me to fix
this basechain release path on netdevice removal.
From Florian Westphal.
netfilter pull request 24-12-11
* tag 'nf-24-12-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: do not defer rule destruction via call_rcu
netfilter: IDLETIMER: Fix for possible ABBA deadlock
selftests: netfilter: Stabilize rpath.sh
====================
Link: https://patch.msgid.link/20241211230130.176937-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently, the RIIC driver may run the I2C bus faster than requested,
which may cause subtle failures. E.g. Biju reported a measured bus
speed of 450 kHz instead of the expected maximum of 400 kHz on RZ/G2L.
The initial calculation of the bus period uses DIV_ROUND_UP(), to make
sure the actual bus speed never becomes faster than the requested bus
speed. However, the subsequent division-by-two steps do not use
round-up, which may lead to a too-small period, hence a too-fast and
possible out-of-spec bus speed. E.g. on RZ/Five, requesting a bus speed
of 100 resp. 400 kHz will yield too-fast target bus speeds of 100806
resp. 403226 Hz instead of 97656 resp. 390625 Hz.
Fix this by using DIV_ROUND_UP() in the subsequent divisions, too.
Tested on RZ/A1H, RZ/A2M, and RZ/Five.
Fixes: d982d66514 ("i2c: riic: remove clock and frequency restrictions")
Reported-by: Biju Das <biju.das.jz@bp.renesas.com>
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: <stable@vger.kernel.org> # v4.15+
Link: https://lore.kernel.org/r/c59aea77998dfea1b4456c4b33b55ab216fcbf5e.1732284746.git.geert+renesas@glider.be
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Drivers like mlx5 expose NIC's vlan_features such as
NETIF_F_GSO_UDP_TUNNEL & NETIF_F_GSO_UDP_TUNNEL_CSUM which are
later not propagated when the underlying devices are bonded and
a vlan device created on top of the bond.
Right now, the more cumbersome workaround for this is to create
the vlan on top of the mlx5 and then enslave the vlan devices
to a bond.
To fix this, add NETIF_F_GSO_ENCAP_ALL to BOND_VLAN_FEATURES
such that bond_compute_features() can probe and propagate the
vlan_features from the slave devices up to the vlan device.
Given the following bond:
# ethtool -i enp2s0f{0,1}np{0,1}
driver: mlx5_core
[...]
# ethtool -k enp2s0f0np0 | grep udp
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-udp-segmentation: on
rx-udp_tunnel-port-offload: on
rx-udp-gro-forwarding: off
# ethtool -k enp2s0f1np1 | grep udp
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-udp-segmentation: on
rx-udp_tunnel-port-offload: on
rx-udp-gro-forwarding: off
# ethtool -k bond0 | grep udp
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-udp-segmentation: on
rx-udp_tunnel-port-offload: off [fixed]
rx-udp-gro-forwarding: off
Before:
# ethtool -k bond0.100 | grep udp
tx-udp_tnl-segmentation: off [requested on]
tx-udp_tnl-csum-segmentation: off [requested on]
tx-udp-segmentation: on
rx-udp_tunnel-port-offload: off [fixed]
rx-udp-gro-forwarding: off
After:
# ethtool -k bond0.100 | grep udp
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-udp-segmentation: on
rx-udp_tunnel-port-offload: off [fixed]
rx-udp-gro-forwarding: off
Various users have run into this reporting performance issues when
configuring Cilium in vxlan tunneling mode and having the combination
of bond & vlan for the core devices connecting the Kubernetes cluster
to the outside world.
Fixes: a9b3ace44c ("bonding: fix vlan_features computing")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: Ido Schimmel <idosch@idosch.org>
Cc: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20241210141245.327886-3-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
If a bonding device has slave devices, then the current logic to derive
the feature set for the master bond device is limited in that flags which
are fully supported by the underlying slave devices cannot be propagated
up to vlan devices which sit on top of bond devices. Instead, these get
blindly masked out via current NETIF_F_ALL_FOR_ALL logic.
vlan_features and mpls_features should reuse netdev_base_features() in
order derive the set in the same way as ndo_fix_features before iterating
through the slave devices to refine the feature set.
Fixes: a9b3ace44c ("bonding: fix vlan_features computing")
Fixes: 2e770b507c ("net: bonding: Inherit MPLS features from slave devices")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Nikolay Aleksandrov <razor@blackwall.org>
Cc: Ido Schimmel <idosch@idosch.org>
Cc: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://patch.msgid.link/20241210141245.327886-2-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Since the linked fixes: commit, err is returned uninitialized due to the
removal of "return 0". Initialize err to fix it.
This fixes the following intermittent test failure on release builds:
$ perf test "testsuite_probe"
...
-- [ FAIL ] -- perf_probe :: test_invalid_options :: mutually exclusive options :: -L foo -V bar (output regexp parsing)
Regexp not found: \"Error: switch .+ cannot be used with switch .+\"
...
Fixes: 080e47b2a2 ("perf probe: Introduce quotation marks support")
Tested-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20241211085525.519458-2-james.clark@linaro.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
In general, 'qlen' of any classful qdisc should keep track of the
number of packets that the qdisc itself and all of its children holds.
In case of netem, 'qlen' only accounts for the packets in its internal
tfifo. When netem is used with a child qdisc, the child qdisc can use
'qdisc_tree_reduce_backlog' to inform its parent, netem, about created
or dropped SKBs. This function updates 'qlen' and the backlog statistics
of netem, but netem does not account for changes made by a child qdisc.
'qlen' then indicates the wrong number of packets in the tfifo.
If a child qdisc creates new SKBs during enqueue and informs its parent
about this, netem's 'qlen' value is increased. When netem dequeues the
newly created SKBs from the child, the 'qlen' in netem is not updated.
If 'qlen' reaches the configured sch->limit, the enqueue function stops
working, even though the tfifo is not full.
Reproduce the bug:
Ensure that the sender machine has GSO enabled. Configure netem as root
qdisc and tbf as its child on the outgoing interface of the machine
as follows:
$ tc qdisc add dev <oif> root handle 1: netem delay 100ms limit 100
$ tc qdisc add dev <oif> parent 1:0 tbf rate 50Mbit burst 1542 latency 50ms
Send bulk TCP traffic out via this interface, e.g., by running an iPerf3
client on the machine. Check the qdisc statistics:
$ tc -s qdisc show dev <oif>
Statistics after 10s of iPerf3 TCP test before the fix (note that
netem's backlog > limit, netem stopped accepting packets):
qdisc netem 1: root refcnt 2 limit 1000 delay 100ms
Sent 2767766 bytes 1848 pkt (dropped 652, overlimits 0 requeues 0)
backlog 4294528236b 1155p requeues 0
qdisc tbf 10: parent 1:1 rate 50Mbit burst 1537b lat 50ms
Sent 2767766 bytes 1848 pkt (dropped 327, overlimits 7601 requeues 0)
backlog 0b 0p requeues 0
Statistics after the fix:
qdisc netem 1: root refcnt 2 limit 1000 delay 100ms
Sent 37766372 bytes 24974 pkt (dropped 9, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc tbf 10: parent 1:1 rate 50Mbit burst 1537b lat 50ms
Sent 37766372 bytes 24974 pkt (dropped 327, overlimits 96017 requeues 0)
backlog 0b 0p requeues 0
tbf segments the GSO SKBs (tbf_segment) and updates the netem's 'qlen'.
The interface fully stops transferring packets and "locks". In this case,
the child qdisc and tfifo are empty, but 'qlen' indicates the tfifo is at
its limit and no more packets are accepted.
This patch adds a counter for the entries in the tfifo. Netem's 'qlen' is
only decreased when a packet is returned by its dequeue function, and not
during enqueuing into the child qdisc. External updates to 'qlen' are thus
accounted for and only the behavior of the backlog statistics changes. As
in other qdiscs, 'qlen' then keeps track of how many packets are held in
netem and all of its children. As before, sch->limit remains as the
maximum number of packets in the tfifo. The same applies to netem's
backlog statistics.
Fixes: 50612537e9 ("netem: fix classful handling")
Signed-off-by: Martin Ottens <martin.ottens@fau.de>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20241210131412.1837202-1-martin.ottens@fau.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Simon Wunderlich says:
====================
Here are some batman-adv bugfixes:
- fix TT unitialized data and size limit issues, by Remi Pommarel
(3 patches)
* tag 'batadv-net-pullrequest-20241210' of git://git.open-mesh.org/linux-merge:
batman-adv: Do not let TT changes list grows indefinitely
batman-adv: Remove uninitialized data in full table TT response
batman-adv: Do not send uninitialized TT changes
====================
Link: https://patch.msgid.link/20241210135024.39068-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With this port schedule:
tc qdisc replace dev $send_if parent root handle 100 taprio \
num_tc 8 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
map 0 1 2 3 4 5 6 7 \
base-time 0 cycle-time 10000 \
sched-entry S 01 1250 \
sched-entry S 02 1250 \
sched-entry S 04 1250 \
sched-entry S 08 1250 \
sched-entry S 10 1250 \
sched-entry S 20 1250 \
sched-entry S 40 1250 \
sched-entry S 80 1250 \
flags 2
ptp4l would fail to take TX timestamps of Pdelay_Resp messages like this:
increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
ptp4l[4134.168]: port 2: send peer delay response failed
It turns out that the driver can't take their TX timestamps because it
can't transmit them in the first place. And there's nothing special
about the Pdelay_Resp packets - they're just regular 68 byte packets.
But with this taprio configuration, the switch would refuse to send even
the ETH_ZLEN minimum packet size.
This should have definitely not been the case. When applying the taprio
config, the driver prints:
mscc_felix 0000:00:00.5: port 0 tc 0 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 1 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 2 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 3 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 4 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 5 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 6 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 7 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 132 octets including FCS
and thus, everything under 132 bytes - ETH_FCS_LEN should have been sent
without problems. Yet it's not.
For the forwarding path, the configuration is fine, yet packets injected
from Linux get stuck with this schedule no matter what.
The first hint that the static guard bands are the cause of the problem
is that reverting Michael Walle's commit 297c4de6f7 ("net: dsa: felix:
re-enable TAS guard band mode") made things work. It must be that the
guard bands are calculated incorrectly.
I remembered that there is a magic constant in the driver, set to 33 ns
for no logical reason other than experimentation, which says "never let
the static guard bands get so large as to leave less than this amount of
remaining space in the time slot, because the queue system will refuse
to schedule packets otherwise, and they will get stuck". I had a hunch
that my previous experimentally-determined value was only good for
packets coming from the forwarding path, and that the CPU injection path
needed more.
I came to the new value of 35 ns through binary search, after seeing
that with 544 ns (the bit time required to send the Pdelay_Resp packet
at gigabit) it works. Again, this is purely experimental, there's no
logic and the manual doesn't say anything.
The new driver prints for this schedule look like this:
mscc_felix 0000:00:00.5: port 0 tc 0 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 1 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 2 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 3 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 4 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 5 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 6 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS
mscc_felix 0000:00:00.5: port 0 tc 7 min gate length 1250 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 131 octets including FCS
So yes, the maximum MTU is now even smaller by 1 byte than before.
This is maybe counter-intuitive, but makes more sense with a diagram of
one time slot.
Before:
Gate open Gate close
| |
v 1250 ns total time slot duration v
<---------------------------------------------------->
<----><---------------------------------------------->
33 ns 1217 ns static guard band
useful
Gate open Gate close
| |
v 1250 ns total time slot duration v
<---------------------------------------------------->
<-----><--------------------------------------------->
35 ns 1215 ns static guard band
useful
The static guard band implemented by this switch hardware directly
determines the maximum allowable MTU for that traffic class. The larger
it is, the earlier the switch will stop scheduling frames for
transmission, because otherwise they might overrun the gate close time
(and avoiding that is the entire purpose of Michael's patch).
So, we now have guard bands smaller by 2 ns, thus, in this particular
case, we lose a byte of the maximum MTU.
Fixes: 11afdc6526 ("net: dsa: felix: tc-taprio intervals smaller than MTU should send at least one packet")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Michael Walle <mwalle@kernel.org>
Link: https://patch.msgid.link/20241210132640.3426788-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When `skb_splice_from_iter` was introduced, it inadvertently added
checksumming for AF_UNIX sockets. This resulted in significant
slowdowns, for example when using sendfile over unix sockets.
Using the test code in [1] in my test setup (2G single core qemu),
the client receives a 1000M file in:
- without the patch: 1482ms (+/- 36ms)
- with the patch: 652.5ms (+/- 22.9ms)
This commit addresses the issue by marking checksumming as unnecessary in
`unix_stream_sendmsg`
Cc: stable@vger.kernel.org
Signed-off-by: Frederik Deweerdt <deweerdt.lkml@gmail.com>
Fixes: 2e910b9532 ("net: Add a function to splice pages into an skbuff for MSG_SPLICE_PAGES")
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/Z1fMaHkRf8cfubuE@xiberoa
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In export_udmabuf(), if dma_buf_fd() fails because the FD table is full, a
dma_buf owning the udmabuf has already been created; but the error handling
in udmabuf_create() will tear down the udmabuf without doing anything about
the containing dma_buf.
This leaves a dma_buf in memory that contains a dangling pointer; though
that doesn't seem to lead to anything bad except a memory leak.
Fix it by moving the dma_buf_fd() call out of export_udmabuf() so that we
can give it different error handling.
Note that the shape of this code changed a lot in commit 5e72b2b41a
("udmabuf: convert udmabuf driver to use folios"); but the memory leak
seems to have existed since the introduction of udmabuf.
Fixes: fbb0de7950 ("Add udmabuf misc device")
Acked-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241204-udmabuf-fixes-v2-3-23887289de1c@google.com
nf_tables_chain_destroy can sleep, it can't be used from call_rcu
callbacks.
Moreover, nf_tables_rule_release() is only safe for error unwinding,
while transaction mutex is held and the to-be-desroyed rule was not
exposed to either dataplane or dumps, as it deactives+frees without
the required synchronize_rcu() in-between.
nft_rule_expr_deactivate() callbacks will change ->use counters
of other chains/sets, see e.g. nft_lookup .deactivate callback, these
must be serialized via transaction mutex.
Also add a few lockdep asserts to make this more explicit.
Calling synchronize_rcu() isn't ideal, but fixing this without is hard
and way more intrusive. As-is, we can get:
WARNING: .. net/netfilter/nf_tables_api.c:5515 nft_set_destroy+0x..
Workqueue: events nf_tables_trans_destroy_work
RIP: 0010:nft_set_destroy+0x3fe/0x5c0
Call Trace:
<TASK>
nf_tables_trans_destroy_work+0x6b7/0xad0
process_one_work+0x64a/0xce0
worker_thread+0x613/0x10d0
In case the synchronize_rcu becomes an issue, we can explore alternatives.
One way would be to allocate nft_trans_rule objects + one nft_trans_chain
object, deactivate the rules + the chain and then defer the freeing to the
nft destroy workqueue. We'd still need to keep the synchronize_rcu path as
a fallback to handle -ENOMEM corner cases though.
Reported-by: syzbot+b26935466701e56cfdc2@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67478d92.050a0220.253251.0062.GAE@google.com/T/
Fixes: c03d278fdf ("netfilter: nf_tables: wait for rcu grace period on net_device removal")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Deletion of the last rule referencing a given idletimer may happen at
the same time as a read of its file in sysfs:
| ======================================================
| WARNING: possible circular locking dependency detected
| 6.12.0-rc7-01692-g5e9a28f41134-dirty #594 Not tainted
| ------------------------------------------------------
| iptables/3303 is trying to acquire lock:
| ffff8881057e04b8 (kn->active#48){++++}-{0:0}, at: __kernfs_remove+0x20
|
| but task is already holding lock:
| ffffffffa0249068 (list_mutex){+.+.}-{3:3}, at: idletimer_tg_destroy_v]
|
| which lock already depends on the new lock.
A simple reproducer is:
| #!/bin/bash
|
| while true; do
| iptables -A INPUT -i foo -j IDLETIMER --timeout 10 --label "testme"
| iptables -D INPUT -i foo -j IDLETIMER --timeout 10 --label "testme"
| done &
| while true; do
| cat /sys/class/xt_idletimer/timers/testme >/dev/null
| done
Avoid this by freeing list_mutex right after deleting the element from
the list, then continuing with the teardown.
Fixes: 0902b469bd ("netfilter: xtables: idletimer target implementation")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
On some systems, neighbor discoveries from ns1 for fec0:42::1 (i.e., the
martian trap address) would happen at the wrong time and cause
false-negative test result.
Problem analysis also discovered that IPv6 martian ping test was broken
in that sent neighbor discoveries, not echo requests were inadvertently
trapped
Avoid the race condition by introducing the neighbors to each other
upfront. Also pin down the firewall rules to matching on echo requests
only.
Fixes: efb056e5f1 ("netfilter: ip6t_rpfilter: Fix regression with VRF interfaces")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This reverts commit 5c26d2f1d3.
It turns out that we can't do this, because while the old behavior of
ignoring ignorable code points was most definitely wrong, we have
case-folding filesystems with on-disk hash values with that wrong
behavior.
So now you can't look up those names, because they hash to something
different.
Of course, it's also entirely possible that in the meantime people have
created *new* files with the new ("more correct") case folding logic,
and reverting will just make other things break.
The correct solution is to not do case folding in filesystems, but
sadly, people seem to never really understand that. People still see it
as a feature, not a bug.
Reported-by: Qi Han <hanqi@vivo.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219586
Cc: Gabriel Krisman Bertazi <krisman@suse.de>
Requested-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull vfio fix from Alex Williamson:
- Fix migration dirty page tracking support in the mlx5-vfio-pci
variant driver in configurations where the system page size exceeds
the device maximum message size, and anticipate device updates where
the opposite may also be required (Yishai Hadas)
* tag 'vfio-v6.13-rc3' of https://github.com/awilliam/linux-vfio:
vfio/mlx5: Align the page tracking max message size with the device capability
Pull Kselftest fix from Shuah Khan:
- fix the offset for kprobe syntax error test case when checking the
BTF arguments on 64-bit powerpc
* tag 'linux_kselftest-fixes-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/ftrace: adjust offset for kprobe syntax error test
When CONFIG_DEBUG_RT_MUTEXES=y, mutex_lock->rt_mutex_try_acquire
would change from rt_mutex_cmpxchg_acquire to
rt_mutex_slowtrylock():
raw_spin_lock_irqsave(&lock->wait_lock, flags);
ret = __rt_mutex_slowtrylock(lock);
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
Because queued_spin_#ops to ticket_#ops is changed one by one by
jump_label, raw_spin_lock/unlock would cause a deadlock during the
changing.
That means in arch/riscv/kernel/jump_label.c:
1.
arch_jump_label_transform_queue() ->
mutex_lock(&text_mutex); +-> raw_spin_lock -> queued_spin_lock
|-> raw_spin_unlock -> queued_spin_unlock
patch_insn_write -> change the raw_spin_lock to ticket_lock
mutex_unlock(&text_mutex);
...
2. /* Dirty the lock value */
arch_jump_label_transform_queue() ->
mutex_lock(&text_mutex); +-> raw_spin_lock -> *ticket_lock*
|-> raw_spin_unlock -> *queued_spin_unlock*
/* BUG: ticket_lock with queued_spin_unlock */
patch_insn_write -> change the raw_spin_unlock to ticket_unlock
mutex_unlock(&text_mutex);
...
3. /* Dead lock */
arch_jump_label_transform_queue() ->
mutex_lock(&text_mutex); +-> raw_spin_lock -> ticket_lock /* deadlock! */
|-> raw_spin_unlock -> ticket_unlock
patch_insn_write -> change other raw_spin_#op -> ticket_#op
mutex_unlock(&text_mutex);
So, the solution is to disable mutex usage of
arch_jump_label_transform_queue() during early_boot_irqs_disabled, just
like we have done for stop_machine.
Reported-by: Conor Dooley <conor@kernel.org>
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
Fixes: ab83647fad ("riscv: Add qspinlock support")
Link: https://lore.kernel.org/linux-riscv/CAJF2gTQwYTGinBmCSgVUoPv0_q4EPt_+WiyfUA1HViAKgUzxAg@mail.gmail.com/T/#mf488e6347817fca03bb93a7d34df33d8615b3775
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Tested-by: Conor Dooley <conor.dooley@microchip.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Tested-by: Nam Cao <namcao@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20241130153310.3349484-1-guoren@kernel.org
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Since the linked fixes: commit, specifying a CPU on hybrid platforms
results in an error because Perf tries to open an extended type event
on "any" CPU which isn't valid. Extended type events can only be opened
on CPUs that match the type.
Before (working):
$ perf record --cpu 1 -- true
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 2.385 MB perf.data (7 samples) ]
After (not working):
$ perf record -C 1 -- true
WARNING: A requested CPU in '1' is not supported by PMU 'cpu_atom' (CPUs 16-27) for event 'cycles:P'
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (cpu_atom/cycles:P/).
/bin/dmesg | grep -i perf may provide additional information.
(Ignore the warning message, that's expected and not particularly
relevant to this issue).
This is because perf_cpu_map__intersect() of the user specified CPU (1)
and one of the PMU's CPUs (16-27) correctly results in an empty (NULL)
CPU map. However for the purposes of opening an event, libperf converts
empty CPU maps into an any CPU (-1) which the kernel rejects.
Fix it by deleting evsels with empty CPU maps in the specific case where
user requested CPU maps are evaluated.
Fixes: 251aa04024 ("perf parse-events: Wildcard most "numeric" events")
Reviewed-by: Ian Rogers <irogers@google.com>
Tested-by: Thomas Falcon <thomas.falcon@intel.com>
Signed-off-by: James Clark <james.clark@linaro.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20241114160450.295844-2-james.clark@linaro.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
In 'NOFENTRY_ARGS' test case for syntax check, any offset X of
`vfs_read+X` except function entry offset (0) fits the criterion,
even if that offset is not at instruction boundary, as the parser
comes before probing. But with "ENDBR64" instruction on x86, offset
4 is treated as function entry. So, X can't be 4 as well. Thus, 8
was used as offset for the test case. On 64-bit powerpc though, any
offset <= 16 can be considered function entry depending on build
configuration (see arch_kprobe_on_func_entry() for implementation
details). So, use `vfs_read+20` to accommodate that scenario too.
Link: https://lore.kernel.org/r/20241129202621.721159-1-hbathini@linux.ibm.com
Fixes: 4231f30fcc ("selftests/ftrace: Add BTF arguments test cases")
Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
The bt_copy_from_sockptr() return value is being misinterpreted by most
users: a non-zero result is mistakenly assumed to represent an error code,
but actually indicates the number of bytes that could not be copied.
Remove bt_copy_from_sockptr() and adapt callers to use
copy_safe_from_sockptr().
For sco_sock_setsockopt() (case BT_CODEC) use copy_struct_from_sockptr() to
scrub parts of uninitialized buffer.
Opportunistically, rename `len` to `optlen` in hci_sock_setsockopt_old()
and hci_sock_setsockopt().
Fixes: 51eda36d33 ("Bluetooth: SCO: Fix not validating setsockopt user input")
Fixes: a97de7bff1 ("Bluetooth: RFCOMM: Fix not validating setsockopt user input")
Fixes: 4f3951242a ("Bluetooth: L2CAP: Fix not validating setsockopt user input")
Fixes: 9e8742cdfc ("Bluetooth: ISO: Fix not validating setsockopt user input")
Fixes: b2186061d6 ("Bluetooth: hci_sock: Fix not validating setsockopt user input")
Reviewed-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Reviewed-by: David Wei <dw@davidwei.uk>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Since commit cdd30ebb1b ("module: Convert symbol namespace to string
literal"), exported symbols marked by EXPORT_SYMBOL_NS(_GPL) are
ignored by "kernel-doc -export" in fresh build of "make htmldocs".
This is because regex in the perl script for those markers fails to
match the new signatures:
- EXPORT_SYMBOL_NS(symbol, "ns");
- EXPORT_SYMBOL_NS_GPL(symbol, "ns");
Update the regex so that it matches quoted string.
Note: Escape sequence of \w is good for C identifiers, but can be
too strict for quoted strings. Instead, use \S, which matches any
non-whitespace character, for compatibility with possible extension
of namespace convention in the future [1].
Fixes: cdd30ebb1b ("module: Convert symbol namespace to string literal")
Link: https://lore.kernel.org/CAK7LNATMufXP0EA6QUE9hBkZMa6vJO6ZiaYuak2AhOrd2nSVKQ@mail.gmail.com/ [1]
Signed-off-by: Akira Yokosawa <akiyks@gmail.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/e5c43f36-45cd-49f4-b7b8-ff342df3c7a4@gmail.com
One specific test condition: the default registers of p[j].reg ~
p[j+3].reg are 0, TASDEVICE_REG(0x00, 0x14, 0x38)(PLT_FLAG_REG),
TASDEVICE_REG(0x00, 0x14, 0x40)(SINEGAIN_REG), and
TASDEVICE_REG(0x00, 0x14, 0x44)(SINEGAIN2_REG). After first calibration,
they are freshed to TASDEVICE_REG(0x00, 0x1a, 0x20), TASDEVICE_REG(0x00,
0x16, 0x58)(PLT_FLAG_REG), TASDEVICE_REG(0x00, 0x14, 0x44)(SINEGAIN_REG),
and TASDEVICE_REG(0x00, 0x16, 0x64)(SINEGAIN2_REG) via "Calibration Start"
kcontrol. In second calibration, the p[j].reg ~ p[j+3].reg have already
become tas2781_cali_start_reg. However, p[j+2].reg, TASDEVICE_REG(0x00,
0x14, 0x44)(SINEGAIN_REG), will be freshed to TASDEVICE_REG(0x00, 0x16,
0x64), which is the third register in the input params of the kcontrol.
This is why only first calibration can work, the second-time, third-time
or more-time calibration always failed without reboot. Of course, if no
p[j].reg is in the list of tas2781_cali_start_reg, this stress test can
work well.
Fixes: 49e2e353fb ("ASoC: tas2781: Add Calibration Kcontrols for Chromebook")
Signed-off-by: Shenghao Ding <shenghao-ding@ti.com>
Link: https://patch.msgid.link/20241211043859.1328-1-shenghao-ding@ti.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Currently the stop routine of rswitch driver does not immediately
prevent hardware from continuing to update descriptors and requesting
interrupts.
It can happen that when rswitch_stop() executes the masking of
interrupts from the queues of the port being closed, napi poll for
that port is already scheduled or running on a different CPU. When
execution of this napi poll completes, it will unmask the interrupts.
And unmasked interrupt can fire after rswitch_stop() returns from
napi_disable() call. Then, the handler won't mask it, because
napi_schedule_prep() will return false, and interrupt storm will
happen.
This can't be fixed by making rswitch_stop() call napi_disable() before
masking interrupts. In this case, the interrupt storm will happen if
interrupt fires between napi_disable() and masking.
Fix this by checking for priv->opened_ports bit when unmasking
interrupts after napi poll. For that to be consistent, move
priv->opened_ports changes into spinlock-protected areas, and reorder
other operations in rswitch_open() and rswitch_stop() accordingly.
Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
Reviewed-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Fixes: 3590918b5d ("net: ethernet: renesas: Add support for "Ethernet Switch"")
Link: https://patch.msgid.link/20241209113204.175015-1-nikita.yoush@cogentembedded.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If error path is taken while filling descriptor for a frame, skb
pointer is left in the entry. Later, on the ring entry reuse, the
same entry could be used as a part of a multi-descriptor frame,
and skb for that new frame could be stored in a different entry.
Then, the stale pointer will reach the completion routine, and passed
to the release operation.
Fix that by clearing the saved skb pointer at the error path.
Fixes: d2c96b9d5f ("net: rswitch: Add jumbo frames handling for TX")
Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
Reviewed-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Link: https://patch.msgid.link/20241208095004.69468-4-nikita.yoush@cogentembedded.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If hardware is already transmitting, it can start handling the
descriptor being written to immediately after it observes updated DT
field, before the queue is kicked by a write to GWTRC.
If the start_xmit() execution is preempted at unfortunate moment, this
transmission can complete, and interrupt handled, before gq->cur gets
updated. With the current implementation of completion, this will cause
the last entry not completed.
Fix that by changing completion loop to check DT values directly, instead
of depending on gq->cur.
Fixes: 3590918b5d ("net: ethernet: renesas: Add support for "Ethernet Switch"")
Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
Reviewed-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Link: https://patch.msgid.link/20241208095004.69468-3-nikita.yoush@cogentembedded.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When sending frame split into multiple descriptors, hardware processes
descriptors one by one, including writing back DT values. The first
descriptor could be already marked as completed when processing of
next descriptors for the same frame is still in progress.
Although only the last descriptor is configured to generate interrupt,
completion of the first descriptor could be noticed by the driver when
handling interrupt for the previous frame.
Currently, driver stores skb in the entry that corresponds to the first
descriptor. This results into skb could be unmapped and freed when
hardware did not complete the send yet. This opens a window for
corrupting the data being sent.
Fix this by saving skb in the entry that corresponds to the last
descriptor used to send the frame.
Fixes: d2c96b9d5f ("net: rswitch: Add jumbo frames handling for TX")
Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
Reviewed-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Link: https://patch.msgid.link/20241208095004.69468-2-nikita.yoush@cogentembedded.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The cifs_io_request struct (a wrapper around netfs_io_request) holds open
the file on the server, even beyond the local Linux file being closed.
This can cause problems with Windows-based filesystems as the file's name
still exists after deletion until the file is closed, preventing the parent
directory from being removed and causing spurious test failures in xfstests
due to inability to remove a directory. The symptom looks something like
this in the test output:
rm: cannot remove '/mnt/scratch/test/p0/d3': Directory not empty
rm: cannot remove '/mnt/scratch/test/p1/dc/dae': Directory not empty
Fix this by waiting in unlink and rename for any outstanding I/O requests
to be completed on the target file before removing that file.
Note that this doesn't prevent Linux from trying to start new requests
after deletion if it still has the file open locally - something that's
perfectly acceptable on a UNIX system.
Note also that whilst I've marked this as fixing the commit to make cifs
use netfslib, I don't know that it won't occur before that.
Fixes: 3ee1a1fc39 ("cifs: Cut over to using netfslib")
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Johannes Berg says:
====================
A small set of fixes:
- avoid CSA warnings during link removal
(by changing link bitmap after remove)
- fix # of spatial streams initialisation
- fix queues getting stuck in some CSA cases
and resume failures
- fix interface address when switching monitor mode
- fix MBSS change flags 32-bit stack corruption
- more UBSAN __counted_by "fixes" ...
- fix link ID netlink validation
* tag 'wireless-2024-12-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: cfg80211: sme: init n_channels before channels[] access
wifi: mac80211: fix station NSS capability initialization order
wifi: mac80211: fix vif addr when switching from monitor to station
wifi: mac80211: fix a queue stall in certain cases of CSA
wifi: mac80211: wake the queues in case of failure in resume
wifi: cfg80211: clear link ID from bitmap during link delete after clean up
wifi: mac80211: init cnt before accessing elem in ieee80211_copy_mbssid_beacon
wifi: mac80211: fix mbss changed flags corruption on 32 bit systems
wifi: nl80211: fix NL80211_ATTR_MLO_LINK_ID off-by-one
====================
Link: https://patch.msgid.link/20241210130145.28618-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net.ipv4.nexthop_compat_mode was added when nexthop objects were added to
provide the view of nexthop objects through the usual lens of the route
UAPI. As nexthop objects evolved, the information provided through this
lens became incomplete. For example, details of resilient nexthop groups
are obviously omitted.
Now that 16-bit nexthop group weights are a thing, the 8-bit UAPI cannot
convey the >8-bit weight accurately. Instead of inventing workarounds for
an obsolete interface, just document the expectations of inaccuracy.
Fixes: b72a6a7ab9 ("net: nexthop: Increase weight to u16")
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/b575e32399ccacd09079b2a218255164535123bd.1733740749.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The 5760X (P7) chip's HW GRO/LRO interface is very similar to that of
the previous generation (5750X or P5). However, the aggregation ID
fields in the completion structures on P7 have been redefined from
16 bits to 12 bits. The freed up 4 bits are redefined for part of the
metadata such as the VLAN ID. The aggregation ID mask was not modified
when adding support for P7 chips. Including the extra 4 bits for the
aggregation ID can potentially cause the driver to store or fetch the
packet header of GRO/LRO packets in the wrong TPA buffer. It may hit
the BUG() condition in __skb_pull() because the SKB contains no valid
packet header:
kernel BUG at include/linux/skbuff.h:2766!
Oops: invalid opcode: 0000 1 PREEMPT SMP NOPTI
CPU: 4 UID: 0 PID: 0 Comm: swapper/4 Kdump: loaded Tainted: G OE 6.12.0-rc2+ #7
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: Dell Inc. PowerEdge R760/0VRV9X, BIOS 1.0.1 12/27/2022
RIP: 0010:eth_type_trans+0xda/0x140
Code: 80 00 00 00 eb c1 8b 47 70 2b 47 74 48 8b 97 d0 00 00 00 83 f8 01 7e 1b 48 85 d2 74 06 66 83 3a ff 74 09 b8 00 04 00 00 eb a5 <0f> 0b b8 00 01 00 00 eb 9c 48 85 ff 74 eb 31 f6 b9 02 00 00 00 48
RSP: 0018:ff615003803fcc28 EFLAGS: 00010283
RAX: 00000000000022d2 RBX: 0000000000000003 RCX: ff2e8c25da334040
RDX: 0000000000000040 RSI: ff2e8c25c1ce8000 RDI: ff2e8c25869f9000
RBP: ff2e8c258c31c000 R08: ff2e8c25da334000 R09: 0000000000000001
R10: ff2e8c25da3342c0 R11: ff2e8c25c1ce89c0 R12: ff2e8c258e0990b0
R13: ff2e8c25bb120000 R14: ff2e8c25c1ce89c0 R15: ff2e8c25869f9000
FS: 0000000000000000(0000) GS:ff2e8c34be300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055f05317e4c8 CR3: 000000108bac6006 CR4: 0000000000773ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<IRQ>
? die+0x33/0x90
? do_trap+0xd9/0x100
? eth_type_trans+0xda/0x140
? do_error_trap+0x65/0x80
? eth_type_trans+0xda/0x140
? exc_invalid_op+0x4e/0x70
? eth_type_trans+0xda/0x140
? asm_exc_invalid_op+0x16/0x20
? eth_type_trans+0xda/0x140
bnxt_tpa_end+0x10b/0x6b0 [bnxt_en]
? bnxt_tpa_start+0x195/0x320 [bnxt_en]
bnxt_rx_pkt+0x902/0xd90 [bnxt_en]
? __bnxt_tx_int.constprop.0+0x89/0x300 [bnxt_en]
? kmem_cache_free+0x343/0x440
? __bnxt_tx_int.constprop.0+0x24f/0x300 [bnxt_en]
__bnxt_poll_work+0x193/0x370 [bnxt_en]
bnxt_poll_p5+0x9a/0x300 [bnxt_en]
? try_to_wake_up+0x209/0x670
__napi_poll+0x29/0x1b0
Fix it by redefining the aggregation ID mask for P5_PLUS chips to be
12 bits. This will work because the maximum aggregation ID is less
than 4096 on all P5_PLUS chips.
Fixes: 13d2d3d381 ("bnxt_en: Add new P7 hardware interface definitions")
Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241209015448.1937766-1-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull clk fixes from Stephen Boyd:
"Two reverts and two EN7581 driver fixes:
- Revert the attempt to make CLK_GET_RATE_NOCACHE flag work in
clk_set_rate() because it led to problems with the Qualcomm CPUFreq
driver
- Revert Amlogic reset driver back to the initial implementation.
This broke probe of the audio subsystem on axg based platforms and
also had compilation problems. We'll try again next time.
- Fix a clk frequency and fix array bounds runtime checks in the
Airoha EN7581 driver"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: en7523: Initialize num before accessing hws in en7523_register_clocks()
clk: en7523: Fix wrong BUS clock for EN7581
clk: amlogic: axg-audio: revert reset implementation
Revert "clk: Fix invalid execution of clk_set_rate"
Pull btrfs fixes from David Sterba:
"A few more fixes. Apart from the one liners and updated bio splitting
error handling there's a fix for subvolume mount with different flags.
This was known and fixed for some time but I've delayed it to give it
more testing.
- fix unbalanced locking when swapfile activation fails when the
subvolume gets deleted in the meantime
- add btrfs error handling after bio_split() calls that got error
handling recently
- during unmount, flush delalloc workers at the right time before the
cleaner thread is shut down
- fix regression in buffered write folio conversion, explicitly wait
for writeback as FGP_STABLE flag is currently a no-op on btrfs
- handle race in subvolume mount with different flags, the conversion
to the new mount API did not handle the case where multiple
subvolumes get mounted in parallel, which is a distro use case"
* tag 'for-6.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: flush delalloc workers queue before stopping cleaner kthread during unmount
btrfs: handle bio_split() errors
btrfs: properly wait for writeback before buffered write
btrfs: fix missing snapshot drew unlock when root is dead during swap activation
btrfs: fix mount failure due to remount races
Pull eprobes fix from Masami Hiramatsu:
- release eprobe when failing to add dyn_event.
This unregisters event call and release eprobe when it fails to add a
dynamic event. Found in cleaning up.
* tag 'probes-fixes-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/eprobe: Fix to release eprobe when failed to add dyn_event
Some file systems do not ensure that the single call of iterate_dir
reaches the end of the directory. For example, FUSE fetches entries from
a daemon using 4KB buffer and stops fetching if entries exceed the
buffer. And then an actor of caller, KSMBD, is used to fill the entries
from the buffer.
Thus, pattern searching on FUSE, files located after the 4KB could not
be found and STATUS_NO_SUCH_FILE was returned.
Signed-off-by: Hobin Woo <hobin.woo@samsung.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Tested-by: Yoonho Shin <yoonho.shin@samsung.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
RCD Upstream Port's PCI Express Capability is a component registers
block stored in RCD Upstream Port RCRB. CXL PCI driver helps to map it
during the RCD probing, but mapping failure is allowed for component
registers blocks in CXL PCI driver.
dport->regs.rcd_pcie_cap is used to store the virtual address of the RCD
Upstream Port's PCI Express Capability, add a dport->regs.rcd_pcie_cap
checking in rcd_pcie_cap_emit() just in case user accesses a invalid
address via RCD sysfs.
Fixes: c5eaec79fa ("cxl/pci: Add sysfs attribute for CXL 1.1 device link status")
Signed-off-by: Li Ming <ming.li@zohomail.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20241129132825.569237-1-ming.li@zohomail.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Currently, the pointer stored in call->prog_array is loaded in
__uprobe_perf_func(), with no RCU annotation and no immediately visible
RCU protection, so it looks as if the loaded pointer can immediately be
dangling.
Later, bpf_prog_run_array_uprobe() starts a RCU-trace read-side critical
section, but this is too late. It then uses rcu_dereference_check(), but
this use of rcu_dereference_check() does not actually dereference anything.
Fix it by aligning the semantics to bpf_prog_run_array(): Let the caller
provide rcu_read_lock_trace() protection and then load call->prog_array
with rcu_dereference_check().
This issue seems to be theoretical: I don't know of any way to reach this
code without having handle_swbp() further up the stack, which is already
holding a rcu_read_lock_trace() lock, so where we take
rcu_read_lock_trace() in __uprobe_perf_func()/bpf_prog_run_array_uprobe()
doesn't actually have any effect.
Fixes: 8c7dcb84e3 ("bpf: implement sleepable uprobes by chaining gps")
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241210-bpf-fix-uprobe-uaf-v4-1-5fc8959b2b74@google.com
The bpf_remove_insns() function returns WARN_ON_ONCE(error), where
error is a result of bpf_adj_branches(), and thus should be always 0
However, if for any reason it is not 0, then it will be converted to
boolean by WARN_ON_ONCE and returned to user space as 1, not an actual
error value. Fix this by returning the original err after the WARN check.
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20241210114245.836164-1-aspsk@isovalent.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Eduard Zingerman says:
====================
bpf: track changes_pkt_data property for global functions
Nick Zavaritsky reported [0] a bug in verifier, where the following
unsafe program is not rejected:
__attribute__((__noinline__))
long skb_pull_data(struct __sk_buff *sk, __u32 len)
{
return bpf_skb_pull_data(sk, len);
}
SEC("tc")
int test_invalidate_checks(struct __sk_buff *sk)
{
int *p = (void *)(long)sk->data;
if ((void *)(p + 1) > (void *)(long)sk->data_end) return TCX_DROP;
skb_pull_data(sk, 0);
/* not safe, p is invalid after bpf_skb_pull_data call */
*p = 42;
return TCX_PASS;
}
This happens because verifier does not track package invalidation
effect of global sub-programs.
This patch-set fixes the issue by modifying check_cfg() to compute
whether or not each sub-program calls (directly or indirectly)
helper invalidating packet pointers.
As global functions could be replaced with extension programs,
a new field 'changes_pkt_data' is added to struct bpf_prog_aux.
Verifier only allows replacing functions that do not change packet
data with functions that do not change packet data.
In case if there is a need to a have a global function that does not
change packet data, but allow replacing it with function that does,
the recommendation is to add a noop call to a helper, e.g.:
- for skb do 'bpf_skb_change_proto(skb, 0, 0)';
- for xdp do 'bpf_xdp_adjust_meta(xdp, 0)'.
Functions also can do tail calls. Effects of the tail call cannot be
analyzed before-hand, thus verifier assumes that tail calls always
change packet data.
Changes v1 [1] -> v2:
- added handling of extension programs and tail calls
(thanks, Alexei, for all the input).
[0] https://lore.kernel.org/bpf/0498CA22-5779-4767-9C0C-A9515CEA711F@gmail.com/
[1] https://lore.kernel.org/bpf/20241206040307.568065-1-eddyz87@gmail.com/
====================
Link: https://patch.msgid.link/20241210041100.1898468-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tail-called programs could execute any of the helpers that invalidate
packet pointers. Hence, conservatively assume that each tail call
invalidates packet pointers.
Making the change in bpf_helper_changes_pkt_data() automatically makes
use of check_cfg() logic that computes 'changes_pkt_data' effect for
global sub-programs, such that the following program could be
rejected:
int tail_call(struct __sk_buff *sk)
{
bpf_tail_call_static(sk, &jmp_table, 0);
return 0;
}
SEC("tc")
int not_safe(struct __sk_buff *sk)
{
int *p = (void *)(long)sk->data;
... make p valid ...
tail_call(sk);
*p = 42; /* this is unsafe */
...
}
The tc_bpf2bpf.c:subprog_tc() needs change: mark it as a function that
can invalidate packet pointers. Otherwise, it can't be freplaced with
tailcall_freplace.c:entry_freplace() that does a tail call.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241210041100.1898468-8-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Try different combinations of global functions replacement:
- replace function that changes packet data with one that doesn't;
- replace function that changes packet data with one that does;
- replace function that doesn't change packet data with one that does;
- replace function that doesn't change packet data with one that doesn't;
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241210041100.1898468-7-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When processing calls to global sub-programs, verifier decides whether
to invalidate all packet pointers in current state depending on the
changes_pkt_data property of the global sub-program.
Because of this, an extension program replacing a global sub-program
must be compatible with changes_pkt_data property of the sub-program
being replaced.
This commit:
- adds changes_pkt_data flag to struct bpf_prog_aux:
- this flag is set in check_cfg() for main sub-program;
- in jit_subprogs() for other sub-programs;
- modifies bpf_check_attach_btf_id() to check changes_pkt_data flag;
- moves call to check_attach_btf_id() after the call to check_cfg(),
because it needs changes_pkt_data flag to be set:
bpf_check:
... ...
- check_attach_btf_id resolve_pseudo_ldimm64
resolve_pseudo_ldimm64 --> bpf_prog_is_offloaded
bpf_prog_is_offloaded check_cfg
check_cfg + check_attach_btf_id
... ...
The following fields are set by check_attach_btf_id():
- env->ops
- prog->aux->attach_btf_trace
- prog->aux->attach_func_name
- prog->aux->attach_func_proto
- prog->aux->dst_trampoline
- prog->aux->mod
- prog->aux->saved_dst_attach_type
- prog->aux->saved_dst_prog_type
- prog->expected_attach_type
Neither of these fields are used by resolve_pseudo_ldimm64() or
bpf_prog_offload_verifier_prep() (for netronome and netdevsim
drivers), so the reordering is safe.
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241210041100.1898468-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
When processing calls to certain helpers, verifier invalidates all
packet pointers in a current state. For example, consider the
following program:
__attribute__((__noinline__))
long skb_pull_data(struct __sk_buff *sk, __u32 len)
{
return bpf_skb_pull_data(sk, len);
}
SEC("tc")
int test_invalidate_checks(struct __sk_buff *sk)
{
int *p = (void *)(long)sk->data;
if ((void *)(p + 1) > (void *)(long)sk->data_end) return TCX_DROP;
skb_pull_data(sk, 0);
*p = 42;
return TCX_PASS;
}
After a call to bpf_skb_pull_data() the pointer 'p' can't be used
safely. See function filter.c:bpf_helper_changes_pkt_data() for a list
of such helpers.
At the moment verifier invalidates packet pointers when processing
helper function calls, and does not traverse global sub-programs when
processing calls to global sub-programs. This means that calls to
helpers done from global sub-programs do not invalidate pointers in
the caller state. E.g. the program above is unsafe, but is not
rejected by verifier.
This commit fixes the omission by computing field
bpf_subprog_info->changes_pkt_data for each sub-program before main
verification pass.
changes_pkt_data should be set if:
- subprogram calls helper for which bpf_helper_changes_pkt_data
returns true;
- subprogram calls a global function,
for which bpf_subprog_info->changes_pkt_data should be set.
The verifier.c:check_cfg() pass is modified to compute this
information. The commit relies on depth first instruction traversal
done by check_cfg() and absence of recursive function calls:
- check_cfg() would eventually visit every call to subprogram S in a
state when S is fully explored;
- when S is fully explored:
- every direct helper call within S is explored
(and thus changes_pkt_data is set if needed);
- every call to subprogram S1 called by S was visited with S1 fully
explored (and thus S inherits changes_pkt_data from S1).
The downside of such approach is that dead code elimination is not
taken into account: if a helper call inside global function is dead
because of current configuration, verifier would conservatively assume
that the call occurs for the purpose of the changes_pkt_data
computation.
Reported-by: Nick Zavaritsky <mejedi@gmail.com>
Closes: https://lore.kernel.org/bpf/0498CA22-5779-4767-9C0C-A9515CEA711F@gmail.com/
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241210041100.1898468-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Use BPF helper number instead of function pointer in
bpf_helper_changes_pkt_data(). This would simplify usage of this
function in verifier.c:check_cfg() (in a follow-up patch),
where only helper number is easily available and there is no real need
to lookup helper proto.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241210041100.1898468-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Syzbot reported [1] crash that happens for following tracing scenario:
- create tracepoint perf event with attr.inherit=1, attach it to the
process and set bpf program to it
- attached process forks -> chid creates inherited event
the new child event shares the parent's bpf program and tp_event
(hence prog_array) which is global for tracepoint
- exit both process and its child -> release both events
- first perf_event_detach_bpf_prog call will release tp_event->prog_array
and second perf_event_detach_bpf_prog will crash, because
tp_event->prog_array is NULL
The fix makes sure the perf_event_detach_bpf_prog checks prog_array
is valid before it tries to remove the bpf program from it.
[1] https://lore.kernel.org/bpf/Z1MR6dCIKajNS6nU@krava/T/#m91dbf0688221ec7a7fc95e896a7ef9ff93b0b8ad
Fixes: 0ee288e69d ("bpf,perf: Fix perf_event_detach_bpf_prog error handling")
Reported-by: syzbot+2e0d2840414ce817aaac@syzkaller.appspotmail.com
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241208142507.1207698-1-jolsa@kernel.org
Uprobes always use bpf_prog_run_array_uprobe() under tasks-trace-RCU
protection. But it is possible to attach a non-sleepable BPF program to a
uprobe, and non-sleepable BPF programs are freed via normal RCU (see
__bpf_prog_put_noref()). This leads to UAF of the bpf_prog because a normal
RCU grace period does not imply a tasks-trace-RCU grace period.
Fix it by explicitly waiting for a tasks-trace-RCU grace period after
removing the attachment of a bpf_prog to a perf_event.
Fixes: 8c7dcb84e3 ("bpf: implement sleepable uprobes by chaining gps")
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20241210-bpf-fix-actual-uprobe-uaf-v1-1-19439849dd44@google.com
The original code checks 'if (ARC_CC_AL)', which is always true since
ARC_CC_AL is a constant. This makes the check redundant and likely
obscures the intention of verifying whether the jump is conditional.
Updates the code to check cond == ARC_CC_AL instead, reflecting the intent
to differentiate conditional from unconditional jumps.
Suggested-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Acked-by: Shahab Vahedi <list+bpf@vahedi.org>
Signed-off-by: Hardevsinh Palaniya <hardevsinh.palaniya@siliconsignals.io>
Signed-off-by: Vineet Gupta <vgupta@kernel.org>
snps,dw-apb-gpio-port is deprecated since commit ef42a8da3c
("dt-bindings: gpio: dwapb: Add ngpios property support"). The
respective driver supports this since commit 7569486d79 ("gpio: dwapb:
Add ngpios DT-property support") which is included in Linux v5.10-rc1.
This change was created using
git grep -l snps,nr-gpios arch/arc/boot/dts | xargs perl -p -i -e 's/\bsnps,nr-gpios\b/ngpios/
.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com>
Signed-off-by: Vineet Gupta <vgupta@kernel.org>
Currently, the cast of the first argument to cmpxchg_emu_u8() drops the
__percpu address-space designator, which results in sparse complaints
when applying cmpxchg() to per-CPU variables in ARC. Therefore, use
__force to suppress these complaints, given that this does not pertain
to cmpxchg() semantics, which are plently well-defined on variables in
general, whether per-CPU or otherwise.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409251336.ToC0TvWB-lkp@intel.com/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: <linux-snps-arc@lists.infradead.org>
Signed-off-by: Vineet Gupta <vgupta@kernel.org>
Commit d71e629bed5b ("ARC: build: disallow invalid PAE40 + 4K page config")
reworks the build dependencies for ARC_HAS_PAE40, and accidentally refers
to the non-existing config option MMU_V4 rather than the intended option
ARC_MMU_V4. Note the missing prefix in the name here.
Refer to the intended config option in the dependency of the ARC_HAS_PAE40
config.
Fixes: d71e629bed5b ("ARC: build: disallow invalid PAE40 + 4K page config")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Signed-off-by: Vineet Gupta <vgupta@kernel.org>
The config option being built was
| CONFIG_ARC_MMU_V4=y
| CONFIG_ARC_PAGE_SIZE_4K=y
| CONFIG_HIGHMEM=y
| CONFIG_ARC_HAS_PAE40=y
This was hitting a BUILD_BUG_ON() since a 4K page can't hoist 1k, 8-byte
PTE entries (8 byte due to PAE40). BUILD_BUG_ON() is a good last ditch
resort, but such a config needs to be disallowed explicitly in Kconfig.
Side-note: the actual fix is single liner dependency, but while at it
cleaned out a few things:
- 4K dependency on MMU v3 or v4 is always true, since 288ff7de62
("ARC: retire MMUv1 and MMUv2 support")
- PAE40 dependency in on MMU ver not really ISA, although that follows
eventually.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409160223.xydgucbY-lkp@intel.com/
Signed-off-by: Vineet Gupta <vgupta@kernel.org>
The goal is to clean-up Linux repository from AUX file names, because
the use of such file names is prohibited on other operating systems
such as Windows, so the Linux repository cannot be cloned and
edited on them.
Reviewed-by: Shahab Vahedi <list+bpf@vahedi.org>
Signed-off-by: Benjamin Szőke <egyszeregy@freemail.hu>
Signed-off-by: Vineet Gupta <vgupta@kernel.org>
Consider a sockmap entry being updated with the same socket:
osk = stab->sks[idx];
sock_map_add_link(psock, link, map, &stab->sks[idx]);
stab->sks[idx] = sk;
if (osk)
sock_map_unref(osk, &stab->sks[idx]);
Due to sock_map_unref(), which invokes sock_map_del_link(), all the
psock's links for stab->sks[idx] are torn:
list_for_each_entry_safe(link, tmp, &psock->link, list) {
if (link->link_raw == link_raw) {
...
list_del(&link->list);
sk_psock_free_link(link);
}
}
And that includes the new link sock_map_add_link() added just before
the unref.
This results in a sockmap holding a socket, but without the respective
link. This in turn means that close(sock) won't trigger the cleanup,
i.e. a closed socket will not be automatically removed from the sockmap.
Stop tearing the links when a matching link_raw is found.
Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20241202-sockmap-replace-v1-1-1e88579e7bd5@rbox.co
blkcg_unpin_online() walks up the blkcg hierarchy putting the online pin. To
walk up, it uses blkcg_parent(blkcg) but it was calling that after
blkcg_destroy_blkgs(blkcg) which could free the blkcg, leading to the
following UAF:
==================================================================
BUG: KASAN: slab-use-after-free in blkcg_unpin_online+0x15a/0x270
Read of size 8 at addr ffff8881057678c0 by task kworker/9:1/117
CPU: 9 UID: 0 PID: 117 Comm: kworker/9:1 Not tainted 6.13.0-rc1-work-00182-gb8f52214c61a-dirty #48
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS unknown 02/02/2022
Workqueue: cgwb_release cgwb_release_workfn
Call Trace:
<TASK>
dump_stack_lvl+0x27/0x80
print_report+0x151/0x710
kasan_report+0xc0/0x100
blkcg_unpin_online+0x15a/0x270
cgwb_release_workfn+0x194/0x480
process_scheduled_works+0x71b/0xe20
worker_thread+0x82a/0xbd0
kthread+0x242/0x2c0
ret_from_fork+0x33/0x70
ret_from_fork_asm+0x1a/0x30
</TASK>
...
Freed by task 1944:
kasan_save_track+0x2b/0x70
kasan_save_free_info+0x3c/0x50
__kasan_slab_free+0x33/0x50
kfree+0x10c/0x330
css_free_rwork_fn+0xe6/0xb30
process_scheduled_works+0x71b/0xe20
worker_thread+0x82a/0xbd0
kthread+0x242/0x2c0
ret_from_fork+0x33/0x70
ret_from_fork_asm+0x1a/0x30
Note that the UAF is not easy to trigger as the free path is indirected
behind a couple RCU grace periods and a work item execution. I could only
trigger it with artifical msleep() injected in blkcg_unpin_online().
Fix it by reading the parent pointer before destroying the blkcg's blkg's.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Abagail ren <renzezhongucas@gmail.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Fixes: 4308a434e5 ("blkcg: don't offline parent blkcg first")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Zone write plugging for handling writes to zones of a zoned block
device always execute a zone report whenever a write BIO to a zone
fails. The intent of this is to ensure that the tracking of a zone write
pointer is always correct to ensure that the alignment to a zone write
pointer of write BIOs can be checked on submission and that we can
always correctly emulate zone append operations using regular write
BIOs.
However, this error recovery scheme introduces a potential deadlock if a
device queue freeze is initiated while BIOs are still plugged in a zone
write plug and one of these write operation fails. In such case, the
disk zone write plug error recovery work is scheduled and executes a
report zone. This in turn can result in a request allocation in the
underlying driver to issue the report zones command to the device. But
with the device queue freeze already started, this allocation will
block, preventing the report zone execution and the continuation of the
processing of the plugged BIOs. As plugged BIOs hold a queue usage
reference, the queue freeze itself will never complete, resulting in a
deadlock.
Avoid this problem by completely removing from the zone write plugging
code the use of report zones operations after a failed write operation,
instead relying on the device user to either execute a report zones,
reset the zone, finish the zone, or give up writing to the device (which
is a fairly common pattern for file systems which degrade to read-only
after write failures). This is not an unreasonnable requirement as all
well-behaved applications, FSes and device mapper already use report
zones to recover from write errors whenever possible by comparing the
current position of a zone write pointer with what their assumption
about the position is.
The changes to remove the automatic error recovery are as follows:
- Completely remove the error recovery work and its associated
resources (zone write plug list head, disk error list, and disk
zone_wplugs_work work struct). This also removes the functions
disk_zone_wplug_set_error() and disk_zone_wplug_clear_error().
- Change the BLK_ZONE_WPLUG_ERROR zone write plug flag into
BLK_ZONE_WPLUG_NEED_WP_UPDATE. This new flag is set for a zone write
plug whenever a write opration targetting the zone of the zone write
plug fails. This flag indicates that the zone write pointer offset is
not reliable and that it must be updated when the next report zone,
reset zone, finish zone or disk revalidation is executed.
- Modify blk_zone_write_plug_bio_endio() to set the
BLK_ZONE_WPLUG_NEED_WP_UPDATE flag for the target zone of a failed
write BIO.
- Modify the function disk_zone_wplug_set_wp_offset() to clear this
new flag, thus implementing recovery of a correct write pointer
offset with the reset (all) zone and finish zone operations.
- Modify blkdev_report_zones() to always use the disk_report_zones_cb()
callback so that disk_zone_wplug_sync_wp_offset() can be called for
any zone marked with the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag.
This implements recovery of a correct write pointer offset for zone
write plugs marked with BLK_ZONE_WPLUG_NEED_WP_UPDATE and within
the range of the report zones operation executed by the user.
- Modify blk_revalidate_seq_zone() to call
disk_zone_wplug_sync_wp_offset() for all sequential write required
zones when a zoned block device is revalidated, thus always resolving
any inconsistency between the write pointer offset of zone write
plugs and the actual write pointer position of sequential zones.
Fixes: dd291d77cc ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241209122357.47838-5-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The zone reclaim processing of the dm-zoned device mapper uses
blkdev_issue_zeroout() to align the write pointer of a zone being used
for reclaiming another zone, to write the valid data blocks from the
zone being reclaimed at the same position relative to the zone start in
the reclaim target zone.
The first call to blkdev_issue_zeroout() will try to use hardware
offload using a REQ_OP_WRITE_ZEROES operation if the device reports a
non-zero max_write_zeroes_sectors queue limit. If this operation fails
because of the lack of hardware support, blkdev_issue_zeroout() falls
back to using a regular write operation with the zero-page as buffer.
Currently, such REQ_OP_WRITE_ZEROES failure is automatically handled by
the block layer zone write plugging code which will execute a report
zones operation to ensure that the write pointer of the target zone of
the failed operation has not changed and to "rewind" the zone write
pointer offset of the target zone as it was advanced when the write zero
operation was submitted. So the REQ_OP_WRITE_ZEROES failure does not
cause any issue and blkdev_issue_zeroout() works as expected.
However, since the automatic recovery of zone write pointers by the zone
write plugging code can potentially cause deadlocks with queue freeze
operations, a different recovery must be implemented in preparation for
the removal of zone write plugging report zones based recovery.
Do this by introducing the new function blk_zone_issue_zeroout(). This
function first calls blkdev_issue_zeroout() with the flag
BLKDEV_ZERO_NOFALLBACK to intercept failures on the first execution
which attempt to use the device hardware offload with the
REQ_OP_WRITE_ZEROES operation. If this attempt fails, a report zone
operation is issued to restore the zone write pointer offset of the
target zone to the correct position and blkdev_issue_zeroout() is called
again without the BLKDEV_ZERO_NOFALLBACK flag. The report zones
operation performing this recovery is implemented using the helper
function disk_zone_sync_wp_offset() which calls the gendisk report_zones
file operation with the callback disk_report_zones_cb(). This callback
updates the target write pointer offset of the target zone using the new
function disk_zone_wplug_sync_wp_offset().
dmz_reclaim_align_wp() is modified to change its call to
blkdev_issue_zeroout() to a call to blk_zone_issue_zeroout() without any
other change needed as the two functions are functionnally equivalent.
Fixes: dd291d77cc ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241209122357.47838-4-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are currently any issuer of REQ_OP_ZONE_RESET and
REQ_OP_ZONE_FINISH operations that set REQ_NOWAIT. However, as we cannot
handle this flag correctly due to the potential request allocation
failure that may happen in blk_mq_submit_bio() after blk_zone_plug_bio()
has handled the zone write plug write pointer updates for the targeted
zones, modify blk_zone_wplug_handle_reset_or_finish() to warn if this
flag is set and ignore it.
Fixes: dd291d77cc ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241209122357.47838-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For zoned block devices, a write BIO issued to a zone that has no
on-going writes will be prepared for execution and allowed to execute
immediately by blk_zone_wplug_handle_write() (called from
blk_zone_plug_bio()). However, if this BIO specifies REQ_NOWAIT, the
allocation of a request for its execution in blk_mq_submit_bio() may
fail after blk_zone_plug_bio() completed, marking the target zone of the
BIO as plugged. When this BIO is retried later on, it will be blocked as
the zone write plug of the target zone is in a plugged state without any
on-going write operation (completion of write operations trigger
unplugging of the next write BIOs for a zone). This leads to a BIO that
is stuck in a zone write plug and never completes, which results in
various issues such as hung tasks.
Avoid this problem by always executing REQ_NOWAIT write BIOs using the
BIO work of a zone write plug. This ensure that we never block the BIO
issuer and can thus safely ignore the REQ_NOWAIT flag when executing the
BIO from the zone write plug BIO work.
Since such BIO may be the first write BIO issued to a zone with no
on-going write, modify disk_zone_wplug_add_bio() to schedule the zone
write plug BIO work if the write plug is not already marked with the
BLK_ZONE_WPLUG_PLUGGED flag. This scheduling is otherwise not necessary
as the completion of the on-going write for the zone will schedule the
execution of the next plugged BIOs.
blk_zone_wplug_handle_write() is also fixed to better handle zone write
plug allocation failures for REQ_NOWAIT BIOs by failing a write BIO
using bio_wouldblock_error() instead of bio_io_error().
Reported-by: Bart Van Assche <bvanassche@acm.org>
Fixes: dd291d77cc ("block: Introduce zone write plugging")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241209122357.47838-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When using MES creating a pdd will require talking to the GPU to
setup the relevant context. The code here forgot to wake up the GPU
in case it was in suspend, this causes KVM to EFAULT for passthrough
GPU for example. This issue can be masked if the GPU was woken up by
other things (e.g. opening the KMS node) first and have not yet gone to sleep.
v4: do the allocation of proc_ctx_bo in a lazy fashion
when the first queue is created in a process (Felix)
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Reviewed-by: Yunxiang Li <Yunxiang.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Emitting the cleaner shader must come after the check if a VM switch is
necessary or not.
Otherwise we will emit the cleaner shader every time and not just when it is
necessary because we switched between applications.
This can otherwise crash on gang submit and probably decreases performance
quite a bit.
v2: squash in fix from Srini (Alex)
Signed-off-by: Christian König <christian.koenig@amd.com>
Fixes: ee7a846ea2 ("drm/amdgpu: Emit cleaner shader at end of IB submission")
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
ISP hw_init is not called with the recent changes related
to hw init levels. AMDGPU_INIT_LEVEL_DEFAULT is ignoring
the ISP IP block as AMDGPU_IP_BLK_MASK_ALL is derived using
incorrect max number of IP blocks.
Update AMDGPU_IP_BLK_MASK_ALL to use AMD_IP_BLOCK_TYPE_NUM
instead of AMDGPU_MAX_IP_NUM to fix the issue.
Fixes: 14f2fe34f5 ("drm/amdgpu: Add init levels")
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Pratap Nirujogi <pratap.nirujogi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
In the function pqm_uninit there is a call-assignment of "pdd =
kfd_get_process_device_data" which could be null, and this value was
later dereferenced without checking.
Fixes: fb91065851 ("drm/amdkfd: Refactor queue wptr_bo GART mapping")
Signed-off-by: Andrew Martin <Andrew.Martin@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
The OF match table is not NULL-terminated.
Fix this by adding a sentinel to nmk_i2c_eyeq_match_table[].
Fixes: a0d15cc47f ("i2c: nomadik: switch from of_device_is_compatible() to of_match_device()")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Théo Lebrun <theo.lebrun@bootlin.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Current implementation leaves pdev->dev as a wakeup source. Add a
device_init_wakeup(&pdev->dev, false) call in the .remove() function and
in the error path of the .probe() function.
Signed-off-by: Joe Hattori <joe@pf.is.s.u-tokyo.ac.jp>
Fixes: 527f36f5ef ("mmc: mediatek: add support for SDIO eint wakup IRQ")
Cc: stable@vger.kernel.org
Message-ID: <20241203023442.2434018-1-joe@pf.is.s.u-tokyo.ac.jp>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Since commit f63b94be69 ("i2c: pnx: Fix potential deadlock warning
from del_timer_sync() call in isr") jiffies are stored in
i2c_pnx_algo_data.timeout, but wait_timeout and wait_reset are still
using it as milliseconds. Convert jiffies back to milliseconds to wait
for the expected amount of time.
Fixes: f63b94be69 ("i2c: pnx: Fix potential deadlock warning from del_timer_sync() call in isr")
Signed-off-by: Vladimir Riabchun <ferr.lambarginio@gmail.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
With uncoupling of physical and virtual address spaces population of
the identity mapping was changed to use the type POPULATE_IDENTITY
instead of POPULATE_DIRECT. This breaks DirectMap accounting:
> cat /proc/meminfo
DirectMap4k: 55296 kB
DirectMap1M: 18446744073709496320 kB
Adjust all locations of update_page_count() in vmem.c to use
POPULATE_IDENTITY instead of POPULATE_DIRECT as well. With this
accounting is correct again:
> cat /proc/meminfo
DirectMap4k: 54264 kB
DirectMap1M: 8334336 kB
Fixes: c98d2ecae0 ("s390/mm: Uncouple physical vs virtual address spaces")
Cc: stable@vger.kernel.org
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
After the blamed commit below, udp_rehash() is supposed to be called
with both local and remote addresses set.
Currently that is already the case for IPv6 sockets, but for IPv4 the
destination address is updated after rehashing.
Address the issue moving the destination address and port initialization
before rehashing.
Fixes: 1b29a730ef ("ipv6/udp: Add 4-tuple hash for connected socket")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/4761e466ab9f7542c68cdc95f248987d127044d2.1733499715.git.pabeni@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When drivers access P2SB device resources, it calls p2sb_bar(). Before
the commit 5913320eb0 ("platform/x86: p2sb: Allow p2sb_bar() calls
during PCI device probe"), p2sb_bar() obtained the resources and then
called pci_stop_and_remove_bus_device() for clean up. Then the P2SB
device disappeared. The commit 5913320eb0 introduced the P2SB device
resource cache feature in the boot process. During the resource cache,
pci_stop_and_remove_bus_device() is called for the P2SB device, then the
P2SB device disappears regardless of whether p2sb_bar() is called or
not. Such P2SB device disappearance caused a confusion [1]. To avoid the
confusion, avoid the pci_stop_and_remove_bus_device() call when the BIOS
does not hide the P2SB device.
For that purpose, cache the P2SB device resources only if the BIOS hides
the P2SB device. Call p2sb_scan_and_cache() only if p2sb_hidden_by_bios
is true. This allows removing two branches from p2sb_scan_and_cache().
When p2sb_bar() is called, get the resources from the cache if the P2SB
device is hidden. Otherwise, read the resources from the unhidden P2SB
device.
Reported-by: Daniel Walker (danielwa) <danielwa@cisco.com>
Closes: https://lore.kernel.org/lkml/ZzTI+biIUTvFT6NC@goliath/ [1]
Fixes: 5913320eb0 ("platform/x86: p2sb: Allow p2sb_bar() calls during PCI device probe")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/20241128002836.373745-5-shinichiro.kawasaki@wdc.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
KVM/arm64 fixes for 6.13, part #2
- Fix confusion with implicitly-shifted MDCR_EL2 masks breaking
SPE/TRBE initialization
- Align nested page table walker with the intended memory attribute
combining rules of the architecture
- Prevent userspace from constraining the advertised ASID width,
avoiding horrors of guest TLBIs not matching the intended context in
hardware
- Don't leak references on LPIs when insertion into the translation
cache fails
With the current implementation, when ACP driver fails to read
ACPI _WOV entry then the DMI overrides code won't invoke,
may cause regressions for some BIOS versions.
Add a condition check to jump to check the DMI entries incase of
ACP driver fail to read ACPI _WOV method.
Fixes: 4095cf8720 (ASoC: amd: yc: Fix for enabling DMIC on acp6x via _DSD entry)
Signed-off-by: Venkata Prasad Potturu <venkataprasad.potturu@amd.com>
Link: https://patch.msgid.link/20241210091026.996860-1-venkataprasad.potturu@amd.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Align all line continuations and (sub)section headers in a consistent
way.
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Stafford Horne <shorne@gmail.com>
Since commit 0043ecea23 ("vmlinux.lds.h: Adjust symbol ordering in
text output section"), the exception table in arch/openrisc/kernel/head.S
is no longer positioned at the very beginning of the kernel image, which
causes a boot failure.
Currently, the exception table resides in the regular .text section.
Previously, it was placed at the head by relying on the linker receiving
arch/openrisc/kernel/head.o as the first object. However, this behavior
has changed because sections like .text.{asan,unknown,unlikely,hot} now
precede the regular .text section.
The .head.text section is intended for entry points requiring special
placement. However, in OpenRISC, this section has been misused: instead
of the entry points, it contains boot code meant to be discarded after
booting. This feature is typically handled by the .init.text section.
This commit addresses the issue by replacing the current __HEAD marker
with __INIT and re-annotating the entry points with __HEAD. Additionally,
it adds __REF to entry.S to suppress the following modpost warning:
WARNING: modpost: vmlinux: section mismatch in reference: _tng_kernel_start+0x70 (section: .text) -> _start (section: .init.text)
Fixes: 0043ecea23 ("vmlinux.lds.h: Adjust symbol ordering in text output section")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Closes: https://lore.kernel.org/all/5e032233-5b65-4ad5-ac50-d2eb6c00171c@roeck-us.net/#t
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Rong Xu <xur@google.com>
Signed-off-by: Stafford Horne <shorne@gmail.com>
virtnet_sq_bind_xsk_pool() flushes tx skbs and then resets tx queue, so
DQL counters need to be reset when flushing has actually occurred, Add
virtnet_sq_free_unused_buf_done() as a callback for virtqueue_resize()
to handle this.
Fixes: 21a4e3ce6d ("virtio_net: xsk: bind/unbind xsk for tx")
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When virtqueue_reset() has actually recycled all unused buffers,
additional work may be required in some cases. Relying solely on its
return status is fragile, so introduce a new function argument
'recycle_done', which is invoked when it really occurs.
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
virtnet_tx_resize() flushes remaining tx skbs, requiring DQL counters to
be reset when flushing has actually occurred. Add
virtnet_sq_free_unused_buf_done() as a callback for virtqueue_reset() to
handle this.
Fixes: c8bd1f7f3e ("virtio_net: add support for Byte Queue Limits")
Cc: <stable@vger.kernel.org> # v6.11+
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When virtqueue_resize() has actually recycled all unused buffers,
additional work may be required in some cases. Relying solely on its
return status is fragile, so introduce a new function argument
'recycle_done', which is invoked when the recycle really occurs.
Cc: <stable@vger.kernel.org> # v6.11+
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
While not harmful, using vq2rxq where it's always sq appears odd.
Replace it with the more appropriate vq2txq for clarity and correctness.
Fixes: 89f86675cb ("virtio_net: xsk: tx: support xmit xsk buffer")
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When virtnet_close is followed by virtnet_open, some TX completions can
possibly remain unconsumed, until they are finally processed during the
first NAPI poll after the netdev_tx_reset_queue(), resulting in a crash
[1]. Commit b96ed2c97c ("virtio_net: move netdev_tx_reset_queue() call
before RX napi enable") was not sufficient to eliminate all BQL crash
cases for virtio-net.
This issue can be reproduced with the latest net-next master by running:
`while :; do ip l set DEV down; ip l set DEV up; done` under heavy network
TX load from inside the machine.
netdev_tx_reset_queue() can actually be dropped from virtnet_open path;
the device is not stopped in any case. For BQL core part, it's just like
traffic nearly ceases to exist for some period. For stall detector added
to BQL, even if virtnet_close could somehow lead to some TX completions
delayed for long, followed by virtnet_open, we can just take it as stall
as mentioned in commit 6025b9135f ("net: dqs: add NIC stall detector
based on BQL"). Note also that users can still reset stall_max via sysfs.
So, drop netdev_tx_reset_queue() from virtnet_enable_queue_pair(). This
eliminates the BQL crashes. As a result, netdev_tx_reset_queue() is now
explicitly required in freeze/restore path. This patch adds it to
immediately after free_unused_bufs(), following the rule of thumb:
netdev_tx_reset_queue() should follow any SKB freeing not followed by
netdev_tx_completed_queue(). This seems the most consistent and
streamlined approach, and now netdev_tx_reset_queue() runs whenever
free_unused_bufs() is done.
[1]:
------------[ cut here ]------------
kernel BUG at lib/dynamic_queue_limits.c:99!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 7 UID: 0 PID: 1598 Comm: ip Tainted: G N 6.12.0net-next_main+ #2
Tainted: [N]=TEST
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), \
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
RIP: 0010:dql_completed+0x26b/0x290
Code: b7 c2 49 89 e9 44 89 da 89 c6 4c 89 d7 e8 ed 17 47 00 58 65 ff 0d
4d 27 90 7e 0f 85 fd fe ff ff e8 ea 53 8d ff e9 f3 fe ff ff <0f> 0b 01
d2 44 89 d1 29 d1 ba 00 00 00 00 0f 48 ca e9 28 ff ff ff
RSP: 0018:ffffc900002b0d08 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff888102398c80 RCX: 0000000080190009
RDX: 0000000000000000 RSI: 000000000000006a RDI: 0000000000000000
RBP: ffff888102398c00 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000000ca R11: 0000000000015681 R12: 0000000000000001
R13: ffffc900002b0d68 R14: ffff88811115e000 R15: ffff8881107aca40
FS: 00007f41ded69500(0000) GS:ffff888667dc0000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000556ccc2dc1a0 CR3: 0000000104fd8003 CR4: 0000000000772ef0
PKRU: 55555554
Call Trace:
<IRQ>
? die+0x32/0x80
? do_trap+0xd9/0x100
? dql_completed+0x26b/0x290
? dql_completed+0x26b/0x290
? do_error_trap+0x6d/0xb0
? dql_completed+0x26b/0x290
? exc_invalid_op+0x4c/0x60
? dql_completed+0x26b/0x290
? asm_exc_invalid_op+0x16/0x20
? dql_completed+0x26b/0x290
__free_old_xmit+0xff/0x170 [virtio_net]
free_old_xmit+0x54/0xc0 [virtio_net]
virtnet_poll+0xf4/0xe30 [virtio_net]
? __update_load_avg_cfs_rq+0x264/0x2d0
? update_curr+0x35/0x260
? reweight_entity+0x1be/0x260
__napi_poll.constprop.0+0x28/0x1c0
net_rx_action+0x329/0x420
? enqueue_hrtimer+0x35/0x90
? trace_hardirqs_on+0x1d/0x80
? kvm_sched_clock_read+0xd/0x20
? sched_clock+0xc/0x30
? kvm_sched_clock_read+0xd/0x20
? sched_clock+0xc/0x30
? sched_clock_cpu+0xd/0x1a0
handle_softirqs+0x138/0x3e0
do_softirq.part.0+0x89/0xc0
</IRQ>
<TASK>
__local_bh_enable_ip+0xa7/0xb0
virtnet_open+0xc8/0x310 [virtio_net]
__dev_open+0xfa/0x1b0
__dev_change_flags+0x1de/0x250
dev_change_flags+0x22/0x60
do_setlink.isra.0+0x2df/0x10b0
? rtnetlink_rcv_msg+0x34f/0x3f0
? netlink_rcv_skb+0x54/0x100
? netlink_unicast+0x23e/0x390
? netlink_sendmsg+0x21e/0x490
? ____sys_sendmsg+0x31b/0x350
? avc_has_perm_noaudit+0x67/0xf0
? cred_has_capability.isra.0+0x75/0x110
? __nla_validate_parse+0x5f/0xee0
? __pfx___probestub_irq_enable+0x3/0x10
? __create_object+0x5e/0x90
? security_capable+0x3b/0x70
rtnl_newlink+0x784/0xaf0
? avc_has_perm_noaudit+0x67/0xf0
? cred_has_capability.isra.0+0x75/0x110
? stack_depot_save_flags+0x24/0x6d0
? __pfx_rtnl_newlink+0x10/0x10
rtnetlink_rcv_msg+0x34f/0x3f0
? do_syscall_64+0x6c/0x180
? entry_SYSCALL_64_after_hwframe+0x76/0x7e
? __pfx_rtnetlink_rcv_msg+0x10/0x10
netlink_rcv_skb+0x54/0x100
netlink_unicast+0x23e/0x390
netlink_sendmsg+0x21e/0x490
____sys_sendmsg+0x31b/0x350
? copy_msghdr_from_user+0x6d/0xa0
___sys_sendmsg+0x86/0xd0
? __pte_offset_map+0x17/0x160
? preempt_count_add+0x69/0xa0
? __call_rcu_common.constprop.0+0x147/0x610
? preempt_count_add+0x69/0xa0
? preempt_count_add+0x69/0xa0
? _raw_spin_trylock+0x13/0x60
? trace_hardirqs_on+0x1d/0x80
__sys_sendmsg+0x66/0xc0
do_syscall_64+0x6c/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f41defe5b34
Code: 15 e1 12 0f 00 f7 d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00
f3 0f 1e fa 80 3d 35 95 0f 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
RSP: 002b:00007ffe5336ecc8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f41defe5b34
RDX: 0000000000000000 RSI: 00007ffe5336ed30 RDI: 0000000000000003
RBP: 00007ffe5336eda0 R08: 0000000000000010 R09: 0000000000000001
R10: 00007ffe5336f6f9 R11: 0000000000000202 R12: 0000000000000003
R13: 0000000067452259 R14: 0000556ccc28b040 R15: 0000000000000000
</TASK>
[...]
Fixes: c8bd1f7f3e ("virtio_net: add support for Byte Queue Limits")
Cc: <stable@vger.kernel.org> # v6.11+
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
[ pabeni: trimmed possibly troublesome separator ]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The 'select FB_CORE' statement moved from CONFIG_DRM to DRM_CLIENT_LIB,
but there are now configurations that have code calling into fb_core
as built-in even though the client_lib itself is a loadable module:
x86_64-linux-ld: drivers/gpu/drm/drm_fbdev_shmem.o: in function `drm_fbdev_shmem_driver_fbdev_probe':
drm_fbdev_shmem.c:(.text+0x1fc): undefined reference to `fb_deferred_io_init'
x86_64-linux-ld: drivers/gpu/drm/drm_fbdev_shmem.o: in function `drm_fbdev_shmem_fb_destroy':
drm_fbdev_shmem.c:(.text+0x2e1): undefined reference to `fb_deferred_io_cleanup'
In addition to DRM_CLIENT_LIB, the 'select' needs to be at least in
two more parts, DRM_KMS_HELPER and DRM_GEM_SHMEM_HELPER, so add those
here.
v3:
- Remove FB_CORE from DRM_KMS_HELPER to avoid circular dependency
Fixes: dadd28d414 ("drm/client: Add client-lib module")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20241115162323.3555229-1-arnd@kernel.org
Move setting irq_chip.name from probe() function to the initialization
of "irq_chip" struct in order to fix vGPIO driver crash during bootup.
Crash was caused by unauthorized modification of irq_chip.name field
where irq_chip struct was initialized as const.
This behavior is a consequence of suboptimal implementation of
gpio_irq_chip_set_chip(), which should be changed to avoid
casting away const qualifier.
Crash log:
BUG: unable to handle page fault for address: ffffffffc0ba81c0
/#PF: supervisor write access in kernel mode
/#PF: error_code(0x0003) - permissions violation
CPU: 33 UID: 0 PID: 1075 Comm: systemd-udevd Not tainted 6.12.0-rc6-00077-g2e1b3cc9d7f7 #1
Hardware name: Intel Corporation Kaseyville RP/Kaseyville RP, BIOS KVLDCRB1.PGS.0026.D73.2410081258 10/08/2024
RIP: 0010:gnr_gpio_probe+0x171/0x220 [gpio_graniterapids]
Cc: stable@vger.kernel.org
Signed-off-by: Alan Borzeszkowski <alan.borzeszkowski@linux.intel.com>
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Acked-by: Andy Shevchenko <andy@kernel.org>
Link: https://lore.kernel.org/r/20241204070415.1034449-2-mika.westerberg@linux.intel.com
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
The list domain->dev_list is protected by the domain->lock spinlock.
Any iteration, addition or removal must be under the lock.
Move the list_del() up into the critical section. pdom_is_sva_capable(),
and destroy_gcr3_table() do not interact with the list element.
Wrap the list_add() in a lock, it would make more sense if this was under
the same critical section as adjusting the refcounts earlier, but that
requires more complications.
Fixes: d6b47dec36 ("iommu/amd: Reduce domain lock scope in attach device path")
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/1-v1-3b9edcf8067d+3975-amd_dev_list_locking_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Large kmalloc directly allocates from the page allocator and then use
lruvec_stat_mod_folio() to increment the unreclaimable slab stats for
global and memcg. However when post memcg charging of slab objects was
added in commit 9028cdeb38 ("memcg: add charging of already allocated
slab objects"), it missed to correctly handle the unreclaimable slab
stats for memcg.
One user visisble effect of that bug is that the node level
unreclaimable slab stat will work correctly but the memcg level stat can
underflow as kernel correctly handles the free path but the charge path
missed to increment the memcg level unreclaimable slab stat. Let's fix
by correctly handle in the post charge code path.
Fixes: 9028cdeb38 ("memcg: add charging of already allocated slab objects")
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Intel Panther Lake-M/P has the same integrated Thunderbolt/USB4
controller as Lunar Lake. Add these PCI IDs to the driver list of
supported devices.
Cc: stable@vger.kernel.org
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Offset based on (id * size) is wrong for sqc and cqc.
(*sqc/*cqc + 1) can already offset sizeof(struct(Xqc)) length.
Fixes: 15f112f9ce ("crypto: hisilicon/debugfs - mask the unnecessary info from the dump")
Cc: <stable@vger.kernel.org>
Signed-off-by: Chenghai Huang <huangchenghai2@huawei.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
As virtual addresses in general may not be suitable for DMA, always
perform a copy before using them in an SG list.
Fixes: 1e562deace ("crypto: rsassa-pkcs1 - Migrate to sig_alg backend")
Reported-by: Zorro Lang <zlang@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
lrbp->compl_time_stamp_local_clock is set to zero after sending a sqe
but it is not updated after completing a cqe. Thus the printed
information in ufshcd_print_tr() will always be zero.
Update lrbp->cmpl_time_stamp_local_clock after completing a cqe.
Log sample:
ufshcd-qcom 1d84000.ufshc: UPIU[8] - issue time 8750227249 us
ufshcd-qcom 1d84000.ufshc: UPIU[8] - complete time 0 us
Fixes: c30d8d010b ("scsi: ufs: core: Prepare for completion in MCQ")
Reviewed-by: Bean Huo <beanhuo@micron.com>
Reviewed-by: Peter Wang <peter.wang@mediatek.com>
Signed-off-by: liuderong <liuderong@oppo.com>
Link: https://lore.kernel.org/r/1733470182-220841-1-git-send-email-liuderong@oppo.com
Reviewed-by: Avri Altman <avri.altman@wdc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
The module parameter qcaspi_pluggable controls if QCA7000 signature
should be checked at driver probe (current default) or not. Unfortunately
this could fail in case the chip is temporary in reset, which isn't under
total control by the Linux host. So disable this check per default
in order to avoid unexpected probe failures.
Fixes: 291ab06ecf ("net: qualcomm: new Ethernet over SPI driver for QCA7000")
Signed-off-by: Stefan Wahren <wahrenst@gmx.net>
Link: https://patch.msgid.link/20241206184643.123399-3-wahrenst@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Storing the maximum clock speed in module parameter qcaspi_clkspeed
has the unintended side effect that the first probed instance
defines the value for all other instances. Fix this issue by storing
it in max_speed_hz of the relevant SPI device.
This fix keeps the priority of the speed parameter (module parameter,
device tree property, driver default). Btw this uses the opportunity
to get the rid of the unused member clkspeed.
Fixes: 291ab06ecf ("net: qualcomm: new Ethernet over SPI driver for QCA7000")
Signed-off-by: Stefan Wahren <wahrenst@gmx.net>
Link: https://patch.msgid.link/20241206184643.123399-2-wahrenst@gmx.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
t4_set_vf_mac_acl() uses pf to set mac addr, but t4vf_get_vf_mac_acl()
uses port number to get mac addr, this leads to error when an attempt
to set MAC address on VF's of PF2 and PF3.
This patch fixes the issue by using port number to set mac address.
Fixes: e0cdac65ba ("cxgb4vf: configure ports accessible by the VF")
Signed-off-by: Anumula Murali Mohan Reddy <anumula@chelsio.com>
Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20241206062014.49414-1-anumula@chelsio.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Each `bindgen` release may upgrade the list of Rust targets. For instance,
currently, in their master branch [1], the latest ones are:
Nightly => {
vectorcall_abi: #124485,
ptr_metadata: #81513,
layout_for_ptr: #69835,
},
Stable_1_77(77) => { offset_of: #106655 },
Stable_1_73(73) => { thiscall_abi: #42202 },
Stable_1_71(71) => { c_unwind_abi: #106075 },
Stable_1_68(68) => { abi_efiapi: #105795 },
By default, the highest stable release in their list is used, and users
are expected to set one if they need to support older Rust versions
(e.g. see [2]).
Thus, over time, new Rust features are used by default, and at some
point, it is likely that `bindgen` will emit Rust code that requires a
Rust version higher than our minimum (or perhaps enabling an unstable
feature). Currently, there is no problem because the maximum they have,
as seen above, is Rust 1.77.0, and our current minimum is Rust 1.78.0.
Therefore, set a Rust target explicitly now to prevent going forward in
time too much and thus getting potential build failures at some point.
Since we also support a minimum `bindgen` version, and since `bindgen`
does not support passing unknown Rust target versions, we need to use
the list of our minimum `bindgen` version, rather than the latest. So,
since `bindgen` 0.65.1 had this list [3], we need to use Rust 1.68.0:
/// Rust stable 1.64
/// * `core_ffi_c` ([Tracking issue](https://github.com/rust-lang/rust/issues/94501))
=> Stable_1_64 => 1.64;
/// Rust stable 1.68
/// * `abi_efiapi` calling convention ([Tracking issue](https://github.com/rust-lang/rust/issues/65815))
=> Stable_1_68 => 1.68;
/// Nightly rust
/// * `thiscall` calling convention ([Tracking issue](https://github.com/rust-lang/rust/issues/42202))
/// * `vectorcall` calling convention (no tracking issue)
/// * `c_unwind` calling convention ([Tracking issue](https://github.com/rust-lang/rust/issues/74990))
=> Nightly => nightly;
...
/// Latest stable release of Rust
pub const LATEST_STABLE_RUST: RustTarget = RustTarget::Stable_1_68;
Thus add the `--rust-target 1.68` parameter. Add a comment as well
explaining this.
An alternative would be to use the currently running (i.e. actual) `rustc`
and `bindgen` versions to pick a "better" Rust target version. However,
that would introduce more moving parts depending on the user setup and
is also more complex to implement.
Starting with `bindgen` 0.71.0 [4], we will be able to set any future
Rust version instead, i.e. we will be able to set here our minimum
supported Rust version. Christian implemented it [5] after seeing this
patch. Thanks!
Cc: Christian Poveda <git@pvdrz.com>
Cc: Emilio Cobos Álvarez <emilio@crisal.io>
Cc: stable@vger.kernel.org # needed for 6.12.y; unneeded for 6.6.y; do not apply to 6.1.y
Fixes: c844fa64a2 ("rust: start supporting several `bindgen` versions")
Link: 21c60f473f/bindgen/features.rs (L97-L105) [1]
Link: https://github.com/rust-lang/rust-bindgen/issues/2960 [2]
Link: 7d243056d3/bindgen/features.rs (L131-L150) [3]
Link: https://github.com/rust-lang/rust-bindgen/blob/main/CHANGELOG.md#0710-2024-12-06 [4]
Link: https://github.com/rust-lang/rust-bindgen/pull/2993 [5]
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Link: https://lore.kernel.org/r/20241123180323.255997-1-ojeda@kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
During boot some of the calls to tegra241_cmdqv_get_cmdq() will happen
in preemptible context. As this function calls smp_processor_id(), if
CONFIG_DEBUG_PREEMPT is enabled, these calls will trigger a series of
"BUG: using smp_processor_id() in preemptible" backtraces.
As tegra241_cmdqv_get_cmdq() only calls smp_processor_id() to use the
CPU number as a factor to balance out traffic on cmdq usage, it is safe
to use raw_smp_processor_id() here.
Cc: <stable@vger.kernel.org>
Fixes: 918eb5c856 ("iommu/arm-smmu-v3: Add in-kernel support for NVIDIA Tegra241 (Grace) CMDQV")
Signed-off-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/Z1L1mja3nXzsJ0Pk@uudg.org
Signed-off-by: Will Deacon <will@kernel.org>
Clippy in the upcoming Rust 1.83.0 spots a spurious empty line since the
`clippy::empty_line_after_doc_comments` warning is now enabled by default
given it is part of the `suspicious` group [1]:
error: empty line after doc comment
--> drivers/gpu/drm/drm_panic_qr.rs:931:1
|
931 | / /// They must remain valid for the duration of the function call.
932 | |
| |_
933 | #[no_mangle]
934 | / pub unsafe extern "C" fn drm_panic_qr_generate(
935 | | url: *const i8,
936 | | data: *mut u8,
937 | | data_len: usize,
... |
940 | | tmp_size: usize,
941 | | ) -> u8 {
| |_______- the comment documents this function
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#empty_line_after_doc_comments
= note: `-D clippy::empty-line-after-doc-comments` implied by `-D warnings`
= help: to override `-D warnings` add `#[allow(clippy::empty_line_after_doc_comments)]`
= help: if the empty line is unintentional remove it
Thus remove the empty line.
Cc: stable@vger.kernel.org
Fixes: cb5164ac43 ("drm/panic: Add a QR code panic screen")
Link: https://github.com/rust-lang/rust-clippy/pull/13091 [1]
Reviewed-by: Jocelyn Falempe <jfalempe@redhat.com>
Link: https://lore.kernel.org/r/20241125233332.697497-1-ojeda@kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
The temp directory is made and a known fake hwmon PMU created within
it. Prior to this fix the events were being incorrectly written to the
temp directory rather than the fake PMU directory. This didn't impact
the test as the directory fd matched the wrong location, but it
doesn't mirror what a hwmon PMU would actually look like.
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Link: https://lore.kernel.org/r/20241206042306.1055913-2-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The hwmon PMU test will make a temp directory, open the directory with
O_DIRECTORY then fill it with contents. As the open is before the
filling the contents the later fdopendir may reflect the initial empty
state, meaning no events are seen. Change to re-open the directory,
rather than dup the fd, so the latest contents are seen.
Minor tweaks/additions to debug messages.
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Link: https://lore.kernel.org/r/20241206042306.1055913-1-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The comparison function cmp_profile_data() violates the C standard's
requirements for qsort() comparison functions, which mandate symmetry
and transitivity:
* Symmetry: If x < y, then y > x.
* Transitivity: If x < y and y < z, then x < z.
When v1 and v2 are equal, the function incorrectly returns 1, breaking
symmetry and transitivity. This causes undefined behavior, which can
lead to memory corruption in certain versions of glibc [1].
Fix the issue by returning 0 when v1 and v2 are equal, ensuring
compliance with the C standard and preventing undefined behavior.
Link: https://www.qualys.com/2024/01/30/qsort.txt [1]
Fixes: 0f223813ed ("perf ftrace: Add 'profile' command")
Fixes: 74ae366c37 ("perf ftrace profile: Add -s/--sort option")
Cc: stable@vger.kernel.org
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: jserv@ccns.ncku.edu.tw
Cc: chuang@cs.nycu.edu.tw
Link: https://lore.kernel.org/r/20241209134226.1939163-1-visitorckw@gmail.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
utf8s_to_utf16s() specifies pwcs as a wchar_t pointer (whether big endian
or little endian is passed in as an additional parm), so to remove a
distracting compile warning it needs to be cast as (wchar_t *) in
parse_reparse_wsl_symlink() as done by other callers.
Fixes: 06a7adf318 ("cifs: Add support for parsing WSL-style symlinks")
Reviewed-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
In acpi_decode_space() addr->info.mem.caching is checked on main level
for any resource type but addr->info.mem is part of union and thus
valid only if the resource type is memory range.
Move the check inside the preceeding switch/case to only execute it
when the union is of correct type.
Fixes: fcb29bbcd5 ("ACPI: Add prefetch decoding to the address space parser")
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Link: https://patch.msgid.link/20241202100614.20731-1-ilpo.jarvinen@linux.intel.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
hv_kvp_daemon uses popen(3) and system(3) as convinience helper to
launch external helpers. These helpers are invoked via a
temporary shell process. There is no need to keep this temporary
process around while the helper runs. Replace this temporary shell
with the actual helper process via 'exec'.
Signed-off-by: Olaf Hering <olaf@aepfle.de>
Link: https://lore.kernel.org/linux-hyperv/20241202123520.27812-1-olaf@aepfle.de/
Signed-off-by: Wei Liu <wei.liu@kernel.org>
The reference implementation of hv_get_dns_info which is in the tree uses
/etc/resolv.conf to get DNS servers and this does not require to know which
NIC is queried. Distro specific implementations, however, may want to
provide per-NIC, fine grained information. E.g. NetworkManager keeps track
of DNS servers per connection.
Similar to hv_get_dhcp_info, pass NIC name as a parameter to
hv_get_dns_info script.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20241112150401.217094-1-vkuznets@redhat.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20241112150401.217094-1-vkuznets@redhat.com>
If the KVP (or VSS) daemon starts before the VMBus channel's ringbuffer is
fully initialized, we can hit the panic below:
hv_utils: Registering HyperV Utility Driver
hv_vmbus: registering driver hv_utils
...
BUG: kernel NULL pointer dereference, address: 0000000000000000
CPU: 44 UID: 0 PID: 2552 Comm: hv_kvp_daemon Tainted: G E 6.11.0-rc3+ #1
RIP: 0010:hv_pkt_iter_first+0x12/0xd0
Call Trace:
...
vmbus_recvpacket
hv_kvp_onchannelcallback
vmbus_on_event
tasklet_action_common
tasklet_action
handle_softirqs
irq_exit_rcu
sysvec_hyperv_stimer0
</IRQ>
<TASK>
asm_sysvec_hyperv_stimer0
...
kvp_register_done
hvt_op_read
vfs_read
ksys_read
__x64_sys_read
This can happen because the KVP/VSS channel callback can be invoked
even before the channel is fully opened:
1) as soon as hv_kvp_init() -> hvutil_transport_init() creates
/dev/vmbus/hv_kvp, the kvp daemon can open the device file immediately and
register itself to the driver by writing a message KVP_OP_REGISTER1 to the
file (which is handled by kvp_on_msg() ->kvp_handle_handshake()) and
reading the file for the driver's response, which is handled by
hvt_op_read(), which calls hvt->on_read(), i.e. kvp_register_done().
2) the problem with kvp_register_done() is that it can cause the
channel callback to be called even before the channel is fully opened,
and when the channel callback is starting to run, util_probe()->
vmbus_open() may have not initialized the ringbuffer yet, so the
callback can hit the panic of NULL pointer dereference.
To reproduce the panic consistently, we can add a "ssleep(10)" for KVP in
__vmbus_open(), just before the first hv_ringbuffer_init(), and then we
unload and reload the driver hv_utils, and run the daemon manually within
the 10 seconds.
Fix the panic by reordering the steps in util_probe() so the char dev
entry used by the KVP or VSS daemon is not created until after
vmbus_open() has completed. This reordering prevents the race condition
from happening.
Reported-by: Dexuan Cui <decui@microsoft.com>
Fixes: e0fa3e5e7d ("Drivers: hv: utils: fix a race on userspace daemons registration")
Cc: stable@vger.kernel.org
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20241106154247.2271-3-mhklinux@outlook.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20241106154247.2271-3-mhklinux@outlook.com>
If the util_init function call in util_probe() returns an error code,
util_probe() always return ENODEV, and the error code from the util_init
function is lost. The error message output in the caller, vmbus_probe(),
doesn't show the real error code.
Fix this by just returning the error code from the util_init function.
There doesn't seem to be a reason to force ENODEV, as other errors
such as ENOMEM can already be returned from util_probe(). And the
code in call_driver_probe() implies that ENODEV should mean that a
matching driver wasn't found, which is not the case here.
Suggested-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20241106154247.2271-2-mhklinux@outlook.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20241106154247.2271-2-mhklinux@outlook.com>
read_hv_sched_clock_tsc() assumes that the Hyper-V clock counter is
bigger than the variable hv_sched_clock_offset, which is cached during
early boot, but depending on the timing this assumption may be false
when a hibernated VM starts again (the clock counter starts from 0
again) and is resuming back (Note: hv_init_tsc_clocksource() is not
called during hibernation/resume); consequently,
read_hv_sched_clock_tsc() may return a negative integer (which is
interpreted as a huge positive integer since the return type is u64)
and new kernel messages are prefixed with huge timestamps before
read_hv_sched_clock_tsc() grows big enough (which typically takes
several seconds).
Fix the issue by saving the Hyper-V clock counter just before the
suspend, and using it to correct the hv_sched_clock_offset in
resume. This makes hv tsc page based sched_clock continuous and ensures
that post resume, it starts from where it left off during suspend.
Override x86_platform.save_sched_clock_state and
x86_platform.restore_sched_clock_state routines to correct this as soon
as possible.
Note: if Invariant TSC is available, the issue doesn't happen because
1) we don't register read_hv_sched_clock_tsc() for sched clock:
See commit e5313f1c54 ("clocksource/drivers/hyper-v: Rework
clocksource and sched clock setup");
2) the common x86 code adjusts TSC similarly: see
__restore_processor_state() -> tsc_verify_tsc_adjust(true) and
x86_platform.restore_sched_clock_state().
Cc: stable@vger.kernel.org
Fixes: 1349401ff1 ("clocksource/drivers/hyper-v: Suspend/resume Hyper-V clocksource for hibernation")
Co-developed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20240917053917.76787-1-namjain@linux.microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Message-ID: <20240917053917.76787-1-namjain@linux.microsoft.com>
Pull locking fixes from Borislav Petkov:
- Remove if_not_guard() as it is generating incorrect code
- Fix the initialization of the fake lockdep_map for the first locked
ww_mutex
* tag 'locking_urgent_for_v6.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
headers/cleanup.h: Remove the if_not_guard() facility
locking/ww_mutex: Fix ww_mutex dummy lockdep map selftest warnings
Pull x86 perf fixes from Borislav Petkov:
- Make sure the PEBS buffer is drained before reconfiguring the
hardware
- Add Arrow Lake U support
* tag 'perf_urgent_for_v6.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/ds: Unconditionally drain PEBS DS when changing PEBS_DATA_CFG
perf/x86/intel: Add Arrow Lake U support
Pull scheduler fixes from Borislav Petkov:
- Remove wrong enqueueing of a task for a later wakeup when a task
blocks on a RT mutex
- Do not setup a new deadline entity on a boosted task as that has
happened already
- Update preempt= kernel command line param
- Prevent needless softirqd wakeups in the idle task's context
- Detect the case where the idle load balancer CPU becomes busy and
avoid unnecessary load balancing invocation
- Remove an unnecessary load balancing need_resched() call in
nohz_csd_func()
- Allow for raising of SCHED_SOFTIRQ softirq type on RT but retain the
warning to catch any other cases
- Remove a wrong warning when a cpuset update makes the task affinity
no longer a subset of the cpuset
* tag 'sched_urgent_for_v6.13_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking: rtmutex: Fix wake_q logic in task_blocks_on_rt_mutex
sched/deadline: Fix warning in migrate_enable for boosted tasks
sched/core: Update kernel boot parameters for LAZY preempt.
sched/core: Prevent wakeup of ksoftirqd during idle load balance
sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy
sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()
softirq: Allow raising SCHED_SOFTIRQ from SMP-call-function on RT kernel
sched: fix warning in sched_setaffinity
sched/deadline: Fix replenish_dl_new_period dl_server condition
Build 6.13-rc12 for x86_64 with gcc 14.2.1 fails with the error:
ld: vmlinux.o: in function `virtual_mapped':
linux/arch/x86/kernel/relocate_kernel_64.S:249:(.text+0x5915b): undefined reference to `saved_context_gdt_desc'
when CONFIG_KEXEC_JUMP is enabled.
This was introduced by commit 07fa619f2a ("x86/kexec: Restore GDT on
return from ::preserve_context kexec") which introduced a use of
saved_context_gdt_desc without a declaration for it.
Fix that by including asm/asm-offsets.h where saved_context_gdt_desc
is defined (indirectly in include/generated/asm-offsets.h which
asm/asm-offsets.h includes).
Fixes: 07fa619f2a ("x86/kexec: Restore GDT on return from ::preserve_context kexec")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Closes: https://lore.kernel.org/oe-kbuild-all/202411270006.ZyyzpYf8-lkp@intel.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The powerpc user access code is special, and unlike other architectures
distinguishes between user access for reading and writing.
And commit 43a43faf53 ("futex: improve user space accesses") messed
that up. It went undetected elsewhere, but caused ppc32 to fail early
during boot, because the user access had been started with
user_read_access_begin(), but then finished off with just a plain
"user_access_end()".
Note that the address-masking user access helpers don't even have that
read-vs-write distinction, so if powerpc ever wants to do address
masking tricks, we'll have to do some extra work for it.
[ Make sure to also do it for the EFAULT case, as pointed out by
Christophe Leroy ]
Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Link: https://lore.kernel.org/all/87bjxl6b0i.fsf@igel.home/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On port initialization, we configure the maximum frame length accepted
by the receive module associated with the port. This value is currently
written to the MAX_LEN field of the DEV10G_MAC_ENA_CFG register, when in
fact, it should be written to the DEV10G_MAC_MAXLEN_CFG register. Fix
this.
Fixes: 946e7fd505 ("net: sparx5: add port module support")
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When doing port mirroring, the physical port to send the frame to, is
written to the FRMC_PORT_VAL field of the QFWD_FRAME_COPY_CFG register.
This field is 7 bits wide on sparx5 and 6 bits wide on lan969x, and has
a default value of 65 and 30, respectively (the number of front ports).
On mirror deletion, we set the default value of the monitor port to
65 for this field, in case no more ports exists for the mirror. Needless
to say, this will not fit the 6 bits on lan969x.
Fix this by correctly using the n_ports constant instead.
Fixes: 3f9e46347a ("net: sparx5: use SPX5_CONST for constants which already have a symbol")
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The FDMA handler is responsible for scheduling a NAPI poll, which will
eventually fetch RX packets from the FDMA queue. Currently, the FDMA
handler is run in a threaded context. For some reason, this kills
performance. Admittedly, I did not do a thorough investigation to see
exactly what causes the issue, however, I noticed that in the other
driver utilizing the same FDMA engine, we run the FDMA handler in hard
IRQ context.
Fix this performance issue, by running the FDMA handler in hard IRQ
context, not deferring any work to a thread.
Prior to this change, the RX UDP performance was:
Interval Transfer Bitrate Jitter
0.00-10.20 sec 44.6 MBytes 36.7 Mbits/sec 0.027 ms
After this change, the rx UDP performance is:
Interval Transfer Bitrate Jitter
0.00-9.12 sec 1.01 GBytes 953 Mbits/sec 0.020 ms
Fixes: 10615907e9 ("net: sparx5: switchdev: adding frame DMA functionality")
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Depmod reports a cyclic dependency between modules sparx5-switch.ko and
lan969x-switch.ko:
depmod: ERROR: Cycle detected: lan969x_switch -> sparx5_switch -> lan969x_switch
depmod: ERROR: Found 2 modules in dependency cycles!
make[2]: *** [scripts/Makefile.modinst:132: depmod] Error 1
make: *** [Makefile:224: __sub-make] Error 2
This makes sense, as they both require symbols from each other.
Fix this by compiling lan969x support into the sparx5-switch.ko module.
In order to do this, in a sensible way, we move the lan969x/ dir into
the sparx5/ dir and do some code cleanup of code that is no longer
required.
After this patch, depmod will no longer complain, as lan969x support is
compiled into the sparx5-swicth.ko module, and can no longer be compiled
as a standalone module.
Fixes: 98a0111960 ("net: sparx5: add compatible string for lan969x")
Signed-off-by: Daniel Machon <daniel.machon@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
AXP717 datasheet says that regulator ramp delay is 15.625 us/step,
which is 10mV in our case.
Add a AXP_DESC_RANGES_DELAY macro and update AXP_DESC_RANGES macro to
expand to AXP_DESC_RANGES_DELAY with ramp_delay = 0
For DCDC4, steps is 100mv
Add a AXP_DESC_DELAY macro and update AXP_DESC macro to
expand to AXP_DESC_DELAY with ramp_delay = 0
This patch fix crashes when using CPU DVFS.
Signed-off-by: Philippe Simons <simons.philippe@gmail.com>
Tested-by: Hironori KIKUCHI <kikuchan98@gmail.com>
Tested-by: Chris Morgan <macromorgan@hotmail.com>
Reviewed-by: Chen-Yu Tsai <wens@csie.org>
Fixes: d2ac3df75c ("regulator: axp20x: add support for the AXP717")
Link: https://patch.msgid.link/20241208124308.5630-1-simons.philippe@gmail.com
Signed-off-by: Mark Brown <broonie@kernel.org>
DSB LUT register writes vs. palette anti-collision logic
appear to interact in interesting ways:
- posted DSB writes simply vanish into thin air while
anti-collision is active
- non-posted DSB writes actually get blocked by the anti-collision
logic, but unfortunately this ends up hogging the bus for
long enough that unrelated parallel CPU MMIO accesses start
to disappear instead
Even though we are updating the LUT during vblank we aren't
immune to the anti-collision logic because it kicks in briefly
for pipe prefill (initiated at frame start). The safe time
window for performing the LUT update is thus between the
undelayed vblank and frame start. Turns out that with low
enough CDCLK frequency (DSB execution speed depends on CDCLK)
we can exceed that.
As we are currently using non-posted writes for the legacy LUT
updates, in which case we can hit the far more severe failure
mode. The problem is exacerbated by the fact that non-posted
writes are much slower than posted writes (~4x it seems).
To mititage the problem let's switch to using posted DSB
writes for legacy LUT updates (which will involve using the
double write approach to avoid other problems with DSB
vs. legacy LUT writes). Despite writing each register twice
this will in fact make the legacy LUT update faster when
compared to the non-posted write approach, making the
problem less likely to appear. The failure mode is also
less severe.
This isn't the 100% solution we need though. That will involve
estimating how long the LUT update will take, and pushing
frame start and/or delayed vblank forward to guarantee that
the update will have finished by the time the pipe prefill
starts...
Cc: stable@vger.kernel.org
Fixes: 34d8311f4a ("drm/i915/dsb: Re-instate DSB for LUT updates")
Fixes: 25ea3411bd ("drm/i915/dsb: Use non-posted register writes for legacy LUT")
Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12494
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241120164123.12706-3-ville.syrjala@linux.intel.com
Reviewed-by: Uma Shankar <uma.shankar@intel.com>
(cherry picked from commit 2504a316b35d49522f39cf0dc01830d7c36a9be4)
Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net>
Turns out the DSB indexed register write command has
rather significant initial overhead compared to the normal
MMIO write command. Based on some quick experiments on TGL
you have to write the register at least ~5 times for the
indexed write command to come out ahead. If you write the
register less times than that the MMIO write is faster.
So it seems my automagic indexed write logic was a bit
misguided. Go back to the original approach only use
indexed writes for the cases we know will benefit from
it (indexed LUT register updates).
Currently we shouldn't have any cases where this truly
matters (just some rare double writes to the precision
LUT index registers), but we will need to switch the
legacy LUT updates to write each LUT register twice (to
avoid some palette anti-collision logic troubles).
This would be close to the worst case for using indexed
writes (two writes per register, and 256 separate registers).
Using the MMIO write command should shave off around 30%
of the execution time compared to using the indexed write
command.
Cc: stable@vger.kernel.org
Fixes: 34d8311f4a ("drm/i915/dsb: Re-instate DSB for LUT updates")
Fixes: 25ea3411bd ("drm/i915/dsb: Use non-posted register writes for legacy LUT")
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241120164123.12706-2-ville.syrjala@linux.intel.com
Reviewed-by: Uma Shankar <uma.shankar@intel.com>
(cherry picked from commit ecba559a88ab8399a41893d7828caf4dccbeab6c)
Signed-off-by: Tvrtko Ursulin <tursulin@ursulin.net>
UAC 2 & 3 DAC's set bit 31 of the format to signal support for a
RAW_DATA type, typically used for DSD playback.
This is correctly tested by (format & UAC*_FORMAT_TYPE_I_RAW_DATA),
fp->dsd_raw = true; and call snd_usb_interface_dsd_format_quirks(),
however a confusing and unnecessary message gets printed because
the bit is not properly tested in the last "unsupported" if test:
if (format & ~0x3F) { ... }
For example the output:
usb 7-1: new high-speed USB device number 5 using xhci_hcd
usb 7-1: New USB device found, idVendor=262a, idProduct=9302, bcdDevice=0.01
usb 7-1: New USB device strings: Mfr=1, Product=2, SerialNumber=6
usb 7-1: Product: TC44C
usb 7-1: Manufacturer: TC44C
usb 7-1: SerialNumber: 5000000001
hid-generic 0003:262A:9302.001E: No inputs registered, leaving
hid-generic 0003:262A:9302.001E: hidraw6: USB HID v1.00 Device [DDHIFI TC44C] on usb-0000:08:00.3-1/input0
usb 7-1: 2:4 : unsupported format bits 0x100000000
This last "unsupported format" is actually wrong: we know the
format is a RAW_DATA which we assume is DSD, so there is no need
to print the confusing message.
This we unset bit 31 of the format after recognizing it, to avoid
the message.
Suggested-by: Takashi Iwai <tiwai@suse.com>
Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com>
Link: https://patch.msgid.link/20241209090529.16134-2-adrian.ratiu@collabora.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
When looking up a non-existent file, efivarfs returns -EINVAL if the
file does not conform to the NAME-GUID format and -ENOENT if it does.
This is caused by efivars_d_hash() returning -EINVAL if the name is not
formatted correctly. This error is returned before simple_lookup()
returns a negative dentry, and is the error value that the user sees.
Fix by removing this check. If the file does not exist, simple_lookup()
will return a negative dentry leading to -ENOENT and efivarfs_create()
already has a validity check before it creates an entry (and will
correctly return -EINVAL)
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: <stable@vger.kernel.org>
[ardb: make efivarfs_valid_name() static]
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Introduces the alc2xx-fixup-headset-mic model to simplify enabling
headset microphones on ALC2XX codecs.
Many recent configurations, as well as older systems that lacked this
fix for a long time, leave headset microphones inactive by default.
This addition provides a flexible workaround using the existing
ALC2XX_FIXUP_HEADSET_MIC quirk.
Signed-off-by: Vasiliy Kovalev <kovalev@altlinux.org>
Link: https://patch.msgid.link/20241207201836.6879-1-kovalev@altlinux.org
Signed-off-by: Takashi Iwai <tiwai@suse.de>
CA0132 used the PCI SSID lookup helper that doesn't support the model
string matching or quirk aliasing.
Replace it with the standard HD-audio quirk helpers for supporting
those, and add the definition of the model strings for supported
quirks, too. There should be no visible change to the outside for the
working system, but the driver will parse the model option and apply
the quirk based on it from now on.
Link: https://patch.msgid.link/20241207133754.3658-1-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
The OF node reference obtained by of_parse_phandle_with_args() is not
released on early return. Add a of_node_put() call before returning.
Fixes: 8996b89d6b ("ata: add platform driver for Calxeda AHCI controller")
Signed-off-by: Joe Hattori <joe@pf.is.s.u-tokyo.ac.jp>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Vladimir Oltean says:
====================
Ocelot PTP fixes
This is another attempt at "net: mscc: ocelot: be resilient to loss of
PTP packets during transmission".
https://lore.kernel.org/netdev/20241203164755.16115-1-vladimir.oltean@nxp.com/
The central change is in patch 4/5. It recovers a port from a very
reproducible condition where previously, it would become unable to take
any hardware TX timestamp.
Then we have patches 1/5 and 5/5 which are extra bug fixes.
Patches 2/5 and 3/5 are logical changes split out of the monolithic
patch previously submitted as v1, so that patch 4/5 is hopefully easier
to follow.
====================
Link: https://patch.msgid.link/20241205145519.1236778-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
An unsupported RX filter will leave the port with TX timestamping still
applied as per the new request, rather than the old setting. When
parsing the tx_type, don't apply it just yet, but delay that until after
we've parsed the rx_filter as well (and potentially returned -ERANGE for
that).
Similarly, copy_to_user() may fail, which is a rare occurrence, but
should still be treated by unwinding what was done.
Fixes: 96ca08c058 ("net: mscc: ocelot: set up traps for PTP packets")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241205145519.1236778-6-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The Felix DSA driver presents unique challenges that make the simplistic
ocelot PTP TX timestamping procedure unreliable: any transmitted packet
may be lost in hardware before it ever leaves our local system.
This may happen because there is congestion on the DSA conduit, the
switch CPU port or even user port (Qdiscs like taprio may delay packets
indefinitely by design).
The technical problem is that the kernel, i.e. ocelot_port_add_txtstamp_skb(),
runs out of timestamp IDs eventually, because it never detects that
packets are lost, and keeps the IDs of the lost packets on hold
indefinitely. The manifestation of the issue once the entire timestamp
ID range becomes busy looks like this in dmesg:
mscc_felix 0000:00:00.5: port 0 delivering skb without TX timestamp
mscc_felix 0000:00:00.5: port 1 delivering skb without TX timestamp
At the surface level, we need a timeout timer so that the kernel knows a
timestamp ID is available again. But there is a deeper problem with the
implementation, which is the monotonically increasing ocelot_port->ts_id.
In the presence of packet loss, it will be impossible to detect that and
reuse one of the holes created in the range of free timestamp IDs.
What we actually need is a bitmap of 63 timestamp IDs tracking which one
is available. That is able to use up holes caused by packet loss, but
also gives us a unique opportunity to not implement an actual timer_list
for the timeout timer (very complicated in terms of locking).
We could only declare a timestamp ID stale on demand (lazily), aka when
there's no other timestamp ID available. There are pros and cons to this
approach: the implementation is much more simple than per-packet timers
would be, but most of the stale packets would be quasi-leaked - not
really leaked, but blocked in driver memory, since this algorithm sees
no reason to free them.
An improved technique would be to check for stale timestamp IDs every
time we allocate a new one. Assuming a constant flux of PTP packets,
this avoids stale packets being blocked in memory, but of course,
packets lost at the end of the flux are still blocked until the flux
resumes (nobody left to kick them out).
Since implementing per-packet timers is way too complicated, this should
be good enough.
Testing procedure:
Persistently block traffic class 5 and try to run PTP on it:
$ tc qdisc replace dev swp3 parent root taprio num_tc 8 \
map 0 1 2 3 4 5 6 7 queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
base-time 0 sched-entry S 0xdf 100000 flags 0x2
[ 126.948141] mscc_felix 0000:00:00.5: port 3 tc 5 min gate length 0 ns not enough for max frame size 1526 at 1000 Mbps, dropping frames over 1 octets including FCS
$ ptp4l -i swp3 -2 -P -m --socket_priority 5 --fault_reset_interval ASAP --logSyncInterval -3
ptp4l[70.351]: port 1 (swp3): INITIALIZING to LISTENING on INIT_COMPLETE
ptp4l[70.354]: port 0 (/var/run/ptp4l): INITIALIZING to LISTENING on INIT_COMPLETE
ptp4l[70.358]: port 0 (/var/run/ptp4lro): INITIALIZING to LISTENING on INIT_COMPLETE
[ 70.394583] mscc_felix 0000:00:00.5: port 3 timestamp id 0
ptp4l[70.406]: timed out while polling for tx timestamp
ptp4l[70.406]: increasing tx_timestamp_timeout or increasing kworker priority may correct this issue, but a driver bug likely causes it
ptp4l[70.406]: port 1 (swp3): send peer delay response failed
ptp4l[70.407]: port 1 (swp3): clearing fault immediately
ptp4l[70.952]: port 1 (swp3): new foreign master d858d7.fffe.00ca6d-1
[ 71.394858] mscc_felix 0000:00:00.5: port 3 timestamp id 1
ptp4l[71.400]: timed out while polling for tx timestamp
ptp4l[71.400]: increasing tx_timestamp_timeout or increasing kworker priority may correct this issue, but a driver bug likely causes it
ptp4l[71.401]: port 1 (swp3): send peer delay response failed
ptp4l[71.401]: port 1 (swp3): clearing fault immediately
[ 72.393616] mscc_felix 0000:00:00.5: port 3 timestamp id 2
ptp4l[72.401]: timed out while polling for tx timestamp
ptp4l[72.402]: increasing tx_timestamp_timeout or increasing kworker priority may correct this issue, but a driver bug likely causes it
ptp4l[72.402]: port 1 (swp3): send peer delay response failed
ptp4l[72.402]: port 1 (swp3): clearing fault immediately
ptp4l[72.952]: port 1 (swp3): new foreign master d858d7.fffe.00ca6d-1
[ 73.395291] mscc_felix 0000:00:00.5: port 3 timestamp id 3
ptp4l[73.400]: timed out while polling for tx timestamp
ptp4l[73.400]: increasing tx_timestamp_timeout or increasing kworker priority may correct this issue, but a driver bug likely causes it
ptp4l[73.400]: port 1 (swp3): send peer delay response failed
ptp4l[73.400]: port 1 (swp3): clearing fault immediately
[ 74.394282] mscc_felix 0000:00:00.5: port 3 timestamp id 4
ptp4l[74.400]: timed out while polling for tx timestamp
ptp4l[74.401]: increasing tx_timestamp_timeout or increasing kworker priority may correct this issue, but a driver bug likely causes it
ptp4l[74.401]: port 1 (swp3): send peer delay response failed
ptp4l[74.401]: port 1 (swp3): clearing fault immediately
ptp4l[74.953]: port 1 (swp3): new foreign master d858d7.fffe.00ca6d-1
[ 75.396830] mscc_felix 0000:00:00.5: port 3 invalidating stale timestamp ID 0 which seems lost
[ 75.405760] mscc_felix 0000:00:00.5: port 3 timestamp id 0
ptp4l[75.410]: timed out while polling for tx timestamp
ptp4l[75.411]: increasing tx_timestamp_timeout or increasing kworker priority may correct this issue, but a driver bug likely causes it
ptp4l[75.411]: port 1 (swp3): send peer delay response failed
ptp4l[75.411]: port 1 (swp3): clearing fault immediately
(...)
Remove the blocking condition and see that the port recovers:
$ same tc command as above, but use "sched-entry S 0xff" instead
$ same ptp4l command as above
ptp4l[99.489]: port 1 (swp3): INITIALIZING to LISTENING on INIT_COMPLETE
ptp4l[99.490]: port 0 (/var/run/ptp4l): INITIALIZING to LISTENING on INIT_COMPLETE
ptp4l[99.492]: port 0 (/var/run/ptp4lro): INITIALIZING to LISTENING on INIT_COMPLETE
[ 100.403768] mscc_felix 0000:00:00.5: port 3 invalidating stale timestamp ID 0 which seems lost
[ 100.412545] mscc_felix 0000:00:00.5: port 3 invalidating stale timestamp ID 1 which seems lost
[ 100.421283] mscc_felix 0000:00:00.5: port 3 invalidating stale timestamp ID 2 which seems lost
[ 100.430015] mscc_felix 0000:00:00.5: port 3 invalidating stale timestamp ID 3 which seems lost
[ 100.438744] mscc_felix 0000:00:00.5: port 3 invalidating stale timestamp ID 4 which seems lost
[ 100.447470] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 100.505919] mscc_felix 0000:00:00.5: port 3 timestamp id 0
ptp4l[100.963]: port 1 (swp3): new foreign master d858d7.fffe.00ca6d-1
[ 101.405077] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 101.507953] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 102.405405] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 102.509391] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 103.406003] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 103.510011] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 104.405601] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 104.510624] mscc_felix 0000:00:00.5: port 3 timestamp id 0
ptp4l[104.965]: selected best master clock d858d7.fffe.00ca6d
ptp4l[104.966]: port 1 (swp3): assuming the grand master role
ptp4l[104.967]: port 1 (swp3): LISTENING to GRAND_MASTER on RS_GRAND_MASTER
[ 105.106201] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 105.232420] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 105.359001] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 105.405500] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 105.485356] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 105.511220] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 105.610938] mscc_felix 0000:00:00.5: port 3 timestamp id 0
[ 105.737237] mscc_felix 0000:00:00.5: port 3 timestamp id 0
(...)
Notice that in this new usage pattern, a non-congested port should
basically use timestamp ID 0 all the time, progressing to higher numbers
only if there are unacknowledged timestamps in flight. Compare this to
the old usage, where the timestamp ID used to monotonically increase
modulo OCELOT_MAX_PTP_ID.
In terms of implementation, this simplifies the bookkeeping of the
ocelot_port :: ts_id and ptp_skbs_in_flight. Since we need to traverse
the list of two-step timestampable skbs for each new packet anyway, the
information can already be computed and does not need to be stored.
Also, ocelot_port->tx_skbs is always accessed under the switch-wide
ocelot->ts_id_lock IRQ-unsafe spinlock, so we don't need the skb queue's
lock and can use the unlocked primitives safely.
This problem was actually detected using the tc-taprio offload, and is
causing trouble in TSN scenarios, which Felix (NXP LS1028A / VSC9959)
supports but Ocelot (VSC7514) does not. Thus, I've selected the commit
to blame as the one adding initial timestamping support for the Felix
switch.
Fixes: c0bcf53766 ("net: dsa: ocelot: add hardware timestamping support for Felix")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241205145519.1236778-5-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ocelot_get_txtstamp() is a threaded IRQ handler, requested explicitly as
such by both ocelot_ptp_rdy_irq_handler() and vsc9959_irq_handler().
As such, it runs with IRQs enabled, and not in hardirq context. Thus,
ocelot_port_add_txtstamp_skb() has no reason to turn off IRQs, it cannot
be preempted by ocelot_get_txtstamp(). For the same reason,
dev_kfree_skb_any_reason() will always evaluate as kfree_skb_reason() in
this calling context, so just simplify the dev_kfree_skb_any() call to
kfree_skb().
Also, ocelot_port_txtstamp_request() runs from NET_TX softirq context,
not with hardirqs enabled. Thus, ocelot_get_txtstamp() which shares the
ocelot_port->tx_skbs.lock lock with it, has no reason to disable hardirqs.
This is part of a larger rework of the TX timestamping procedure.
A logical subportion of the rework has been split into a separate
change.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241205145519.1236778-4-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This condition, theoretically impossible to trigger, is not really
handled well. By "continuing", we are skipping the write to SYS_PTP_NXT
which advances the timestamp FIFO to the next entry. So we are reading
the same FIFO entry all over again, printing stack traces and eventually
killing the kernel.
No real problem has been observed here. This is part of a larger rework
of the timestamp IRQ procedure, with this logical change split out into
a patch of its own. We will need to "goto next_ts" for other conditions
as well.
Fixes: 9fde506e0c ("net: mscc: ocelot: warn when a PTP IRQ is raised for an unknown skb")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241205145519.1236778-3-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 66600fac7a ("net: stmmac: TSO: Fix unbalanced DMA map/unmap
for non-paged SKB data") moved the assignment of tx_skbuff_dma[]'s
members to be later in stmmac_tso_xmit().
The buf (dma cookie) and len stored in this structure are passed to
dma_unmap_single() by stmmac_tx_clean(). The DMA API requires that
the dma cookie passed to dma_unmap_single() is the same as the value
returned from dma_map_single(). However, by moving the assignment
later, this is not the case when priv->dma_cap.addr64 > 32 as "des"
is offset by proto_hdr_len.
This causes problems such as:
dwc-eth-dwmac 2490000.ethernet eth0: Tx DMA map failed
and with DMA_API_DEBUG enabled:
DMA-API: dwc-eth-dwmac 2490000.ethernet: device driver tries to +free DMA memory it has not allocated [device address=0x000000ffffcf65c0] [size=66 bytes]
Fix this by maintaining "des" as the original DMA cookie, and use
tso_des to pass the offset DMA cookie to stmmac_tso_allocator().
Full details of the crashes can be found at:
https://lore.kernel.org/all/d8112193-0386-4e14-b516-37c2d838171a@nvidia.com/https://lore.kernel.org/all/klkzp5yn5kq5efgtrow6wbvnc46bcqfxs65nz3qy77ujr5turc@bwwhelz2l4dw/
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Reported-by: Thierry Reding <thierry.reding@gmail.com>
Fixes: 66600fac7a ("net: stmmac: TSO: Fix unbalanced DMA map/unmap for non-paged SKB data")
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Furong Xu <0x1207@gmail.com>
Link: https://patch.msgid.link/E1tJXcx-006N4Z-PC@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus noticed that the new if_not_guard() definition is fragile:
"This macro generates actively wrong code if it happens to be inside an
if-statement or a loop without a block.
IOW, code like this:
for (iterate-over-something)
if_not_guard(a)
return -BUSY;
looks like will build fine, but will generate completely incorrect code."
The reason is that the __if_not_guard() macro is multi-statement, so
while most kernel developers expect macros to be simple or at least
compound statements - but for __if_not_guard() it is not so:
#define __if_not_guard(_name, _id, args...) \
BUILD_BUG_ON(!__is_cond_ptr(_name)); \
CLASS(_name, _id)(args); \
if (!__guard_ptr(_name)(&_id))
To add insult to injury, the placement of the BUILD_BUG_ON() line makes
the macro appear to compile fine, but it will generate incorrect code
as Linus reported, for example if used within iteration or conditional
statements that will use the first statement of a macro as a loop body
or conditional statement body.
[ I'd also like to note that the original submission by David Lechner did
not contain the BUILD_BUG_ON() line, so it was safer than what we ended
up committing. Mea culpa. ]
It doesn't appear to be possible to turn this macro into a robust
single or compound statement that could be used in single statements,
due to the necessity to define an auto scope variable with an open
scope and the necessity of it having to expand to a partial 'if'
statement with no body.
Instead of trying to work around this fragility, just remove the
construct before it gets used.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: David Lechner <dlechner@baylibre.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/Z1LBnX9TpZLR5Dkf@gmail.com
Michael Chan says:
====================
bnxt_en: Bug fixes
There are 2 bug fixes in this series. This first one fixes the issue
of setting the gso_type incorrectly for HW GRO packets on 5750X (Thor)
chips. This can cause HW GRO packets to be dropped by the stack if
they are re-segmented. The second one fixes a potential division by
zero crash when dumping FW log coredump.
====================
Link: https://patch.msgid.link/20241204215918.1692597-1-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The existing code is using RSS profile to determine IPV4/IPV6 GSO type
on all chips older than 5760X. This won't work on 5750X chips that may
be using modified RSS profiles. This commit from 2018 has updated the
driver to not use RSS profile for HW GRO packets on newer chips:
50f011b63d ("bnxt_en: Update RSS setup and GRO-HW logic according to the latest spec.")
However, a recent commit to add support for the newest 5760X chip broke
the logic. If the GRO packet needs to be re-segmented by the stack, the
wrong GSO type will cause the packet to be dropped.
Fix it to only use RSS profile to determine GSO type on the oldest
5730X/5740X chips which cannot use the new method and is safe to use the
RSS profiles.
Also fix the L3/L4 hash type for RX packets by not using the RSS
profile for the same reason. Use the ITYPE field in the RX completion
to determine L3/L4 hash types correctly.
Fixes: a7445d6980 ("bnxt_en: Add support for new RX and TPA_START completion types for P7")
Reviewed-by: Colin Winegarden <colin.winegarden@broadcom.com>
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://patch.msgid.link/20241204215918.1692597-2-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Petr Machata says:
====================
selftests: mlxsw: Add few fixes for sharedbuffer test
Danielle Ratson writes:
Currently, the sharedbuffer test fails sometimes because it is reading a
maximum occupancy that is larger than expected on some different cases.
This is happening because the test assumes that the packet it is sending
is the only packet being passed to the device.
In addition, some duplications on one hand, and redundant test cases on
the other hand, were found in the test.
Add egress filters on h1 and h2 that will guarantee that the packets in
the buffer are sent in the test, and remove the redundant test cases.
====================
Link: https://patch.msgid.link/cover.1733414773.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For historical reasons, the legacy decompressor code on various
architectures supports 7 different compression types for the compressed
kernel image.
EFI zboot is not a compression library museum, and so the options can be
limited to what is likely to be useful in practice:
- GZIP is tried and tested, and is still one of the fastest at
decompression time, although the compression ratio is not very high;
moreover, Fedora is already shipping EFI zboot kernels for arm64 that
use GZIP, and QEMU implements direct support for it when booting a
kernel without firmware loaded;
- ZSTD has a very high compression ratio (although not the highest), and
is almost as fast as GZIP at decompression time.
Reducing the number of options makes it less of a hassle for other
consumers of the EFI zboot format (such as QEMU today, and kexec in the
future) to support it transparently without having to carry 7 different
decompression libraries.
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Commit e546fe1da9 ("block: Rework bio_split() return value") changed
bio_split() so that it can return errors.
Add error handling for it in btrfs_split_bio() and ultimately
btrfs_submit_chunk(). As the bio is not submitted, the bio counter must
be decremented to pair btrfs_bio_counter_inc_blocked().
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Before commit e820dbeb6a ("btrfs: convert btrfs_buffered_write() to
use folios"), function prepare_one_folio() will always wait for folio
writeback to finish before returning the folio.
However commit e820dbeb6a ("btrfs: convert btrfs_buffered_write() to
use folios") changed to use FGP_STABLE to do the writeback wait, but
FGP_STABLE is calling folio_wait_stable(), which only calls
folio_wait_writeback() if the address space has AS_STABLE_WRITES, which
is not set for btrfs inodes.
This means we will not wait for the folio writeback at all.
[CAUSE]
The cause is FGP_STABLE is not waiting for writeback unconditionally, but
only for address spaces with AS_STABLE_WRITES, normally such flag is set
when the super block has SB_I_STABLE_WRITES flag.
Such super block flag is set when the block device has hardware digest
support or has internal checksum requirement.
I'd argue btrfs should set such super block due to its default data
checksum behavior, but it is not set yet, so this means FGP_STABLE flag
will have no effect at all.
(For NODATASUM inodes, we can skip the waiting in theory but that should
be an optimization in the future.)
This can lead to data checksum mismatch, as we can modify the folio
while it's still under writeback, this will make the contents differ
from the contents at submission and checksum calculation.
[FIX]
Instead of fully relying on FGP_STABLE, manually do the folio writeback
waiting, until we set the address space or super flag.
Fixes: e820dbeb6a ("btrfs: convert btrfs_buffered_write() to use folios")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With the __counted_by annocation in cfg80211_scan_request struct,
the "n_channels" struct member must be set before accessing the
"channels" array. Failing to do so will trigger a runtime warning
when enabling CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE.
Fixes: e3eac9f32e ("wifi: cfg80211: Annotate struct cfg80211_scan_request with __counted_by")
Signed-off-by: Haoyu Li <lihaoyu499@gmail.com>
Link: https://patch.msgid.link/20241203152049.348806-1-lihaoyu499@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Commit 7c7e6c8924 ("tty: serial: handle HAS_IOPORT dependencies")
triggers warning backtraces on a number of platforms which don't support
IO ports.
WARNING: CPU: 0 PID: 0 at drivers/tty/serial/8250/8250_port.c:470 serial8250_set_defaults+0x148/0x1d8
Unsupported UART type 0
The problem is seen because serial8250_set_defaults() is called for
all members of the serial8250_ports[] array even if that array is
not initialized.
Work around the problem by only displaying the warning if the port
type is not 0 (UPIO_PORT) or if iobase is set for the port.
Fixes: 7c7e6c8924 ("tty: serial: handle HAS_IOPORT dependencies")
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Stafford Horne <shorne@gmail.com>
Link: https://lore.kernel.org/r/20241205143033.2695333-1-linux@roeck-us.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The dr_domain_add_vport_cap() function generally returns NULL on error
but sometimes we want it to return ERR_PTR(-EBUSY) so the caller can
retry. The problem here is that "ret" can be either -EBUSY or -ENOMEM
and if it's and -ENOMEM then the error pointer is propogated back and
eventually dereferenced in dr_ste_v0_build_src_gvmi_qpn_tag().
Fixes: 11a45def2e ("net/mlx5: DR, Add support for SF vports")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/07477254-e179-43e2-b1b3-3b9db4674195@stanley.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When TT changes list is too big to fit in packet due to MTU size, an
empty OGM is sent expected other node to send TT request to get the
changes. The issue is that tt.last_changeset was not built thus the
originator was responding with previous changes to those TT requests
(see batadv_send_my_tt_response). Also the changes list was never
cleaned up effectively never ending growing from this point onwards,
repeatedly sending the same TT response changes over and over, and
creating a new empty OGM every OGM interval expecting for the local
changes to be purged.
When there is more TT changes that can fit in packet, drop all changes,
send empty OGM and wait for TT request so we can respond with a full
table instead.
Fixes: e1bf0c1409 ("batman-adv: tvlv - convert tt data sent within OGMs")
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Acked-by: Antonio Quartulli <Antonio@mandelbit.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The number of entries filled by batadv_tt_tvlv_generate() can be less
than initially expected in batadv_tt_prepare_tvlv_{global,local}_data()
(changes can be removed by batadv_tt_local_event() in ADD+DEL sequence
in the meantime as the lock held during the whole tvlv global/local data
generation).
Thus tvlv_len could be bigger than the actual TT entry size that need
to be sent so full table TT_RESPONSE could hold invalid TT entries such
as below.
* 00:00:00:00:00:00 -1 [....] ( 0) 88:12:4e:ad:7e:ba (179) (0x45845380)
* 00:00:00:00:78:79 4092 [.W..] ( 0) 88:12:4e:ad:7e:3c (145) (0x8ebadb8b)
Remove the extra allocated space to avoid sending uninitialized entries
for full table TT_RESPONSE in both batadv_send_other_tt_response() and
batadv_send_my_tt_response().
Fixes: 7ea7b4a142 ("batman-adv: make the TT CRC logic VLAN specific")
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
The number of TT changes can be less than initially expected in
batadv_tt_tvlv_container_update() (changes can be removed by
batadv_tt_local_event() in ADD+DEL sequence between reading
tt_diff_entries_num and actually iterating the change list under lock).
Thus tt_diff_len could be bigger than the actual changes size that need
to be sent. Because batadv_send_my_tt_response sends the whole
packet, uninitialized data can be interpreted as TT changes on other
nodes leading to weird TT global entries on those nodes such as:
* 00:00:00:00:00:00 -1 [....] ( 0) 88:12:4e:ad:7e:ba (179) (0x45845380)
* 00:00:00:00:78:79 4092 [.W..] ( 0) 88:12:4e:ad:7e:3c (145) (0x8ebadb8b)
All of the above also applies to OGM tvlv container buffer's tvlv_len.
Remove the extra allocated space to avoid sending uninitialized TT
changes in batadv_send_my_tt_response() and batadv_v_ogm_send_softif().
Fixes: e1bf0c1409 ("batman-adv: tvlv - convert tt data sent within OGMs")
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Set the default workload type to bootup type on smu v13.0.7.
This is because of the constraint on smu v13.0.7.
Gfx activity has an even higher set point on 3D fullscreen
mode than the one on bootup mode. This causes the 3D fullscreen
mode's performance is worse than the bootup mode's performance
for the lightweighted/medium workload. For the high workload,
the performance is the same between 3D fullscreen mode and bootup
mode.
v2: set the default workload in ASIC specific file
Signed-off-by: Kenneth Feng <kenneth.feng@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.11.x
Refactor such that individual SMU IP versions can choose the startup
power profile mode. If no preference, then use the generic default power
profile selection logic.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.11.x
base.sched may not be set for each instance and should not
be used for cases such as non-IB tests.
Fixes: 2320c9e6a7 ("drm/sched: memset() 'job' in drm_sched_job_init()")
Signed-off-by: David (Ming Qiang) Wu <David.Wu3@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Use second jump table in sriov for live migration or mulitple VF
support so different VF can load different version of MEC as long
as they support sjt
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Align the page tracking maximum message size with the device's
capability instead of relying on PAGE_SIZE.
This adjustment resolves a mismatch on systems where PAGE_SIZE is 64K,
but the firmware only supports a maximum message size of 4K.
Now that we rely on the device's capability for max_message_size, we
must account for potential future increases in its value.
Key considerations include:
- Supporting message sizes that exceed a single system page (e.g., an 8K
message on a 4K system).
- Ensuring the RQ size is adjusted to accommodate at least 4
WQEs/messages, in line with the device specification.
The above has been addressed as part of the patch.
Fixes: 79c3cf2799 ("vfio/mlx5: Init QP based resources for dirty tracking")
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Tested-by: Yingshun Cui <yicui@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20241205122654.235619-1-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
These days, the Fixed Virtual Platforms(FVP) Base RevC model supports
more PCI devices. Update the max bus number so that Linux can enumerate
them correctly. Without this, the kernel throws the below error while
booting with the default hierarchy
| pci_bus 0000:01: busn_res: [bus 01] end is updated to 01
| pci_bus 0000:02: busn_res: can not insert [bus 02-01] under
| [bus 00-01] (conflicts with (null) [bus 00-01])
| pci_bus 0000:02: busn_res: [bus 02-01] end is updated to 02
| pci_bus 0000:02: busn_res: can not insert [bus 02] under
| [bus 00-01] (conflicts with (null) [bus 00-01])
| pci_bus 0000:03: busn_res: can not insert [bus 03-01] under
| [bus 00-01] (conflicts with (null) [bus 00-01])
| pci_bus 0000:03: busn_res: [bus 03-01] end is updated to 03
| pci_bus 0000:03: busn_res: can not insert [bus 03] under
| [bus 00-01] (conflicts with (null) [bus 00-01])
| pci_bus 0000:04: busn_res: can not insert [bus 04-01] under
| [bus 00-01] (conflicts with (null) [bus 00-01])
| pci_bus 0000:04: busn_res: [bus 04-01] end is updated to 04
| pci_bus 0000:04: busn_res: can not insert [bus 04] under
| [bus 00-01] (conflicts with (null) [bus 00-01])
| pci 0000:00:01.0: BAR 14: assigned [mem 0x50000000-0x500fffff]
| pci-host-generic 40000000.pci: ECAM at [mem 0x40000000-0x4fffffff]
| for [bus 00-01]
The change is using 0xff as max bus number because the ECAM window is
256MB in size. Below is the lspci output with and without the change:
without fix
===========
| 00:00.0 Host bridge: ARM Device 00ba (rev 01)
| 00:01.0 PCI bridge: ARM Device 0def
| 00:02.0 PCI bridge: ARM Device 0def
| 00:03.0 PCI bridge: ARM Device 0def
| 00:04.0 PCI bridge: ARM Device 0def
| 00:1e.0 Unassigned class [ff00]: ARM Device ff80
| 00:1e.1 Unassigned class [ff00]: ARM Device ff80
| 00:1f.0 SATA controller: Device 0abc:aced (rev 01)
| 01:00.0 SATA controller: Device 0abc:aced (rev 01)
with fix
========
| 00:00.0 Host bridge: ARM Device 00ba (rev 01)
| 00:01.0 PCI bridge: ARM Device 0def
| 00:02.0 PCI bridge: ARM Device 0def
| 00:03.0 PCI bridge: ARM Device 0def
| 00:04.0 PCI bridge: ARM Device 0def
| 00:1e.0 Unassigned class [ff00]: ARM Device ff80
| 00:1e.1 Unassigned class [ff00]: ARM Device ff80
| 00:1f.0 SATA controller: Device 0abc:aced (rev 01)
| 01:00.0 SATA controller: Device 0abc:aced (rev 01)
| 02:00.0 Unassigned class [ff00]: ARM Device ff80
| 02:00.4 Unassigned class [ff00]: ARM Device ff80
| 03:00.0 PCI bridge: ARM Device 0def
| 04:00.0 PCI bridge: ARM Device 0def
| 04:01.0 PCI bridge: ARM Device 0def
| 04:02.0 PCI bridge: ARM Device 0def
| 05:00.0 SATA controller: Device 0abc:aced (rev 01)
| 06:00.0 Unassigned class [ff00]: ARM Device ff80
| 06:00.7 Unassigned class [ff00]: ARM Device ff80
| 07:00.0 Unassigned class [ff00]: ARM Device ff80
| 07:00.3 Unassigned class [ff00]: ARM Device ff80
| 08:00.0 Unassigned class [ff00]: ARM Device ff80
| 08:00.1 Unassigned class [ff00]: ARM Device ff80
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Lorenzo Pieralisi <lpieralisi@kernel.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Krzysztof Kozlowski <krzk+dt@kernel.org>
Cc: Conor Dooley <conor+dt@kernel.org>
Reviewed-by: Liviu Dudau <liviu.dudau@arm.com>
Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
Message-Id: <20241128152543.1821878-1-aneesh.kumar@kernel.org>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
It should only have generic flags in the array but the recent header
sync brought a new flags to fcntl.h and caused a build error. Let's
update the shell script to exclude flags specific to name_to_handle_at().
CC trace/beauty/fs_at_flags.o
In file included from trace/beauty/fs_at_flags.c:21:
tools/perf/trace/beauty/generated/fs_at_flags_array.c:13:30: error: initialized field overwritten [-Werror=override-init]
13 | [ilog2(0x002) + 1] = "HANDLE_CONNECTABLE",
| ^~~~~~~~~~~~~~~~~~~~
tools/perf/trace/beauty/generated/fs_at_flags_array.c:13:30: note: (near initialization for ‘fs_at_flags[2]’)
Reviewed-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20241203035349.1901262-12-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
To pick up the changes in this cset:
6140be90ec ("fs/xattr: add *at family syscalls")
This addresses these perf build warnings:
Warning: Kernel ABI header differences:
diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h
diff -u tools/perf/arch/x86/entry/syscalls/syscall_32.tbl arch/x86/entry/syscalls/syscall_32.tbl
diff -u tools/perf/arch/x86/entry/syscalls/syscall_64.tbl arch/x86/entry/syscalls/syscall_64.tbl
diff -u tools/perf/arch/powerpc/entry/syscalls/syscall.tbl arch/powerpc/kernel/syscalls/syscall.tbl
diff -u tools/perf/arch/s390/entry/syscalls/syscall.tbl arch/s390/kernel/syscalls/syscall.tbl
diff -u tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl arch/mips/kernel/syscalls/syscall_n64.tbl
The arm64 changes are not included as it requires more changes in the
tools. It'll be worked for the later cycle.
Please see tools/include/uapi/README for further details.
Reviewed-by: James Clark <james.clark@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Brauner <brauner@kernel.org>
CC: x86@kernel.org
CC: linux-mips@vger.kernel.org
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-s390@vger.kernel.org
Link: https://lore.kernel.org/r/20241203035349.1901262-7-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
To pick up the changes in this cset:
a0423af92c ("x86: KVM: Advertise CPUIDs for new instructions in Clearwater Forest")
0c487010cb ("x86/cpufeatures: Add X86_FEATURE_AMD_WORKLOAD_CLASS feature bit")
1ad4667066 ("x86/cpufeatures: Add X86_FEATURE_AMD_HETEROGENEOUS_CORES")
104edc6efc ("x86/cpufeatures: Rename X86_FEATURE_FAST_CPPC to have AMD prefix")
3ea87dfa31 ("x86/cpufeatures: Add a IBPB_NO_RET BUG flag")
ff898623af ("x86/cpufeatures: Define X86_FEATURE_AMD_IBPB_RET")
dcb988cdac ("KVM: x86: Quirk initialization of feature MSRs to KVM's max configuration")
This addresses these perf build warnings:
Warning: Kernel ABI header differences:
diff -u tools/arch/x86/include/asm/cpufeatures.h arch/x86/include/asm/cpufeatures.h
diff -u tools/arch/x86/include/uapi/asm/kvm.h arch/x86/include/uapi/asm/kvm.h
Please see tools/include/uapi/README for further details.
Reviewed-by: James Clark <james.clark@linaro.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Link: https://lore.kernel.org/r/20241203035349.1901262-5-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
To pick up the changes in this cset:
18d92bb57c ("perf/core: Add aux_pause, aux_resume, aux_start_paused")
This addresses these perf build warnings:
Warning: Kernel ABI header differences:
diff -u tools/include/uapi/linux/perf_event.h include/uapi/linux/perf_event.h
Please see tools/include/uapi/README for further details.
Reviewed-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20241203035349.1901262-3-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
After commit 78ecb03756 ("staging: gpib: make port I/O code
conditional"), building tnt4882.ko on platforms without HAS_IOPORT (such
as hexagon and s390) fails with:
ERROR: modpost: "inb_wrapper" [drivers/staging/gpib/tnt4882/tnt4882.ko] undefined!
ERROR: modpost: "inw_wrapper" [drivers/staging/gpib/tnt4882/tnt4882.ko] undefined!
ERROR: modpost: "nec7210_locking_ioport_write_byte" [drivers/staging/gpib/tnt4882/tnt4882.ko] undefined!
ERROR: modpost: "nec7210_locking_ioport_read_byte" [drivers/staging/gpib/tnt4882/tnt4882.ko] undefined!
ERROR: modpost: "outb_wrapper" [drivers/staging/gpib/tnt4882/tnt4882.ko] undefined!
ERROR: modpost: "outw_wrapper" [drivers/staging/gpib/tnt4882/tnt4882.ko] undefined!
Only allow tnt4882.ko to be built when CONFIG_HAS_IOPORT is set to avoid
this build failure, as this driver unconditionally needs it.
Fixes: 78ecb03756 ("staging: gpib: make port I/O code conditional")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/r/20241123-gpib-tnt4882-depends-on-has_ioport-v1-1-033c58b64751@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On the Renesas RZ/G3S, when doing suspend to RAM, the uart_suspend_port()
is called. The uart_suspend_port() calls 3 times the
struct uart_port::ops::tx_empty() before shutting down the port.
According to the documentation, the struct uart_port::ops::tx_empty()
API tests whether the transmitter FIFO and shifter for the port is
empty.
The Renesas RZ/G3S SCIFA IP reports the number of data units stored in the
transmit FIFO through the FDR (FIFO Data Count Register). The data units
in the FIFOs are written in the shift register and transmitted from there.
The TEND bit in the Serial Status Register reports if the data was
transmitted from the shift register.
In the previous code, in the tx_empty() API implemented by the sh-sci
driver, it is considered that the TX is empty if the hardware reports the
TEND bit set and the number of data units in the FIFO is zero.
According to the HW manual, the TEND bit has the following meaning:
0: Transmission is in the waiting state or in progress.
1: Transmission is completed.
It has been noticed that when opening the serial device w/o using it and
then switch to a power saving mode, the tx_empty() call in the
uart_port_suspend() function fails, leading to the "Unable to drain
transmitter" message being printed on the console. This is because the
TEND=0 if nothing has been transmitted and the FIFOs are empty. As the
TEND=0 has double meaning (waiting state, in progress) we can't
determined the scenario described above.
Add a software workaround for this. This sets a variable if any data has
been sent on the serial console (when using PIO) or if the DMA callback has
been called (meaning something has been transmitted). In the tx_empty()
API the status of the DMA transaction is also checked and if it is
completed or in progress the code falls back in checking the hardware
registers instead of relying on the software variable.
Fixes: 73a19e4c03 ("serial: sh-sci: Add DMA support.")
Cc: stable@vger.kernel.org
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Link: https://lore.kernel.org/r/20241125115856.513642-1-claudiu.beznea.uj@bp.renesas.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
OPM PPM LPM
| 1.send cmd | |
|-------------------------->| |
| |-- |
| | | 2.set busy bit in CCI |
| |<- |
| 3.notify the OPM | |
|<--------------------------| |
| | 4.send cmd to be executed |
| |-------------------------->|
| | |
| | 5.cmd completed |
| |<--------------------------|
| | |
| |-- |
| | | 6.set cmd completed |
| |<- bit in CCI |
| | |
| 7.notify the OPM | |
|<--------------------------| |
| | |
| 8.handle notification | |
| from point 3, read CCI | |
|<--------------------------| |
| | |
When the PPM receives command from the OPM (p.1) it sets the busy bit
in the CCI (p.2), sends notification to the OPM (p.3) and forwards the
command to be executed by the LPM (p.4). When the PPM receives command
completion from the LPM (p.5) it sets command completion bit in the CCI
(p.6) and sends notification to the OPM (p.7). If command execution by
the LPM is fast enough then when the OPM starts handling the notification
from p.3 in p.8 and reads the CCI value it will see command completion bit
set and will call complete(). Then complete() might be called again when
the OPM handles notification from p.7.
This fix replaces test_bit() with test_and_clear_bit()
in ucsi_notify_common() in order to call complete() only
once per request.
This fix also reinitializes completion variable in
ucsi_sync_control_common() before a command is sent.
Fixes: 584e8df589 ("usb: typec: ucsi: extract common code for command handling")
Cc: stable@vger.kernel.org
Signed-off-by: Łukasz Bartosik <ukaszb@chromium.org>
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Reviewed-by: Benson Leung <bleung@chromium.org>
Link: https://lore.kernel.org/r/20241203102318.3386345-1-ukaszb@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On Raspberry Pis without onboard USB hub frequent device reconnects
can trigger a interrupt storm after DWC2 entered host clock gating.
This is caused by a race between _dwc2_hcd_suspend() and the port
interrupt, which sets port_connect_status. The issue occurs if
port_connect_status is still 1, but there is no connection anymore:
usb 1-1: USB disconnect, device number 25
dwc2 3f980000.usb: _dwc2_hcd_suspend: port_connect_status: 1
dwc2 3f980000.usb: Entering host clock gating.
Disabling IRQ #66
irq 66: nobody cared (try booting with the "irqpoll" option)
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.12.0-gc1bb81b13202-dirty #322
Hardware name: BCM2835
Call trace:
unwind_backtrace from show_stack+0x10/0x14
show_stack from dump_stack_lvl+0x50/0x64
dump_stack_lvl from __report_bad_irq+0x38/0xc0
__report_bad_irq from note_interrupt+0x2ac/0x2f4
note_interrupt from handle_irq_event+0x88/0x8c
handle_irq_event from handle_level_irq+0xb4/0x1ac
handle_level_irq from generic_handle_domain_irq+0x24/0x34
generic_handle_domain_irq from bcm2836_chained_handle_irq+0x24/0x28
bcm2836_chained_handle_irq from generic_handle_domain_irq+0x24/0x34
generic_handle_domain_irq from generic_handle_arch_irq+0x34/0x44
generic_handle_arch_irq from __irq_svc+0x88/0xb0
Exception stack(0xc1d01f20 to 0xc1d01f68)
1f20: 0004ef3c 00000001 00000000 00000000 c1d09780 c1f6bb5c c1d04e54 c1c60ca8
1f40: c1d04e94 00000000 00000000 c1d092a8 c1f6af20 c1d01f70 c1211b98 c1212f40
1f60: 60000013 ffffffff
__irq_svc from default_idle_call+0x1c/0xb0
default_idle_call from do_idle+0x21c/0x284
do_idle from cpu_startup_entry+0x28/0x2c
cpu_startup_entry from kernel_init+0x0/0x12c
handlers:
[<e3a25c00>] dwc2_handle_common_intr
[<58bf98a3>] usb_hcd_irq
Disabling IRQ #66
So avoid this by reading the connection status directly.
Fixes: 113f86d0c3 ("usb: dwc2: Update partial power down entering by system suspend")
Signed-off-by: Stefan Wahren <wahrenst@gmx.net>
Link: https://lore.kernel.org/r/20241202001631.75473-4-wahrenst@gmx.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
On Rasperry Pis without onboard USB hub the power cycle during
power connect init only disable the port but never enabled it again:
usb usb1-port1: attempt power cycle
The port relevant part in dwc2_hcd_hub_control() is skipped in case
port_connect_status = 0 under the assumption the core is or will be soon
in device mode. But this assumption is wrong, because after ClearPortFeature
USB_PORT_FEAT_POWER the port_connect_status will also be 0 and
SetPortFeature (incl. USB_PORT_FEAT_POWER) will be a no-op.
Fix the behavior of dwc2_hcd_hub_control() by replacing the
port_connect_status check with dwc2_is_device_mode().
Link: https://github.com/raspberrypi/linux/issues/6247
Fixes: 7359d482eb ("staging: HCD files for the DWC2 driver")
Signed-off-by: Stefan Wahren <wahrenst@gmx.net>
Link: https://lore.kernel.org/r/20241202001631.75473-3-wahrenst@gmx.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The Raspberry Pi can suffer on interrupt storms on HCD resume. The dwc2
driver sometimes misses to enable HCD_FLAG_HW_ACCESSIBLE before re-enabling
the interrupts. This causes a situation where both handler ignore a incoming
port interrupt and force the upper layers to disable the dwc2 interrupt
line. This leaves the USB interface in a unusable state:
irq 66: nobody cared (try booting with the "irqpoll" option)
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 6.10.0-rc3
Hardware name: BCM2835
Call trace:
unwind_backtrace from show_stack+0x10/0x14
show_stack from dump_stack_lvl+0x50/0x64
dump_stack_lvl from __report_bad_irq+0x38/0xc0
__report_bad_irq from note_interrupt+0x2ac/0x2f4
note_interrupt from handle_irq_event+0x88/0x8c
handle_irq_event from handle_level_irq+0xb4/0x1ac
handle_level_irq from generic_handle_domain_irq+0x24/0x34
generic_handle_domain_irq from bcm2836_chained_handle_irq+0x24/0x28
bcm2836_chained_handle_irq from generic_handle_domain_irq+0x24/0x34
generic_handle_domain_irq from generic_handle_arch_irq+0x34/0x44
generic_handle_arch_irq from __irq_svc+0x88/0xb0
Exception stack(0xc1b01f20 to 0xc1b01f68)
1f20: 0005c0d4 00000001 00000000 00000000 c1b09780 c1d6b32c c1b04e54 c1a5eae8
1f40: c1b04e90 00000000 00000000 00000000 c1d6a8a0 c1b01f70 c11d2da8 c11d4160
1f60: 60000013 ffffffff
__irq_svc from default_idle_call+0x1c/0xb0
default_idle_call from do_idle+0x21c/0x284
do_idle from cpu_startup_entry+0x28/0x2c
cpu_startup_entry from kernel_init+0x0/0x12c
handlers:
[<f539e0f4>] dwc2_handle_common_intr
[<75cd278b>] usb_hcd_irq
Disabling IRQ #66
So enable the HCD_FLAG_HW_ACCESSIBLE flag in case there is a port
connection.
Fixes: c74c26f6e3 ("usb: dwc2: Fix partial power down exiting by system resume")
Closes: https://lore.kernel.org/linux-usb/3fd0c2fb-4752-45b3-94eb-42352703e1fd@gmx.net/T/
Link: https://lore.kernel.org/all/5e8cbce0-3260-2971-484f-fc73a3b2bd28@synopsys.com/
Signed-off-by: Stefan Wahren <wahrenst@gmx.net>
Link: https://lore.kernel.org/r/20241202001631.75473-2-wahrenst@gmx.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Considering that in some extreme cases,
when u_serial driver is accessed by multiple threads,
Thread A is executing the open operation and calling the gs_open,
Thread B is executing the disconnect operation and calling the
gserial_disconnect function,The port->port_usb pointer will be set to NULL.
E.g.
Thread A Thread B
gs_open() gadget_unbind_driver()
gs_start_io() composite_disconnect()
gs_start_rx() gserial_disconnect()
... ...
spin_unlock(&port->port_lock)
status = usb_ep_queue() spin_lock(&port->port_lock)
spin_lock(&port->port_lock) port->port_usb = NULL
gs_free_requests(port->port_usb->in) spin_unlock(&port->port_lock)
Crash
This causes thread A to access a null pointer (port->port_usb is null)
when calling the gs_free_requests function, causing a crash.
If port_usb is NULL, the release request will be skipped as it
will be done by gserial_disconnect.
So add a null pointer check to gs_start_io before attempting
to access the value of the pointer port->port_usb.
Call trace:
gs_start_io+0x164/0x25c
gs_open+0x108/0x13c
tty_open+0x314/0x638
chrdev_open+0x1b8/0x258
do_dentry_open+0x2c4/0x700
vfs_open+0x2c/0x3c
path_openat+0xa64/0xc60
do_filp_open+0xb8/0x164
do_sys_openat2+0x84/0xf0
__arm64_sys_openat+0x70/0x9c
invoke_syscall+0x58/0x114
el0_svc_common+0x80/0xe0
do_el0_svc+0x1c/0x28
el0_svc+0x38/0x68
Fixes: c1dca562be ("usb gadget: split out serial core")
Cc: stable@vger.kernel.org
Suggested-by: Prashanth K <quic_prashk@quicinc.com>
Signed-off-by: Lianqin Hu <hulianqin@vivo.com>
Acked-by: Prashanth K <quic_prashk@quicinc.com>
Link: https://lore.kernel.org/r/TYUPR06MB62178DC3473F9E1A537DCD02D2362@TYUPR06MB6217.apcprd06.prod.outlook.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
USB5744 SMBus initialization is done once in probe() and doing it in resume
is not supported so avoid going into suspend and reset the HUB.
There is a sysfs property 'always_powered_in_suspend' to implement this
feature but since default state should be set to a working configuration
so override this property value.
It fixes the suspend/resume testcase on Kria KR260 Robotics Starter Kit.
Fixes: 6782311d04 ("usb: misc: onboard_usb_dev: add Microchip usb5744 SMBus programming support")
Cc: stable@vger.kernel.org
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Link: https://lore.kernel.org/r/1733165302-1694891-1-git-send-email-radhey.shyam.pandey@amd.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When the USB3 PHY is not defined in the Linux device tree, there could
still be a case where there is a USB3 PHY active on the board and enabled
by the first stage bootloader. If serdes clock is being used then the USB
will fail to enumerate devices in 2.0 only mode.
To solve this, make sure that the PIPE clock is deselected whenever the
USB3 PHY is not defined and guarantees that the USB2 only mode will work
in all cases.
Fixes: 9678f3361a ("usb: dwc3: xilinx: Skip resets and USB3 register settings for USB2.0 mode")
Cc: stable@vger.kernel.org
Signed-off-by: Neal Frager <neal.frager@amd.com>
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@amd.com>
Acked-by: Peter Korsgaard <peter@korsgaard.com>
Link: https://lore.kernel.org/r/1733163111-1414816-1-git-send-email-radhey.shyam.pandey@amd.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Before commit 53a2d95df8 ("usb: core: add phy notify connect and
disconnect"), phy initialization will be skipped even when shared hcd
doesn't set skip_phy_initialization flag. However, the situation is
changed after the commit. The hcd.c will initialize phy when add shared
hcd. This behavior is unexpected for some platforms which will handle phy
initialization by themselves. To avoid the issue, this will only check
skip_phy_initialization flag of primary hcd since shared hcd normally
follow primary hcd setting.
Fixes: 53a2d95df8 ("usb: core: add phy notify connect and disconnect")
Cc: stable@vger.kernel.org
Signed-off-by: Xu Yang <xu.yang_2@nxp.com>
Link: https://lore.kernel.org/r/20241105090120.2438366-1-xu.yang_2@nxp.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The UMP Function Block info m1.0 field (represented by is_midi1 sysfs
entry) is an enumeration from 0 to 2, while the midi2 gadget driver
incorrectly copies it to the corresponding snd_ump_block_info.flags
bits as-is. This made the wrong bit flags set when m1.0 = 2.
This patch corrects the wrong interpretation of is_midi1 bits.
Fixes: 29ee7a4ddd ("usb: gadget: midi2: Add configfs support")
Cc: stable@vger.kernel.org
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Link: https://lore.kernel.org/r/20241127070213.8232-1-tiwai@suse.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When unbind and bind the device again, kernel will dump below warning:
[ 173.972130] sysfs: cannot create duplicate filename '/devices/platform/soc/4c010010.usb/software_node'
[ 173.981564] CPU: 2 UID: 0 PID: 536 Comm: sh Not tainted 6.12.0-rc6-06344-g2aed7c4a5c56 #144
[ 173.989923] Hardware name: NXP i.MX95 15X15 board (DT)
[ 173.995062] Call trace:
[ 173.997509] dump_backtrace+0x90/0xe8
[ 174.001196] show_stack+0x18/0x24
[ 174.004524] dump_stack_lvl+0x74/0x8c
[ 174.008198] dump_stack+0x18/0x24
[ 174.011526] sysfs_warn_dup+0x64/0x80
[ 174.015201] sysfs_do_create_link_sd+0xf0/0xf8
[ 174.019656] sysfs_create_link+0x20/0x40
[ 174.023590] software_node_notify+0x90/0x100
[ 174.027872] device_create_managed_software_node+0xec/0x108
...
The '4c010010.usb' device is a platform device created during the initcall
and is never removed, which causes its associated software node to persist
indefinitely.
The existing device_create_managed_software_node() does not provide a
corresponding removal function.
Replace device_create_managed_software_node() with the
device_add_software_node() and device_remove_software_node() pair to ensure
proper addition and removal of software nodes, addressing this issue.
Fixes: a9400f1979 ("usb: dwc3: imx8mp: add 2 software managed quirk properties for host mode")
Cc: stable@vger.kernel.org
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Signed-off-by: Xu Yang <xu.yang_2@nxp.com>
Acked-by: Thinh Nguyen <Thinh.Nguyen@synopsys.com>
Link: https://lore.kernel.org/r/20241126032841.2458338-1-xu.yang_2@nxp.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The refcounts of the OF nodes obtained by of_get_child_by_name() calls
in anx7411_typec_switch_probe() are not decremented. Replace them with
device_get_named_child_node() calls and store the return values to the
newly created fwnode_handle fields in anx7411_data, and call
fwnode_handle_put() on them in the error path and in the unregister
functions.
Fixes: e45d7337dc ("usb: typec: anx7411: Use of_get_child_by_name() instead of of_find_node_by_name()")
Cc: stable@vger.kernel.org
Signed-off-by: Joe Hattori <joe@pf.is.s.u-tokyo.ac.jp>
Reviewed-by: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Link: https://lore.kernel.org/r/20241126014909.3687917-1-joe@pf.is.s.u-tokyo.ac.jp
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If the current USB request was aborted, the spi thread would not respond
to any further requests. This is because the "curr_urb" pointer would
not become NULL, so no further requests would be taken off the queue.
The solution here is to set the "urb_done" flag, as this will cause the
correct handling of the URB. Also clear interrupts that should only be
expected if an URB is in progress.
Fixes: 2d53139f31 ("Add support for using a MAX3421E chip as a host driver.")
Cc: stable <stable@kernel.org>
Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz>
Link: https://lore.kernel.org/r/20241124221430.1106080-1-mark.tomlinson@alliedtelesis.co.nz
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Currently, ffa_dev->properties is set after the ffa_device_register()
call return in ffa_setup_partitions(). This could potentially result in
a race where the partition's properties is accessed while probing
struct ffa_device before it is set.
Update the ffa_device_register() to receive ffa_partition_info so all
the data from the partition information received from the firmware can
be updated into the struct ffa_device before the calling device_register()
in ffa_device_register().
Fixes: e781858488 ("firmware: arm_ffa: Add initial FFA bus support for device enumeration")
Signed-off-by: Levi Yun <yeoreum.yun@arm.com>
Message-Id: <20241203143109.1030514-2-yeoreum.yun@arm.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Catalin reports that a hypervisor lying to a guest about the size
of the ASID field may result in unexpected issues:
- if the underlying HW does only supports 8 bit ASIDs, the ASID
field in a TLBI VAE1* operation is only 8 bits, and the HW will
ignore the other 8 bits
- if on the contrary the HW is 16 bit capable, the ASID field
in the same TLBI operation is always 16 bits, irrespective of
the value of TCR_ELx.AS.
This could lead to missed invalidations if the guest was lead to
assume that the HW had 8 bit ASIDs while they really are 16 bit wide.
In order to avoid any potential disaster that would be hard to debug,
prenent the migration between a host with 8 bit ASIDs to one with
wider ASIDs (the converse was obviously always forbidden). This is
also consistent with what we already do for VMIDs.
If it becomes absolutely mandatory to support such a migration path
in the future, we will have to trap and emulate all TLBIs, something
that nobody should look forward to.
Fixes: d5a32b60dc ("KVM: arm64: Allow userspace to change ID_AA64MMFR{0-2}_EL1")
Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20241203190236.505759-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
With the new __counted_by annotation in clk_hw_onecell_data, the "num"
struct member must be set before accessing the "hws" array. Failing to
do so will trigger a runtime warning when enabling CONFIG_UBSAN_BOUNDS
and CONFIG_FORTIFY_SOURCE.
Fixes: f316cdff8d ("clk: Annotate struct clk_hw_onecell_data with __counted_by")
Signed-off-by: Haoyu Li <lihaoyu499@gmail.com>
Link: https://lore.kernel.org/r/20241203142915.345523-1-lihaoyu499@gmail.com
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
The Documentation for EN7581 had a typo and still referenced the EN7523
BUS base source frequency. This was in conflict with a different page in
the Documentration that state that the BUS runs at 300MHz (600MHz source
with divisor set to 2) and the actual watchdog that tick at half the BUS
clock (150MHz). This was verified with the watchdog by timing the
seconds that the system takes to reboot (due too watchdog) and by
operating on different values of the BUS divisor.
The correct values for source of BUS clock are 600MHz and 540MHz.
This was also confirmed by Airoha.
Cc: stable@vger.kernel.org
Fixes: 66bc47326c ("clk: en7523: Add EN7581 support")
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
Link: https://lore.kernel.org/r/20241116105710.19748-1-ansuelsmth@gmail.com
Acked-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
When activating a swap file we acquire the root's snapshot drew lock and
then check if the root is dead, failing and returning with -EPERM if it's
dead but without unlocking the root's snapshot lock. Fix this by adding
the missing unlock.
Fixes: 60021bd754 ("btrfs: prevent subvol with swapfile from being deleted")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
The following reproducer can cause btrfs mount to fail:
dev="/dev/test/scratch1"
mnt1="/mnt/test"
mnt2="/mnt/scratch"
mkfs.btrfs -f $dev
mount $dev $mnt1
btrfs subvolume create $mnt1/subvol1
btrfs subvolume create $mnt1/subvol2
umount $mnt1
mount $dev $mnt1 -o subvol=subvol1
while mount -o remount,ro $mnt1; do mount -o remount,rw $mnt1; done &
bg=$!
while mount $dev $mnt2 -o subvol=subvol2; do umount $mnt2; done
kill $bg
wait
umount -R $mnt1
umount -R $mnt2
The script will fail with the following error:
mount: /mnt/scratch: /dev/mapper/test-scratch1 already mounted on /mnt/test.
dmesg(1) may have more information after failed mount system call.
umount: /mnt/test: target is busy.
umount: /mnt/scratch/: not mounted
And there is no kernel error message.
[CAUSE]
During the btrfs mount, to support mounting different subvolumes with
different RO/RW flags, we need to detect that and retry if needed:
Retry with matching RO flags if the initial mount fail with -EBUSY.
The problem is, during that retry we do not hold any super block lock
(s_umount), this means there can be a remount process changing the RO
flags of the original fs super block.
If so, we can have an EBUSY error during retry. And this time we treat
any failure as an error, without any retry and cause the above EBUSY
mount failure.
[FIX]
The current retry behavior is racy because we do not have a super block
thus no way to hold s_umount to prevent the race with remount.
Solve the root problem by allowing fc->sb_flags to mismatch from the
sb->s_flags at btrfs_get_tree_super().
Then at the re-entry point btrfs_get_tree_subvol(), manually check the
fc->s_flags against sb->s_flags, if it's a RO->RW mismatch, then
reconfigure with s_umount lock hold.
Reported-by: Enno Gotthold <egotthold@suse.com>
Reported-by: Fabian Vogt <fvogt@suse.com>
[ Special thanks for the reproducer and early analysis pointing to btrfs. ]
Fixes: f044b31867 ("btrfs: handle the ro->rw transition for mounting different subvolumes")
Link: https://bugzilla.suse.com/show_bug.cgi?id=1231836
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The newly added SCMI vendor driver references functions in the
protocol driver but needs a Kconfig dependency to ensure it can link,
essentially the Kconfig dependency needs to be reversed to match the
link time dependency:
| arm-linux-gnueabi-ld: sound/soc/fsl/fsl_mqs.o: in function `fsl_mqs_sm_write':
| fsl_mqs.c:(.text+0x1aa): undefined reference to `scmi_imx_misc_ctrl_set'
| arm-linux-gnueabi-ld: sound/soc/fsl/fsl_mqs.o: in function `fsl_mqs_sm_read':
| fsl_mqs.c:(.text+0x1ee): undefined reference to `scmi_imx_misc_ctrl_get'
This however only works after changing the dependency in the SND_SOC_FSL_MQS
driver as well, which uses 'select IMX_SCMI_MISC_DRV' to turn on a
driver it depends on. This is generally a bad idea, so the best solution
is to change that into a dependency.
To allow the ASoC driver to keep building with the SCMI support, this
needs to be an optional dependency that enforces the link-time
dependency if IMX_SCMI_MISC_DRV is a loadable module but not
depend on it if that is disabled.
Fixes: 61c9f03e22 ("firmware: arm_scmi: Add initial support for i.MX MISC protocol")
Fixes: 101c902359 ("ASoC: fsl_mqs: Support accessing registers by scmi interface")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Mark Brown <broonie@kernel.org>
Acked-by: Shengjiu Wang <shengjiu.wang@gmail.com>
Message-Id: <20241115230555.2435004-1-arnd@kernel.org>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Since adding support for opting out of virtual monitor support, a zero vif
addr was used to indicate passive vs active monitor to the driver.
This would break the vif->addr when changing the netdev mac address before
switching the interface from monitor to sta mode.
Fix the regression by adding a separate flag to indicate whether vif->addr
is valid.
Reported-by: syzbot+9ea265d998de25ac6a46@syzkaller.appspotmail.com
Fixes: 9d40f7e327 ("wifi: mac80211: add flag to opt out of virtual monitor support")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://patch.msgid.link/20241115115850.37449-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
If we got an unprotected action frame with CSA and then we heard the
beacon with the CSA IE, we'll block the queues with the CSA reason
twice. Since this reason is refcounted, we won't wake up the queues
since we wake them up only once and the ref count will never reach 0.
This led to blocked queues that prevented any activity (even
disconnection wouldn't reset the queue state and the only way to recover
would be to reload the kernel module.
Fix this by not refcounting the CSA reason.
It becomes now pointless to maintain the csa_blocked_queues state.
Remove it.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Fixes: 414e090bc4 ("wifi: mac80211: restrict public action ECSA frame handling")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219447
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20241119173108.5ea90828c2cc.I4f89e58572fb71ae48e47a81e74595cac410fbac@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Currently, during link deletion, the link ID is first removed from the
valid_links bitmap before performing any clean-up operations. However, some
functions require the link ID to remain in the valid_links bitmap. One
such example is cfg80211_cac_event(). The flow is -
nl80211_remove_link()
cfg80211_remove_link()
ieee80211_del_intf_link()
ieee80211_vif_set_links()
ieee80211_vif_update_links()
ieee80211_link_stop()
cfg80211_cac_event()
cfg80211_cac_event() requires link ID to be present but it is cleared
already in cfg80211_remove_link(). Ultimately, WARN_ON() is hit.
Therefore, clear the link ID from the bitmap only after completing the link
clean-up.
Signed-off-by: Aditya Kumar Singh <quic_adisi@quicinc.com>
Link: https://patch.msgid.link/20241121-mlo_dfs_fix-v2-1-92c3bf7ab551@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
With the new __counted_by annocation in cfg80211_mbssid_elems,
the "cnt" struct member must be set before accessing the "elem"
array. Failing to do so will trigger a runtime warning when enabling
CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE.
Fixes: c14679d700 ("wifi: cfg80211: Annotate struct cfg80211_mbssid_elems with __counted_by")
Signed-off-by: Haoyu Li <lihaoyu499@gmail.com>
Link: https://patch.msgid.link/20241123172500.311853-1-lihaoyu499@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
On 32-bit systems, the size of an unsigned long is 4 bytes,
while a u64 is 8 bytes. Therefore, when using
or_each_set_bit(bit, &bits, sizeof(changed) * BITS_PER_BYTE),
the code is incorrectly searching for a bit in a 32-bit
variable that is expected to be 64 bits in size,
leading to incorrect bit finding.
Solution: Ensure that the size of the bits variable is correctly
adjusted for each architecture.
Call Trace:
? show_regs+0x54/0x58
? __warn+0x6b/0xd4
? ieee80211_link_info_change_notify+0xcc/0xd4 [mac80211]
? report_bug+0x113/0x150
? exc_overflow+0x30/0x30
? handle_bug+0x27/0x44
? exc_invalid_op+0x18/0x50
? handle_exception+0xf6/0xf6
? exc_overflow+0x30/0x30
? ieee80211_link_info_change_notify+0xcc/0xd4 [mac80211]
? exc_overflow+0x30/0x30
? ieee80211_link_info_change_notify+0xcc/0xd4 [mac80211]
? ieee80211_mesh_work+0xff/0x260 [mac80211]
? cfg80211_wiphy_work+0x72/0x98 [cfg80211]
? process_one_work+0xf1/0x1fc
? worker_thread+0x2c0/0x3b4
? kthread+0xc7/0xf0
? mod_delayed_work_on+0x4c/0x4c
? kthread_complete_and_exit+0x14/0x14
? ret_from_fork+0x24/0x38
? kthread_complete_and_exit+0x14/0x14
? ret_from_fork_asm+0xf/0x14
? entry_INT80_32+0xf0/0xf0
Signed-off-by: Issam Hamdi <ih@simonwunderlich.de>
Link: https://patch.msgid.link/20241125162920.2711462-1-ih@simonwunderlich.de
[restore no-op path for no changes]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The audio subsystem of axg based platform is not probing anymore.
This is due to the introduction of RESET_MESON_AUX and the config
not being enabled with the default arm64 defconfig.
This brought another discussion around proper decoupling between
the clock and reset part. While this discussion gets sorted out,
revert back to the initial implementation.
This reverts
* commit 681ed497d6 ("clk: amlogic: axg-audio: fix Kconfig dependency on RESET_MESON_AUX")
* commit 664988eb47 ("clk: amlogic: axg-audio: use the auxiliary reset driver")
Both are reverted with single change to avoid creating more compilation
problems.
Fixes: 681ed497d6 ("clk: amlogic: axg-audio: fix Kconfig dependency on RESET_MESON_AUX")
Cc: Arnd Bergmann <arnd@arndb.de>
Reported-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Jerome Brunet <jbrunet@baylibre.com>
Link: https://lore.kernel.org/r/20241128-clk-audio-fix-rst-missing-v2-1-cf437d1a73da@baylibre.com
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
This reverts commit 25f1c96a0e.
The offending commit results in errors like
cpu cpu0: _opp_config_clk_single: failed to set clock rate: -22
spamming the logs on the Lenovo ThinkPad X13s and other Qualcomm
machines when cpufreq tries to update the CPUFreq HW Engine clocks.
As mentioned in commit 4370232c72 ("cpufreq: qcom-hw: Add CPU clock
provider support"):
[T]he frequency supplied by the driver is the actual frequency
that comes out of the EPSS/OSM block after the DCVS operation.
This frequency is not same as what the CPUFreq framework has set
but it is the one that gets supplied to the CPUs after
throttling by LMh.
which seems to suggest that the driver relies on the previous behaviour
of clk_set_rate().
Since this affects many Qualcomm machines, let's revert for now.
Fixes: 25f1c96a0e ("clk: Fix invalid execution of clk_set_rate")
Reported-by: Aishwarya TCV <aishwarya.tcv@arm.com>
Link: https://lore.kernel.org/all/e2d83e57-ad07-411b-99f6-a4fc3c4534fa@arm.com/
Cc: Chuan Liu <chuan.liu@amlogic.com>
Cc: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
Link: https://lore.kernel.org/r/20241202100621.29209-1-johan+linaro@kernel.org
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
This signal handler loops over all tests on ctrl-C, but it's active
while the test list is being constructed. process.pid is 0, then -1,
then finally set to the child pid on fork. If the Ctrl-C is received
during this point a kill(-1, SIGINT) can be sent which affects all
processes.
Make sure the child has forked first before forwarding the signal. This
can be reproduced with ctrl-C immediately after launching perf test
which terminates the ssh connection.
Fixes: 553d5efeb3 ("perf test: Add a signal handler to kill forked child processes")
Signed-off-by: James Clark <james.clark@linaro.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20241129151948.3199732-1-james.clark@linaro.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The build-id events written at the end of the record session are broken
due to unexpected data. The write_buildid() writes the fixed length
event first and then variable length filename.
But a recent change made it write more data in the padding area
accidentally. So readers of the event see zero-filled data for the
next entry and treat it incorrectly. This resulted in wrong kernel
symbols because the kernel DSO loaded a random vmlinux image in the
path as it didn't have a valid build-id.
Fixes: ae39ba1655 ("perf inject: Fix build ID injection")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/Z0aRFFW9xMh3mqKB@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
The PEBS kernel warnings can still be observed with the below case.
when the below commands are running in parallel for a while.
while true;
do
perf record --no-buildid -a --intr-regs=AX \
-e cpu/event=0xd0,umask=0x81/pp \
-c 10003 -o /dev/null ./triad;
done &
while true;
do
perf record -e 'cpu/mem-loads,ldlat=3/uP' -W -d -- ./dtlb
done
The commit b752ea0c28 ("perf/x86/intel/ds: Flush PEBS DS when changing
PEBS_DATA_CFG") intends to flush the entire PEBS buffer before the
hardware is reprogrammed. However, it fails in the above case.
The first perf command utilizes the large PEBS, while the second perf
command only utilizes a single PEBS. When the second perf event is
added, only the n_pebs++. The intel_pmu_pebs_enable() is invoked after
intel_pmu_pebs_add(). So the cpuc->n_pebs == cpuc->n_large_pebs check in
the intel_pmu_drain_large_pebs() fails. The PEBS DS is not flushed.
The new PEBS event should not be taken into account when flushing the
existing PEBS DS.
The check is unnecessary here. Before the hardware is reprogrammed, all
the stale records must be drained unconditionally.
For single PEBS or PEBS-vi-pt, the DS must be empty. The drain_pebs()
can handle the empty case. There is no harm to unconditionally drain the
PEBS DS.
Fixes: b752ea0c28 ("perf/x86/intel/ds: Flush PEBS DS when changing PEBS_DATA_CFG")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20241119135504.1463839-2-kan.liang@linux.intel.com
Anders had bisected a crash using PREEMPT_RT with linux-next and
isolated it down to commit 894d1b3db4 ("locking/mutex: Remove
wakeups from under mutex::wait_lock"), where it seemed the
wake_q structure was somehow getting corrupted causing a null
pointer traversal.
I was able to easily repoduce this with PREEMPT_RT and managed
to isolate down that through various call stacks we were
actually calling wake_up_q() twice on the same wake_q.
I found that in the problematic commit, I had added the
wake_up_q() call in task_blocks_on_rt_mutex() around
__ww_mutex_add_waiter(), following a similar pattern in
__mutex_lock_common().
However, its just wrong. We haven't dropped the lock->wait_lock,
so its contrary to the point of the original patch. And it
didn't match the __mutex_lock_common() logic of re-initializing
the wake_q after calling it midway in the stack.
Looking at it now, the wake_up_q() call is incorrect and should
just be removed. So drop the erronious logic I had added.
Fixes: 894d1b3db4 ("locking/mutex: Remove wakeups from under mutex::wait_lock")
Closes: https://lore.kernel.org/lkml/6afb936f-17c7-43fa-90e0-b9e780866097@app.fastmail.com/
Reported-by: Anders Roxell <anders.roxell@linaro.org>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Anders Roxell <anders.roxell@linaro.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20241114190051.552665-1-jstultz@google.com
When running the following command:
while true; do
stress-ng --cyclic 30 --timeout 30s --minimize --quiet
done
a warning is eventually triggered:
WARNING: CPU: 43 PID: 2848 at kernel/sched/deadline.c:794
setup_new_dl_entity+0x13e/0x180
...
Call Trace:
<TASK>
? show_trace_log_lvl+0x1c4/0x2df
? enqueue_dl_entity+0x631/0x6e0
? setup_new_dl_entity+0x13e/0x180
? __warn+0x7e/0xd0
? report_bug+0x11a/0x1a0
? handle_bug+0x3c/0x70
? exc_invalid_op+0x14/0x70
? asm_exc_invalid_op+0x16/0x20
enqueue_dl_entity+0x631/0x6e0
enqueue_task_dl+0x7d/0x120
__do_set_cpus_allowed+0xe3/0x280
__set_cpus_allowed_ptr_locked+0x140/0x1d0
__set_cpus_allowed_ptr+0x54/0xa0
migrate_enable+0x7e/0x150
rt_spin_unlock+0x1c/0x90
group_send_sig_info+0xf7/0x1a0
? kill_pid_info+0x1f/0x1d0
kill_pid_info+0x78/0x1d0
kill_proc_info+0x5b/0x110
__x64_sys_kill+0x93/0xc0
do_syscall_64+0x5c/0xf0
entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x7f0dab31f92b
This warning occurs because set_cpus_allowed dequeues and enqueues tasks
with the ENQUEUE_RESTORE flag set. If the task is boosted, the warning
is triggered. A boosted task already had its parameters set by
rt_mutex_setprio, and a new call to setup_new_dl_entity is unnecessary,
hence the WARN_ON call.
Check if we are requeueing a boosted task and avoid calling
setup_new_dl_entity if that's the case.
Fixes: 295d6d5e37 ("sched/deadline: Fix switching to -deadline")
Signed-off-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20240724142253.27145-2-wander@redhat.com
Scheduler raises a SCHED_SOFTIRQ to trigger a load balancing event on
from the IPI handler on the idle CPU. If the SMP function is invoked
from an idle CPU via flush_smp_call_function_queue() then the HARD-IRQ
flag is not set and raise_softirq_irqoff() needlessly wakes ksoftirqd
because soft interrupts are handled before ksoftirqd get on the CPU.
Adding a trace_printk() in nohz_csd_func() at the spot of raising
SCHED_SOFTIRQ and enabling trace events for sched_switch, sched_wakeup,
and softirq_entry (for SCHED_SOFTIRQ vector alone) helps observing the
current behavior:
<idle>-0 [000] dN.1.: nohz_csd_func: Raising SCHED_SOFTIRQ from nohz_csd_func
<idle>-0 [000] dN.4.: sched_wakeup: comm=ksoftirqd/0 pid=16 prio=120 target_cpu=000
<idle>-0 [000] .Ns1.: softirq_entry: vec=7 [action=SCHED]
<idle>-0 [000] .Ns1.: softirq_exit: vec=7 [action=SCHED]
<idle>-0 [000] d..2.: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/0 next_pid=16 next_prio=120
ksoftirqd/0-16 [000] d..2.: sched_switch: prev_comm=ksoftirqd/0 prev_pid=16 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
...
Use __raise_softirq_irqoff() to raise the softirq. The SMP function call
is always invoked on the requested CPU in an interrupt handler. It is
guaranteed that soft interrupts are handled at the end.
Following are the observations with the changes when enabling the same
set of events:
<idle>-0 [000] dN.1.: nohz_csd_func: Raising SCHED_SOFTIRQ for nohz_idle_balance
<idle>-0 [000] dN.1.: softirq_raise: vec=7 [action=SCHED]
<idle>-0 [000] .Ns1.: softirq_entry: vec=7 [action=SCHED]
No unnecessary ksoftirqd wakeups are seen from idle task's context to
service the softirq.
Fixes: b2a02fc43a ("smp: Optimize send_call_function_single_ipi()")
Closes: https://lore.kernel.org/lkml/fcf823f-195e-6c9a-eac3-25f870cb35ac@inria.fr/ [1]
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/20241119054432.6405-5-kprateek.nayak@amd.com
Commit b2a02fc43a ("smp: Optimize send_call_function_single_ipi()")
optimizes IPIs to idle CPUs in TIF_POLLING_NRFLAG mode by setting the
TIF_NEED_RESCHED flag in idle task's thread info and relying on
flush_smp_call_function_queue() in idle exit path to run the
call-function. A softirq raised by the call-function is handled shortly
after in do_softirq_post_smp_call_flush() but the TIF_NEED_RESCHED flag
remains set and is only cleared later when schedule_idle() calls
__schedule().
need_resched() check in _nohz_idle_balance() exists to bail out of load
balancing if another task has woken up on the CPU currently in-charge of
idle load balancing which is being processed in SCHED_SOFTIRQ context.
Since the optimization mentioned above overloads the interpretation of
TIF_NEED_RESCHED, check for idle_cpu() before going with the existing
need_resched() check which can catch a genuine task wakeup on an idle
CPU processing SCHED_SOFTIRQ from do_softirq_post_smp_call_flush(), as
well as the case where ksoftirqd needs to be preempted as a result of
new task wakeup or slice expiry.
In case of PREEMPT_RT or threadirqs, although the idle load balancing
may be inhibited in some cases on the ilb CPU, the fact that ksoftirqd
is the only fair task going back to sleep will trigger a newidle balance
on the CPU which will alleviate some imbalance if it exists if idle
balance fails to do so.
Fixes: b2a02fc43a ("smp: Optimize send_call_function_single_ipi()")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241119054432.6405-4-kprateek.nayak@amd.com
The need_resched() check currently in nohz_csd_func() can be tracked
to have been added in scheduler_ipi() back in 2011 via commit
ca38062e57 ("sched: Use resched IPI to kick off the nohz idle balance")
Since then, it has travelled quite a bit but it seems like an idle_cpu()
check currently is sufficient to detect the need to bail out from an
idle load balancing. To justify this removal, consider all the following
case where an idle load balancing could race with a task wakeup:
o Since commit f3dd3f6745 ("sched: Remove the limitation of WF_ON_CPU
on wakelist if wakee cpu is idle") a target perceived to be idle
(target_rq->nr_running == 0) will return true for
ttwu_queue_cond(target) which will offload the task wakeup to the idle
target via an IPI.
In all such cases target_rq->ttwu_pending will be set to 1 before
queuing the wake function.
If an idle load balance races here, following scenarios are possible:
- The CPU is not in TIF_POLLING_NRFLAG mode in which case an actual
IPI is sent to the CPU to wake it out of idle. If the
nohz_csd_func() queues before sched_ttwu_pending(), the idle load
balance will bail out since idle_cpu(target) returns 0 since
target_rq->ttwu_pending is 1. If the nohz_csd_func() is queued after
sched_ttwu_pending() it should see rq->nr_running to be non-zero and
bail out of idle load balancing.
- The CPU is in TIF_POLLING_NRFLAG mode and instead of an actual IPI,
the sender will simply set TIF_NEED_RESCHED for the target to put it
out of idle and flush_smp_call_function_queue() in do_idle() will
execute the call function. Depending on the ordering of the queuing
of nohz_csd_func() and sched_ttwu_pending(), the idle_cpu() check in
nohz_csd_func() should either see target_rq->ttwu_pending = 1 or
target_rq->nr_running to be non-zero if there is a genuine task
wakeup racing with the idle load balance kick.
o The waker CPU perceives the target CPU to be busy
(targer_rq->nr_running != 0) but the CPU is in fact going idle and due
to a series of unfortunate events, the system reaches a case where the
waker CPU decides to perform the wakeup by itself in ttwu_queue() on
the target CPU but target is concurrently selected for idle load
balance (XXX: Can this happen? I'm not sure, but we'll consider the
mother of all coincidences to estimate the worst case scenario).
ttwu_do_activate() calls enqueue_task() which would increment
"rq->nr_running" post which it calls wakeup_preempt() which is
responsible for setting TIF_NEED_RESCHED (via a resched IPI or by
setting TIF_NEED_RESCHED on a TIF_POLLING_NRFLAG idle CPU) The key
thing to note in this case is that rq->nr_running is already non-zero
in case of a wakeup before TIF_NEED_RESCHED is set which would
lead to idle_cpu() check returning false.
In all cases, it seems that need_resched() check is unnecessary when
checking for idle_cpu() first since an impending wakeup racing with idle
load balancer will either set the "rq->ttwu_pending" or indicate a newly
woken task via "rq->nr_running".
Chasing the reason why this check might have existed in the first place,
I came across Peter's suggestion on the fist iteration of Suresh's
patch from 2011 [1] where the condition to raise the SCHED_SOFTIRQ was:
sched_ttwu_do_pending(list);
if (unlikely((rq->idle == current) &&
rq->nohz_balance_kick &&
!need_resched()))
raise_softirq_irqoff(SCHED_SOFTIRQ);
Since the condition to raise the SCHED_SOFIRQ was preceded by
sched_ttwu_do_pending() (which is equivalent of sched_ttwu_pending()) in
the current upstream kernel, the need_resched() check was necessary to
catch a newly queued task. Peter suggested modifying it to:
if (idle_cpu() && rq->nohz_balance_kick && !need_resched())
raise_softirq_irqoff(SCHED_SOFTIRQ);
where idle_cpu() seems to have replaced "rq->idle == current" check.
Even back then, the idle_cpu() check would have been sufficient to catch
a new task being enqueued. Since commit b2a02fc43a ("smp: Optimize
send_call_function_single_ipi()") overloads the interpretation of
TIF_NEED_RESCHED for TIF_POLLING_NRFLAG idling, remove the
need_resched() check in nohz_csd_func() to raise SCHED_SOFTIRQ based
on Peter's suggestion.
Fixes: b2a02fc43a ("smp: Optimize send_call_function_single_ipi()")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241119054432.6405-3-kprateek.nayak@amd.com
do_softirq_post_smp_call_flush() on PREEMPT_RT kernels carries a
WARN_ON_ONCE() for any SOFTIRQ being raised from an SMP-call-function.
Since do_softirq_post_smp_call_flush() is called with preempt disabled,
raising a SOFTIRQ during flush_smp_call_function_queue() can lead to
longer preempt disabled sections.
Since commit b2a02fc43a ("smp: Optimize
send_call_function_single_ipi()") IPIs to an idle CPU in
TIF_POLLING_NRFLAG mode can be optimized out by instead setting
TIF_NEED_RESCHED bit in idle task's thread_info and relying on the
flush_smp_call_function_queue() in the idle-exit path to run the
SMP-call-function.
To trigger an idle load balancing, the scheduler queues
nohz_csd_function() responsible for triggering an idle load balancing on
a target nohz idle CPU and sends an IPI. Only now, this IPI is optimized
out and the SMP-call-function is executed from
flush_smp_call_function_queue() in do_idle() which can raise a
SCHED_SOFTIRQ to trigger the balancing.
So far, this went undetected since, the need_resched() check in
nohz_csd_function() would make it bail out of idle load balancing early
as the idle thread does not clear TIF_POLLING_NRFLAG before calling
flush_smp_call_function_queue(). The need_resched() check was added with
the intent to catch a new task wakeup, however, it has recently
discovered to be unnecessary and will be removed in the subsequent
commit after which nohz_csd_function() can raise a SCHED_SOFTIRQ from
flush_smp_call_function_queue() to trigger an idle load balance on an
idle target in TIF_POLLING_NRFLAG mode.
nohz_csd_function() bails out early if "idle_cpu()" check for the
target CPU, and does not lock the target CPU's rq until the very end,
once it has found tasks to run on the CPU and will not inhibit the
wakeup of, or running of a newly woken up higher priority task. Account
for this and prevent a WARN_ON_ONCE() when SCHED_SOFTIRQ is raised from
flush_smp_call_function_queue().
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241119054432.6405-2-kprateek.nayak@amd.com
Commit 8f9ea86fdf added some logic to sched_setaffinity that included
a WARN when a per-task affinity assignment races with a cpuset update.
Specifically, we can have a race where a cpuset update results in the
task affinity no longer being a subset of the cpuset. That's fine; we
have a fallback to instead use the cpuset mask. However, we have a WARN
set up that will trigger if the cpuset mask has no overlap at all with
the requested task affinity. This shouldn't be a warning condition; its
trivial to create this condition.
Reproduced the warning by the following setup:
- $PID inside a cpuset cgroup
- another thread repeatedly switching the cpuset cpus from 1-2 to just 1
- another thread repeatedly setting the $PID affinity (via taskset) to 2
Fixes: 8f9ea86fdf ("sched: Always preserve the user requested cpumask")
Signed-off-by: Josh Don <joshdon@google.com>
Acked-and-tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Link: https://lkml.kernel.org/r/20241111182738.1832953-1-joshdon@google.com
The G.a revision of the ARM ARM had it pretty clear that HCR_EL2.FWB
had no influence on "The way that stage 1 memory types and attributes
are combined with stage 2 Device type and attributes." (D5.5.5).
However, this wording was lost in further revisions of the architecture.
Restore the intended behaviour, which is to take the strongest memory
type of S1 and S2 in this case, as if FWB was 0. The specification is
being fixed accordingly.
Fixes: be04cebf3e ("KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W}")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20241125094756.609590-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
The System Controller Management Interface firmware (SCMI FW) is
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.