Commit Graph

77916 Commits

Author SHA1 Message Date
wangfe
e7cd191f83 xfrm: add SA information to the offloaded packet
In packet offload mode, append Security Association (SA) information
to each packet, replicating the crypto offload implementation.
The XFRM_XMIT flag is set to enable packet to be returned immediately
from the validate_xmit_xfrm function, thus aligning with the existing
code path for packet offload mode.

This SA info helps HW offload match packets to their correct security
policies. The XFRM interface ID is included, which is crucial in setups
with multiple XFRM interfaces where source/destination addresses alone
can't pinpoint the right policy.

Signed-off-by: wangfe <wangfe@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-08-27 11:28:58 +02:00
Florian Westphal
a54ad727f7 xfrm: policy: remove remaining use of inexact list
No consumers anymore, remove it.  After this, insertion of policies
no longer require list walk of all inexact policies but only those
that are reachable via the candidate sets.

This gives almost linear insertion speeds provided the inserted
policies are for non-overlapping networks.

Before:
Inserted 1000   policies in 70 ms
Inserted 10000  policies in 1155 ms
Inserted 100000 policies in 216848 ms

After:
Inserted 1000   policies in 56 ms
Inserted 10000  policies in 478 ms
Inserted 100000 policies in 4580 ms

Insertion of 1m entries takes about ~40s after this change
on my test vm.

Cc: Noel Kuntze <noel@familie-kuntze.de>
Cc: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-08-24 09:57:55 +02:00
Florian Westphal
563d5ca93e xfrm: switch migrate to xfrm_policy_lookup_bytype
XFRM_MIGRATE still uses the old lookup method:
first check the bydst hash table, then search the list of all the other
policies.

Switch MIGRATE to use the same lookup function as the packetpath.

This is done to remove the last remaining users of the pernet
xfrm.policy_inexact lists with the intent of removing this list.

After this patch, policies are still added to the list on insertion
and they are rehashed as-needed but no single API makes use of these
anymore.

This change is compile tested only.

Cc: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-08-24 09:56:09 +02:00
Florian Westphal
33f611cf7d xfrm: policy: don't iterate inexact policies twice at insert time
Since commit
6be3b0db6d ("xfrm: policy: add inexact policy search tree infrastructure")
policy lookup no longer walks a list but has a set of candidate lists.

This set has to be searched for the best match.
In case there are several matches, the priority wins.

If the priority is also the same, then the historic behaviour with
a single list was to return the first match (first-in-list).

With introduction of serval lists, this doesn't work and a new
'pos' member was added that reflects the xfrm_policy structs position
in the list.

This value is not exported to userspace and it does not need to be
the 'position in the list', it just needs to make sure that
a->pos < b->pos means that a was added to the lists more recently
than b.

This re-walk is expensive when many inexact policies are in use.

Speed this up: when appending the policy to the end of the walker list,
then just take the ->pos value of the last entry made and add 1.

Add a slowpath version to prevent overflow, if we'd assign UINT_MAX
then iterate the entire list and fix the ordering.

While this speeds up insertion considerably finding the insertion spot
in the inexact list still requires a partial list walk.

This is addressed in followup patches.

Before:
./xfrm_policy_add_speed.sh
Inserted 1000   policies in 72 ms
Inserted 10000  policies in 1540 ms
Inserted 100000 policies in 334780 ms

After:
Inserted 1000   policies in 68 ms
Inserted 10000  policies in 1137 ms
Inserted 100000 policies in 157307 ms

Reported-by: Noel Kuntze <noel@familie-kuntze.de>
Cc: Tobias Brunner <tobias@strongswan.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-08-24 09:55:36 +02:00
Christian Hopps
6ad8bc92a4 net: add copy from skb_seq_state to buffer function
Add an skb helper function to copy a range of bytes from within
an existing skb_seq_state.

Signed-off-by: Christian Hopps <chopps@labn.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-08-20 08:11:48 +02:00
Patrisious Haddad
9b49f55838 xfrm: Remove documentation WARN_ON to limit return values for offloaded SA
The original idea to put WARN_ON() on return value from driver code was
to make sure that packet offload doesn't have silent fallback to
SW implementation, like crypto offload has.

In reality, this is not needed as all *swan implementations followed
this request and used explicit configuration style to make sure that
"users will get what they ask".
So instead of forcing drivers to make sure that even their internal flows
don't return -EOPNOTSUPP, let's remove this WARN_ON.

Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-08-16 08:17:05 +02:00
Patrick Rohr
990c304930 Add support for PIO p flag
draft-ietf-6man-pio-pflag is adding a new flag to the Prefix Information
Option to signal that the network can allocate a unique IPv6 prefix per
client via DHCPv6-PD (see draft-ietf-v6ops-dhcp-pd-per-device).

When ra_honor_pio_pflag is enabled, the presence of a P-flag causes
SLAAC autoconfiguration to be disabled for that particular PIO.

An automated test has been added in Android (r.android.com/3195335) to
go along with this change.

Cc: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: David Lamparter <equinox@opensourcerouting.org>
Cc: Simon Horman <horms@kernel.org>
Signed-off-by: Patrick Rohr <prohr@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-31 13:49:48 +01:00
Zhengchao Shao
0908503ade net/smc: remove unused input parameters in smcr_new_buf_create
The input parameter "is_rmb" of the smcr_new_buf_create function
has never been used, remove it.

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-31 11:48:58 +01:00
Zhengchao Shao
d37307eaac net/smc: remove redundant code in smc_connect_check_aclc
When the SMC client perform CLC handshake, it will check whether
the clc header type is correct in receiving SMC_CLC_ACCEPT packet.
The specific invoking path is as follows:
__smc_connect
  smc_connect_clc
    smc_clc_wait_msg
      smc_clc_msg_hdr_valid
        smc_clc_msg_acc_conf_valid
Therefore, the smc_connect_check_aclc interface invoked by
__smc_connect does not need to check type again.

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-31 11:48:58 +01:00
Zhengchao Shao
5a79575711 net/smc: remove the fallback in __smc_connect
When the SMC client begins to connect to server, smcd_version is set
to SMC_V1 + SMC_V2. If fail to get VLAN ID, only SMC_V2 information
is left in smcd_version. And smcd_version will not be changed to 0.
Therefore, remove the fallback caused by the failure to get VLAN ID.

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-31 11:48:58 +01:00
Zhengchao Shao
1018825a95 net/smc: remove unreferenced header in smc_loopback.h file
Because linux/err.h is unreferenced in smc_loopback.h file, so
remove it.

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-31 11:48:58 +01:00
James Chapman
5dfa598b24 l2tp: use pre_exit pernet hook to avoid rcu_barrier
Move the work of closing all tunnels from the pernet exit hook to
pre_exit since the core does rcu synchronisation between these steps
and we can therefore remove rcu_barrier from l2tp code.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-31 09:25:14 +01:00
James Chapman
d93b8a63f0 l2tp: cleanup eth/ppp pseudowire setup code
l2tp eth/ppp pseudowire setup/cleanup uses kfree() in some error
paths. Drop the refcount instead such that the session object is
always freed when the refcount reaches 0.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-31 09:25:13 +01:00
James Chapman
0aa45570c3 l2tp: add idr consistency check in session_register
l2tp_session_register uses an idr_alloc then idr_replace pattern to
insert sessions into the session IDR. To catch invalid locking, add a
WARN_ON_ONCE if the IDR entry is modified by another thread between
alloc and replace steps.

Also add comments to make expectations clear.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-31 09:25:13 +01:00
James Chapman
89b768ec2d l2tp: use rcu list add/del when updating lists
l2tp_v3_session_htable and tunnel->session_list are read by lockless
getters using RCU. Use rcu list variants when adding or removing list
items.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-31 09:25:13 +01:00
James Chapman
24256415d1 l2tp: prevent possible tunnel refcount underflow
When a session is created, it sets a backpointer to its tunnel. When
the session refcount drops to 0, l2tp_session_free drops the tunnel
refcount if session->tunnel is non-NULL. However, session->tunnel is
set in l2tp_session_create, before the tunnel refcount is incremented
by l2tp_session_register, which leaves a small window where
session->tunnel is non-NULL when the tunnel refcount hasn't been
bumped.

Moving the assignment to l2tp_session_register is trivial but
l2tp_session_create calls l2tp_session_set_header_len which uses
session->tunnel to get the tunnel's encap. Add an encap arg to
l2tp_session_set_header_len to avoid using session->tunnel.

If l2tpv3 sessions have colliding IDs, it is possible for
l2tp_v3_session_get to race with l2tp_session_register and fetch a
session which doesn't yet have session->tunnel set. Add a check for
this case.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-31 09:25:13 +01:00
James Chapman
c5cbaef992 l2tp: refactor ppp socket/session relationship
Each l2tp ppp session has an associated pppox socket. l2tp_ppp uses
the session's pppox socket refcount to manage session lifetimes; the
pppox socket holds a ref on the session which is dropped by the socket
destructor. This complicates session cleanup.

Given l2tp sessions are refcounted, it makes more sense to reverse
this relationship such that the session keeps the socket alive, not
the other way around. So refactor l2tp_ppp to have the session hold a
ref on its socket while it references it. When the session is closed,
it drops its socket ref when it detaches from its socket. If the
socket is closed first, it initiates the closing of its session, if
one is attached. The socket/session can then be freed asynchronously
when their refcounts drop to 0.

Use the session's session_close callback to detach the pppox socket
since this will be done on the work queue together with the rest of
the session cleanup via l2tp_session_delete.

Also, since l2tp_ppp uses the pppox socket's sk_user_data, use the rcu
sk_user_data access helpers when accessing it and set the socket's
SOCK_RCU_FREE flag to have pppox sockets freed by rcu.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-31 09:25:13 +01:00
James Chapman
d17e899995 l2tp: free sessions using rcu
l2tp sessions may be accessed under an rcu read lock. Have them freed
via rcu and remove the now unneeded synchronize_rcu when a session is
removed.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-31 09:25:13 +01:00
James Chapman
fc7ec7f554 l2tp: delete sessions using work queue
When a tunnel is closed, l2tp_tunnel_closeall closes all sessions in
the tunnel. Move the work of deleting each session to the work queue
so that sessions are deleted using the same codepath whether they are
closed by user API request or their parent tunnel is closing. This
also avoids the locking dance in l2tp_tunnel_closeall where the
tunnel's session list lock was unlocked and relocked in the loop.

In l2tp_exit_net, use drain_workqueue instead of flush_workqueue
because the processing of tunnel_delete work may queue session_delete
work items which must also be processed.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-31 09:25:13 +01:00
James Chapman
29717a4fb7 l2tp: simplify tunnel and socket cleanup
When the l2tp tunnel socket used sk_user_data to point to its
associated l2tp tunnel, socket and tunnel cleanup had to make use of
the socket's destructor to free the tunnel only when the socket could
no longer be accessed.

Now that sk_user_data is no longer used, we can simplify socket and
tunnel cleanup:

  * If the tunnel closes first, it cleans up and drops its socket ref
    when the tunnel refcount drops to zero. If its socket was provided
    by userspace, the socket is closed and freed asynchronously, when
    userspace closes it. If its socket is a kernel socket, the tunnel
    closes the socket itself during cleanup and drops its socket ref
    when the tunnel's refcount drops to zero.

  * If the socket closes first, we initiate the closing of its
    associated tunnel. For UDP sockets, this is via the socket's
    encap_destroy hook. For L2TPIP sockets, this is via the socket's
    destroy callback. The tunnel holds a socket ref while it
    references the sock. When the tunnel is freed, it drops its socket
    ref and the socket will be cleaned up when its own refcount drops
    to zero, asynchronous to the tunnel free.

  * The tunnel socket destructor is no longer needed since the tunnel
    is no longer freed through the socket destructor.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-31 09:25:13 +01:00
James Chapman
0fa51a7c6f l2tp: remove unused tunnel magic field
Since l2tp no longer derives tunnel pointers directly via
sk_user_data, it is no longer useful for l2tp to check tunnel pointers
using a magic feather. Drop the tunnel's magic field.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-31 09:25:13 +01:00
James Chapman
4a4cd70369 l2tp: don't set sk_user_data in tunnel socket
l2tp no longer uses the tunnel socket's sk_user_data so drop the code
which sets it.

In l2tp_validate_socket use l2tp_sk_to_tunnel to check whether a given
socket is already attached to an l2tp tunnel since we can no longer
use non-null sk_user_data to indicate this.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-31 09:25:13 +01:00
James Chapman
eeb11209e0 l2tp: don't use tunnel socket sk_user_data in ppp procfs output
l2tp's ppp procfs output can be used to show internal state of
pppol2tp. It includes a 'user-data-ok' field, which is derived from
the tunnel socket's sk_user_data being non-NULL. Use tunnel->sock
being non-NULL to indicate this instead.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-31 09:25:12 +01:00
James Chapman
ed8ebee6de l2tp: have l2tp_ip_destroy_sock use ip_flush_pending_frames
Use the recently exported ip_flush_pending_frames instead of a
free-coded version and lock the socket while we call it.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-31 09:25:12 +01:00
James Chapman
4ff8863419 ipv4: export ip_flush_pending_frames
To avoid protocol modules implementing their own, export
ip_flush_pending_frames.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-31 09:25:12 +01:00
James Chapman
2e7a280692 l2tp: lookup tunnel from socket without using sk_user_data
l2tp_sk_to_tunnel derives the tunnel from sk_user_data. Instead,
lookup the tunnel by walking the tunnel IDR for a tunnel using the
indicated sock. This is slow but l2tp_sk_to_tunnel is not used in
the datapath so performance isn't critical.

l2tp_tunnel_destruct needs a variant of l2tp_sk_to_tunnel which does
not bump the tunnel refcount since the tunnel refcount is already 0.

Change l2tp_sk_to_tunnel sk arg to const since it does not modify sk.

Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-31 09:25:12 +01:00
Dr. David Alan Gilbert
0a658d088c net/tcp: Expand goo.gl link
The goo.gl URL shortener is deprecated and is due to stop
expanding existing links in 2025.

Expand the link in Kconfig.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240729205337.48058-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-30 18:35:12 -07:00
Linus Torvalds
1722389b0d Merge tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf and netfilter.

  A lot of networking people were at a conference last week, busy
  catching COVID, so relatively short PR.

  Current release - regressions:

   - tcp: process the 3rd ACK with sk_socket for TFO and MPTCP

  Current release - new code bugs:

   - l2tp: protect session IDR and tunnel session list with one lock,
     make sure the state is coherent to avoid a warning

   - eth: bnxt_en: update xdp_rxq_info in queue restart logic

   - eth: airoha: fix location of the MBI_RX_AGE_SEL_MASK field

  Previous releases - regressions:

   - xsk: require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len,
     the field reuses previously un-validated pad

  Previous releases - always broken:

   - tap/tun: drop short frames to prevent crashes later in the stack

   - eth: ice: add a per-VF limit on number of FDIR filters

   - af_unix: disable MSG_OOB handling for sockets in sockmap/sockhash"

* tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (34 commits)
  tun: add missing verification for short frame
  tap: add missing verification for short frame
  mISDN: Fix a use after free in hfcmulti_tx()
  gve: Fix an edge case for TSO skb validity check
  bnxt_en: update xdp_rxq_info in queue restart logic
  tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
  selftests/bpf: Add XDP_UMEM_TX_METADATA_LEN to XSK TX metadata test
  xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len
  bpf: Fix a segment issue when downgrading gso_size
  net: mediatek: Fix potential NULL pointer dereference in dummy net_device handling
  MAINTAINERS: make Breno the netconsole maintainer
  MAINTAINERS: Update bonding entry
  net: nexthop: Initialize all fields in dumped nexthops
  net: stmmac: Correct byte order of perfect_match
  selftests: forwarding: skip if kernel not support setting bridge fdb learning limit
  tipc: Return non-zero value from tipc_udp_addr2str() on error
  netfilter: nft_set_pipapo_avx2: disable softinterrupts
  ice: Fix recipe read procedure
  ice: Add a per-VF limit on number of FDIR filters
  net: bonding: correctly annotate RCU in bond_should_notify_peers()
  ...
2024-07-25 13:32:25 -07:00
Linus Torvalds
b485625078 Merge tag 'constfy-sysctl-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl constification from Joel Granados:
 "Treewide constification of the ctl_table argument of proc_handlers
  using a coccinelle script and some manual code formatting fixups.

  This is a prerequisite to moving the static ctl_table structs into
  read-only data section which will ensure that proc_handler function
  pointers cannot be modified"

* tag 'constfy-sysctl-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
  sysctl: treewide: constify the ctl_table argument of proc_handlers
2024-07-25 12:58:36 -07:00
Linus Torvalds
c2a96b7f18 Merge tag 'driver-core-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
 "Here is the big set of driver core changes for 6.11-rc1.

  Lots of stuff in here, with not a huge diffstat, but apis are evolving
  which required lots of files to be touched. Highlights of the changes
  in here are:

   - platform remove callback api final fixups (Uwe took many releases
     to get here, finally!)

   - Rust bindings for basic firmware apis and initial driver-core
     interactions.

     It's not all that useful for a "write a whole driver in rust" type
     of thing, but the firmware bindings do help out the phy rust
     drivers, and the driver core bindings give a solid base on which
     others can start their work.

     There is still a long way to go here before we have a multitude of
     rust drivers being added, but it's a great first step.

   - driver core const api changes.

     This reached across all bus types, and there are some fix-ups for
     some not-common bus types that linux-next and 0-day testing shook
     out.

     This work is being done to help make the rust bindings more safe,
     as well as the C code, moving toward the end-goal of allowing us to
     put driver structures into read-only memory. We aren't there yet,
     but are getting closer.

   - minor devres cleanups and fixes found by code inspection

   - arch_topology minor changes

   - other minor driver core cleanups

  All of these have been in linux-next for a very long time with no
  reported problems"

* tag 'driver-core-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (55 commits)
  ARM: sa1100: make match function take a const pointer
  sysfs/cpu: Make crash_hotplug attribute world-readable
  dio: Have dio_bus_match() callback take a const *
  zorro: make match function take a const pointer
  driver core: module: make module_[add|remove]_driver take a const *
  driver core: make driver_find_device() take a const *
  driver core: make driver_[create|remove]_file take a const *
  firmware_loader: fix soundness issue in `request_internal`
  firmware_loader: annotate doctests as `no_run`
  devres: Correct code style for functions that return a pointer type
  devres: Initialize an uninitialized struct member
  devres: Fix memory leakage caused by driver API devm_free_percpu()
  devres: Fix devm_krealloc() wasting memory
  driver core: platform: Switch to use kmemdup_array()
  driver core: have match() callback in struct bus_type take a const *
  MAINTAINERS: add Rust device abstractions to DRIVER CORE
  device: rust: improve safety comments
  MAINTAINERS: add Danilo as FIRMWARE LOADER maintainer
  MAINTAINERS: add Rust FW abstractions to FIRMWARE LOADER
  firmware: rust: improve safety comments
  ...
2024-07-25 10:42:22 -07:00
Jakub Kicinski
f7578df913 Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2024-07-25

We've added 14 non-merge commits during the last 8 day(s) which contain
a total of 19 files changed, 177 insertions(+), 70 deletions(-).

The main changes are:

1) Fix af_unix to disable MSG_OOB handling for sockets in BPF sockmap and
   BPF sockhash. Also add test coverage for this case, from Michal Luczaj.

2) Fix a segmentation issue when downgrading gso_size in the BPF helper
   bpf_skb_adjust_room(), from Fred Li.

3) Fix a compiler warning in resolve_btfids due to a missing type cast,
   from Liwei Song.

4) Fix stack allocation for arm64 to align the stack pointer at a 16 byte
   boundary in the fexit_sleep BPF selftest, from Puranjay Mohan.

5) Fix a xsk regression to require a flag when actuating tx_metadata_len,
   from Stanislav Fomichev.

6) Fix function prototype BTF dumping in libbpf for prototypes that have
   no input arguments, from Andrii Nakryiko.

7) Fix stacktrace symbol resolution in perf script for BPF programs
   containing subprograms, from Hou Tao.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Add XDP_UMEM_TX_METADATA_LEN to XSK TX metadata test
  xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len
  bpf: Fix a segment issue when downgrading gso_size
  tools/resolve_btfids: Fix comparison of distinct pointer types warning in resolve_btfids
  bpf, events: Use prog to emit ksymbol event for main program
  selftests/bpf: Test sockmap redirect for AF_UNIX MSG_OOB
  selftests/bpf: Parametrize AF_UNIX redir functions to accept send() flags
  selftests/bpf: Support SOCK_STREAM in unix_inet_redir_to_connected()
  af_unix: Disable MSG_OOB handling for sockets in sockmap/sockhash
  bpftool: Fix typo in usage help
  libbpf: Fix no-args func prototype BTF dumping syntax
  MAINTAINERS: Update powerpc BPF JIT maintainers
  MAINTAINERS: Update email address of Naveen
  selftests/bpf: fexit_sleep: Fix stack allocation for arm64
====================

Link: https://patch.msgid.link/20240725114312.32197-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-25 07:40:25 -07:00
Matthieu Baerts (NGI0)
c166829268 tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
The 'Fixes' commit recently changed the behaviour of TCP by skipping the
processing of the 3rd ACK when a sk->sk_socket is set. The goal was to
skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an
unnecessary ACK in case of simultaneous connect(). Unfortunately, that
had an impact on TFO and MPTCP.

I started to look at the impact on MPTCP, because the MPTCP CI found
some issues with the MPTCP Packetdrill tests [1]. Then Paolo Abeni
suggested me to look at the impact on TFO with "plain" TCP.

For MPTCP, when receiving the 3rd ACK of a request adding a new path
(MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that
has been created when the MPTCP connection got established before with
the first path. The newly added 'goto' will then skip the processing of
the segment text (step 7) and not go through tcp_data_queue() where the
MPTCP options are validated, and some actions are triggered, e.g.
sending the MPJ 4th ACK [2] as demonstrated by the new errors when
running a packetdrill test [3] establishing a second subflow.

This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
delayed. Still, we don't want to have this behaviour as it delays the
switch to the fully established mode, and invalid MPTCP options in this
3rd ACK will not be caught any more. This modification also affects the
MPTCP + TFO feature as well, and being the reason why the selftests
started to be unstable the last few days [4].

For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer
passing: if the 3rd ACK contains data, and the connection is accept()ed
before receiving them, these data would no longer be processed, and thus
not ACKed.

One last thing about MPTCP, in case of simultaneous connect(), a
fallback to TCP will be done, which seems fine:

  `../common/defaults.sh`

   0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3
  +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)

  +0 > S  0:0(0)                 <mss 1460, sackOK, TS val 100 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
  +0 < S  0:0(0) win 1000        <mss 1460, sackOK, TS val 407 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
  +0 > S. 0:0(0) ack 1           <mss 1460, sackOK, TS val 330 ecr 0,   nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
  +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]>
  +0 >  . 1:1(0) ack 1           <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1>

Simultaneous SYN-data crossing is also not supported by TFO, see [6].

Kuniyuki Iwashima suggested to restrict the processing to SYN+ACK only:
that's a more generic solution than the one initially proposed, and
also enough to fix the issues described above.

Later on, Eric Dumazet mentioned that an ACK should still be sent in
reaction to the second SYN+ACK that is received: not sending a DUPACK
here seems wrong and could hurt:

   0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
  +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)

  +0 > S  0:0(0)                <mss 1460, sackOK, TS val 1000 ecr 0,nop,wscale 8>
  +0 < S  0:0(0)       win 1000 <mss 1000, sackOK, nop, nop>
  +0 > S. 0:0(0) ack 1          <mss 1460, sackOK, TS val 3308134035 ecr 0,nop,wscale 8>
  +0 < S. 0:0(0) ack 1 win 1000 <mss 1000, sackOK, nop, nop>
  +0 >  . 1:1(0) ack 1          <nop, nop, sack 0:1>  // <== Here

So in this version, the 'goto consume' is dropped, to always send an ACK
when switching from TCP_SYN_RECV to TCP_ESTABLISHED. This ACK will be
seen as a DUPACK -- with DSACK if SACK has been negotiated -- in case of
simultaneous SYN crossing: that's what is expected here.

Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1]
Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2]
Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3]
Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4]
Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5]
Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6]
Fixes: 23e89e8ee7 ("tcp: Don't drop SYN+ACK for simultaneous connect().")
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240724-upstream-net-next-20240716-tcp-3rd-ack-consume-sk_socket-v3-1-d48339764ce9@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-25 12:58:19 +02:00
Stanislav Fomichev
d5e726d914 xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len
Julian reports that commit 341ac980ea ("xsk: Support tx_metadata_len")
can break existing use cases which don't zero-initialize xdp_umem_reg
padding. Introduce new XDP_UMEM_TX_METADATA_LEN to make sure we
interpret the padding as tx_metadata_len only when being explicitly
asked.

Fixes: 341ac980ea ("xsk: Support tx_metadata_len")
Reported-by: Julian Schindel <mail@arctic-alpaca.de>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/bpf/20240713015253.121248-2-sdf@fomichev.me
2024-07-25 11:57:27 +02:00
Fred Li
fa5ef65561 bpf: Fix a segment issue when downgrading gso_size
Linearize the skb when downgrading gso_size because it may trigger a
BUG_ON() later when the skb is segmented as described in [1,2].

Fixes: 2be7e212d5 ("bpf: add bpf_skb_adjust_room helper")
Signed-off-by: Fred Li <dracodingfly@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/all/20240626065555.35460-2-dracodingfly@gmail.com [1]
Link: https://lore.kernel.org/all/668d5cf1ec330_1c18c32947@willemb.c.googlers.com.notmuch [2]
Link: https://lore.kernel.org/bpf/20240719024653.77006-1-dracodingfly@gmail.com
2024-07-25 11:50:14 +02:00
Paolo Abeni
e6d08d7ecf Merge tag 'nf-24-07-24' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains a Netfilter fix for net:

Patch #1 if FPU is busy, then pipapo set backend falls back to standard
         set element lookup. Moreover, disable bh while at this.
	 From Florian Westphal.

netfilter pull request 24-07-24

* tag 'nf-24-07-24' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nft_set_pipapo_avx2: disable softinterrupts
====================

Link: https://patch.msgid.link/20240724081305.3152-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-25 11:17:21 +02:00
Joel Granados
78eb4ea25c sysctl: treewide: constify the ctl_table argument of proc_handlers
const qualify the struct ctl_table argument in the proc_handler function
signatures. This is a prerequisite to moving the static ctl_table
structs into .rodata data which will ensure that proc_handler function
pointers cannot be modified.

This patch has been generated by the following coccinelle script:

```
  virtual patch

  @r1@
  identifier ctl, write, buffer, lenp, ppos;
  identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)";
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

  @r2@
  identifier func, ctl, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int write, void *buffer, size_t *lenp, loff_t *ppos)
  { ... }

  @r3@
  identifier func;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int , void *, size_t *, loff_t *);

  @r4@
  identifier func, ctl;
  @@

  int func(
  - struct ctl_table *ctl
  + const struct ctl_table *ctl
    ,int , void *, size_t *, loff_t *);

  @r5@
  identifier func, write, buffer, lenp, ppos;
  @@

  int func(
  - struct ctl_table *
  + const struct ctl_table *
    ,int write, void *buffer, size_t *lenp, loff_t *ppos);

```

* Code formatting was adjusted in xfs_sysctl.c to comply with code
  conventions. The xfs_stats_clear_proc_handler,
  xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where
  adjusted.

* The ctl_table argument in proc_watchdog_common was const qualified.
  This is called from a proc_handler itself and is calling back into
  another proc_handler, making it necessary to change it as part of the
  proc_handler migration.

Co-developed-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Co-developed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-07-24 20:59:29 +02:00
Petr Machata
6d745cd0e9 net: nexthop: Initialize all fields in dumped nexthops
struct nexthop_grp contains two reserved fields that are not initialized by
nla_put_nh_group(), and carry garbage. This can be observed e.g. with
strace (edited for clarity):

    # ip nexthop add id 1 dev lo
    # ip nexthop add id 101 group 1
    # strace -e recvmsg ip nexthop get id 101
    ...
    recvmsg(... [{nla_len=12, nla_type=NHA_GROUP},
                 [{id=1, weight=0, resvd1=0x69, resvd2=0x67}]] ...) = 52

The fields are reserved and therefore not currently used. But as they are, they
leak kernel memory, and the fact they are not just zero complicates repurposing
of the fields for new ends. Initialize the full structure.

Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-24 15:13:43 +01:00
Shigeru Yoshida
fa96c6baef tipc: Return non-zero value from tipc_udp_addr2str() on error
tipc_udp_addr2str() should return non-zero value if the UDP media
address is invalid. Otherwise, a buffer overflow access can occur in
tipc_media_addr_printf(). Fix this by returning 1 on an invalid UDP
media address.

Fixes: d0f91938be ("tipc: add ip/udp media type")
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Reviewed-by: Tung Nguyen <tung.q.nguyen@endava.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-24 12:18:03 +01:00
Florian Westphal
a16909ae99 netfilter: nft_set_pipapo_avx2: disable softinterrupts
We need to disable softinterrupts, else we get following problem:

1. pipapo_avx2 called from process context; fpu usable
2. preempt_disable() called, pcpu scratchmap in use
3. softirq handles rx or tx, we re-enter pipapo_avx2
4. fpu busy, fallback to generic non-avx version
5. fallback reuses scratch map and index, which are in use
   by the preempted process

Handle this same way as generic version by first disabling
softinterrupts while the scratchmap is in use.

Fixes: f0b3d33806 ("netfilter: nft_set_pipapo_avx2: Add irq_fpu_usable() check, fallback to non-AVX2 version")
Cc: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-07-24 10:01:59 +02:00
James Chapman
d587d82542 l2tp: make session IDR and tunnel session list coherent
Modify l2tp_session_register and l2tp_session_unhash so that the
session IDR and tunnel session lists remain coherent. To do so, hold
the session IDR lock and the tunnel's session list lock when making
any changes to either list.

Without this change, a rare race condition could hit the WARN_ON_ONCE
in l2tp_session_unhash if a thread replaced the IDR entry while
another thread was registering the same ID.

 [ 7126.151795][T17511] WARNING: CPU: 3 PID: 17511 at net/l2tp/l2tp_core.c:1282 l2tp_session_delete.part.0+0x87e/0xbc0
 [ 7126.163754][T17511]  ? show_regs+0x93/0xa0
 [ 7126.164157][T17511]  ? __warn+0xe5/0x3c0
 [ 7126.164536][T17511]  ? l2tp_session_delete.part.0+0x87e/0xbc0
 [ 7126.165070][T17511]  ? report_bug+0x2e1/0x500
 [ 7126.165486][T17511]  ? l2tp_session_delete.part.0+0x87e/0xbc0
 [ 7126.166013][T17511]  ? handle_bug+0x99/0x130
 [ 7126.166428][T17511]  ? exc_invalid_op+0x35/0x80
 [ 7126.166890][T17511]  ? asm_exc_invalid_op+0x1a/0x20
 [ 7126.167372][T17511]  ? l2tp_session_delete.part.0+0x87d/0xbc0
 [ 7126.167900][T17511]  ? l2tp_session_delete.part.0+0x87e/0xbc0
 [ 7126.168429][T17511]  ? __local_bh_enable_ip+0xa4/0x120
 [ 7126.168917][T17511]  l2tp_session_delete+0x40/0x50
 [ 7126.169369][T17511]  pppol2tp_release+0x1a1/0x3f0
 [ 7126.169817][T17511]  __sock_release+0xb3/0x270
 [ 7126.170247][T17511]  ? __pfx_sock_close+0x10/0x10
 [ 7126.170697][T17511]  sock_close+0x1c/0x30
 [ 7126.171087][T17511]  __fput+0x40b/0xb90
 [ 7126.171470][T17511]  task_work_run+0x16c/0x260
 [ 7126.171897][T17511]  ? __pfx_task_work_run+0x10/0x10
 [ 7126.172362][T17511]  ? srso_alias_return_thunk+0x5/0xfbef5
 [ 7126.172863][T17511]  ? do_raw_spin_unlock+0x174/0x230
 [ 7126.173348][T17511]  do_exit+0xaae/0x2b40
 [ 7126.173730][T17511]  ? srso_alias_return_thunk+0x5/0xfbef5
 [ 7126.174235][T17511]  ? __pfx_lock_release+0x10/0x10
 [ 7126.174690][T17511]  ? srso_alias_return_thunk+0x5/0xfbef5
 [ 7126.175190][T17511]  ? do_raw_spin_lock+0x12c/0x2b0
 [ 7126.175650][T17511]  ? __pfx_do_exit+0x10/0x10
 [ 7126.176072][T17511]  ? _raw_spin_unlock_irq+0x23/0x50
 [ 7126.176543][T17511]  do_group_exit+0xd3/0x2a0
 [ 7126.176990][T17511]  __x64_sys_exit_group+0x3e/0x50
 [ 7126.177456][T17511]  x64_sys_call+0x1821/0x1830
 [ 7126.177895][T17511]  do_syscall_64+0xcb/0x250
 [ 7126.178317][T17511]  entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: aa5e17e1f5 ("l2tp: store l2tpv3 sessions in per-net IDR")
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: Tom Parkin <tparkin@katalix.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240718134348.289865-1-jchapman@katalix.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-23 11:24:46 +02:00
Ido Schimmel
cc73bbab4b ipv4: Fix incorrect source address in Record Route option
The Record Route IP option records the addresses of the routers that
routed the packet. In the case of forwarded packets, the kernel performs
a route lookup via fib_lookup() and fills in the preferred source
address of the matched route.

The lookup is performed with the DS field of the forwarded packet, but
using the RT_TOS() macro which only masks one of the two ECN bits. If
the packet is ECT(0) or CE, the matched route might be different than
the route via which the packet was forwarded as the input path masks
both of the ECN bits, resulting in the wrong address being filled in the
Record Route option.

Fix by masking both of the ECN bits.

Fixes: 8e36360ae8 ("ipv4: Remove route key identity dependencies in ip_rt_get_source().")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Link: https://patch.msgid.link/20240718123407.434778-1-idosch@nvidia.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-23 10:28:16 +02:00
Linus Torvalds
527eff227d Merge tag 'mm-nonmm-stable-2024-07-21-15-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:

 - In the series "treewide: Refactor heap related implementation",
   Kuan-Wei Chiu has significantly reworked the min_heap library code
   and has taught bcachefs to use the new more generic implementation.

 - Yury Norov's series "Cleanup cpumask.h inclusion in core headers"
   reworks the cpumask and nodemask headers to make things generally
   more rational.

 - Kuan-Wei Chiu has sent along some maintenance work against our
   sorting library code in the series "lib/sort: Optimizations and
   cleanups".

 - More library maintainance work from Christophe Jaillet in the series
   "Remove usage of the deprecated ida_simple_xx() API".

 - Ryusuke Konishi continues with the nilfs2 fixes and clanups in the
   series "nilfs2: eliminate the call to inode_attach_wb()".

 - Kuan-Ying Lee has some fixes to the gdb scripts in the series "Fix
   GDB command error".

 - Plus the usual shower of singleton patches all over the place. Please
   see the relevant changelogs for details.

* tag 'mm-nonmm-stable-2024-07-21-15-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (98 commits)
  ia64: scrub ia64 from poison.h
  watchdog/perf: properly initialize the turbo mode timestamp and rearm counter
  tsacct: replace strncpy() with strscpy()
  lib/bch.c: use swap() to improve code
  test_bpf: convert comma to semicolon
  init/modpost: conditionally check section mismatch to __meminit*
  init: remove unused __MEMINIT* macros
  nilfs2: Constify struct kobj_type
  nilfs2: avoid undefined behavior in nilfs_cnt32_ge macro
  math: rational: add missing MODULE_DESCRIPTION() macro
  lib/zlib: add missing MODULE_DESCRIPTION() macro
  fs: ufs: add MODULE_DESCRIPTION()
  lib/rbtree.c: fix the example typo
  ocfs2: add bounds checking to ocfs2_check_dir_entry()
  fs: add kernel-doc comments to ocfs2_prepare_orphan_dir()
  coredump: simplify zap_process()
  selftests/fpu: add missing MODULE_DESCRIPTION() macro
  compiler.h: simplify data_race() macro
  build-id: require program headers to be right after ELF header
  resource: add missing MODULE_DESCRIPTION()
  ...
2024-07-21 17:56:22 -07:00
Linus Torvalds
d7e78951a8 Merge tag 'net-6.11-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
 "Including fixes from netfilter.

  Notably this includes fixes for a s390 build breakage.

  Current release - new code bugs:

   - eth: fbnic: fix s390 build

   - eth: airoha: fix NULL pointer dereference in
     airoha_qdma_cleanup_rx_queue()

  Previous releases - regressions:

   - flow_dissector: use DEBUG_NET_WARN_ON_ONCE

   - ipv4: fix incorrect TOS in route get reply

   - dsa: fix chip-wide frame size config in some drivers

  Previous releases - always broken:

   - netfilter: nf_set_pipapo: fix initial map fill

   - eth: gve: fix XDP TX completion handling when counters overflow"

* tag 'net-6.11-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net:
  eth: fbnic: don't build the driver when skb has more than 21 frags
  net: dsa: b53: Limit chip-wide jumbo frame config to CPU ports
  net: dsa: mv88e6xxx: Limit chip-wide frame size config to CPU ports
  net: airoha: Fix NULL pointer dereference in airoha_qdma_cleanup_rx_queue()
  net: wwan: t7xx: add support for Dell DW5933e
  ipv4: Fix incorrect TOS in fibmatch route get reply
  ipv4: Fix incorrect TOS in route get reply
  net: flow_dissector: use DEBUG_NET_WARN_ON_ONCE
  driver core: auxiliary bus: Fix documentation of auxiliary_device
  net: airoha: fix error branch in airoha_dev_xmit and airoha_set_gdm_ports
  gve: Fix XDP TX completion handling when counters overflow
  ipvs: properly dereference pe in ip_vs_add_service
  selftests: netfilter: add test case for recent mismatch bug
  netfilter: nf_set_pipapo: fix initial map fill
  netfilter: ctnetlink: use helper function to calculate expect ID
  eth: fbnic: fix s390 build.
2024-07-19 14:58:12 -07:00
Linus Torvalds
f4f92db439 Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
 "Several new features here:

   - Virtio find vqs API has been reworked (required to fix the
     scalability issue we have with adminq, which I hope to merge later
     in the cycle)

   - vDPA driver for Marvell OCTEON

   - virtio fs performance improvement

   - mlx5 migration speedups

  Fixes, cleanups all over the place"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (56 commits)
  virtio: rename virtio_find_vqs_info() to virtio_find_vqs()
  virtio: remove unused virtio_find_vqs() and virtio_find_vqs_ctx() helpers
  virtio: convert the rest virtio_find_vqs() users to virtio_find_vqs_info()
  virtio_balloon: convert to use virtio_find_vqs_info()
  virtiofs: convert to use virtio_find_vqs_info()
  scsi: virtio_scsi: convert to use virtio_find_vqs_info()
  virtio_net: convert to use virtio_find_vqs_info()
  virtio_crypto: convert to use virtio_find_vqs_info()
  virtio_console: convert to use virtio_find_vqs_info()
  virtio_blk: convert to use virtio_find_vqs_info()
  virtio: rename find_vqs_info() op to find_vqs()
  virtio: remove the original find_vqs() op
  virtio: call virtio_find_vqs_info() from virtio_find_single_vq() directly
  virtio: convert find_vqs() op implementations to find_vqs_info()
  virtio_pci: convert vp_*find_vqs() ops to find_vqs_info()
  virtio: introduce virtio_queue_info struct and find_vqs_info() config op
  virtio: make virtio_find_single_vq() call virtio_find_vqs()
  virtio: make virtio_find_vqs() call virtio_find_vqs_ctx()
  caif_virtio: use virtio_find_single_vq() for single virtqueue finding
  vdpa/mlx5: Don't enable non-active VQs in .set_vq_ready()
  ...
2024-07-19 11:57:55 -07:00
Linus Torvalds
4f40c636b2 Merge tag 'nfs-for-6.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client updates from Anna Schumaker:
 "New Features:
   - Add support for large folios
   - Implement rpcrdma generic device removal notification
   - Add client support for attribute delegations
   - Use a LAYOUTRETURN during reboot recovery to report layoutstats
     and errors
   - Improve throughput for random buffered writes
   - Add NVMe support to pnfs/blocklayout

  Bugfixes:
   - Fix rpcrdma_reqs_reset()
   - Avoid soft lockups when using UDP
   - Fix an nfs/blocklayout premature PR key unregestration
   - Another fix for EXCHGID4_FLAG_USE_PNFS_DS for DS server
   - Do not extend writes to the entire folio
   - Pass explicit offset and count values to tracepoints
   - Fix a race to wake up sleeping SUNRPC sync tasks
   - Fix gss_status tracepoint output

  Cleanups:
   - Add missing MODULE_DESCRIPTION() macros
   - Add blocklayout / SCSI layout tracepoints
   - Remove asm-generic headers from xprtrdma verbs.c
   - Remove unused 'struct mnt_fhstatus'
   - Other delegation related cleanups
   - Other folio related cleanups
   - Other pNFS related cleanups
   - Other xprtrdma cleanups"

* tag 'nfs-for-6.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (63 commits)
  SUNRPC: Fixup gss_status tracepoint error output
  SUNRPC: Fix a race to wake a sync task
  nfs: split nfs_read_folio
  nfs: pass explicit offset/count to trace events
  nfs: do not extend writes to the entire folio
  nfs/blocklayout: add support for NVMe
  nfs: remove nfs_page_length
  nfs: remove the unused max_deviceinfo_size field from struct pnfs_layoutdriver_type
  nfs: don't reuse partially completed requests in nfs_lock_and_join_requests
  nfs: move nfs_wait_on_request to write.c
  nfs: fold nfs_page_group_lock_subrequests into nfs_lock_and_join_requests
  nfs: fold nfs_folio_find_and_lock_request into nfs_lock_and_join_requests
  nfs: simplify nfs_folio_find_and_lock_request
  nfs: remove nfs_folio_private_request
  nfs: remove dead code for the old swap over NFS implementation
  NFSv4.1 another fix for EXCHGID4_FLAG_USE_PNFS_DS for DS server
  nfs: Block on write congestion
  nfs: Properly initialize server->writeback
  nfs: Drop pointless check from nfs_commit_release_pages()
  nfs/blocklayout: SCSI layout trace points for reservation key reg/unreg
  ...
2024-07-18 17:17:30 -07:00
Benjamin Coddington
ed0172af5d SUNRPC: Fix a race to wake a sync task
We've observed NFS clients with sync tasks sleeping in __rpc_execute
waiting on RPC_TASK_QUEUED that have not responded to a wake-up from
rpc_make_runnable().  I suspect this problem usually goes unnoticed,
because on a busy client the task will eventually be re-awoken by another
task completion or xprt event.  However, if the state manager is draining
the slot table, a sync task missing a wake-up can result in a hung client.

We've been able to prove that the waker in rpc_make_runnable() successfully
calls wake_up_bit() (ie- there's no race to tk_runstate), but the
wake_up_bit() call fails to wake the waiter.  I suspect the waker is
missing the load of the bit's wait_queue_head, so waitqueue_active() is
false.  There are some very helpful comments about this problem above
wake_up_bit(), prepare_to_wait(), and waitqueue_active().

Fix this by inserting smp_mb__after_atomic() before the wake_up_bit(),
which pairs with prepare_to_wait() calling set_current_state().

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-18 10:49:12 -04:00
Paolo Abeni
a1b7dbca14 Merge tag 'nf-24-07-17' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter/IPVS fixes for net:

1) Call nf_expect_get_id() to delete expectation by ID. By trial and
   error it is possible to leak the LSB of the expectation address on
   x86_64. This bug is a leftover when converting the existing code
   to use nf_expect_get_id().

2) Incorrect initialization in pipapo set backend leads to packet
   mismatches. From Florian Westphal.

3) Extend netfilter's selftests to cover for the pipapo set backend,
   also from Florian.

4) Fix sparse warning in IPVS when adding service, from Chen Hanxiao.

netfilter pull request 24-07-17

* tag 'nf-24-07-17' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  ipvs: properly dereference pe in ip_vs_add_service
  selftests: netfilter: add test case for recent mismatch bug
  netfilter: nf_set_pipapo: fix initial map fill
  netfilter: ctnetlink: use helper function to calculate expect ID
====================

Link: https://patch.msgid.link/20240717215214.225394-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-18 13:28:34 +02:00
Ido Schimmel
f036e68212 ipv4: Fix incorrect TOS in fibmatch route get reply
The TOS value that is returned to user space in the route get reply is
the one with which the lookup was performed ('fl4->flowi4_tos'). This is
fine when the matched route is configured with a TOS as it would not
match if its TOS value did not match the one with which the lookup was
performed.

However, matching on TOS is only performed when the route's TOS is not
zero. It is therefore possible to have the kernel incorrectly return a
non-zero TOS:

 # ip link add name dummy1 up type dummy
 # ip address add 192.0.2.1/24 dev dummy1
 # ip route get fibmatch 192.0.2.2 tos 0xfc
 192.0.2.0/24 tos 0x1c dev dummy1 proto kernel scope link src 192.0.2.1

Fix by instead returning the DSCP field from the FIB result structure
which was populated during the route lookup.

Output after the patch:

 # ip link add name dummy1 up type dummy
 # ip address add 192.0.2.1/24 dev dummy1
 # ip route get fibmatch 192.0.2.2 tos 0xfc
 192.0.2.0/24 dev dummy1 proto kernel scope link src 192.0.2.1

Extend the existing selftests to not only verify that the correct route
is returned, but that it is also returned with correct "tos" value (or
without it).

Fixes: b61798130f ("net: ipv4: RTM_GETROUTE: return matched fib result when requested")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-18 11:11:02 +02:00
Ido Schimmel
338bb57e4c ipv4: Fix incorrect TOS in route get reply
The TOS value that is returned to user space in the route get reply is
the one with which the lookup was performed ('fl4->flowi4_tos'). This is
fine when the matched route is configured with a TOS as it would not
match if its TOS value did not match the one with which the lookup was
performed.

However, matching on TOS is only performed when the route's TOS is not
zero. It is therefore possible to have the kernel incorrectly return a
non-zero TOS:

 # ip link add name dummy1 up type dummy
 # ip address add 192.0.2.1/24 dev dummy1
 # ip route get 192.0.2.2 tos 0xfc
 192.0.2.2 tos 0x1c dev dummy1 src 192.0.2.1 uid 0
     cache

Fix by adding a DSCP field to the FIB result structure (inside an
existing 4 bytes hole), populating it in the route lookup and using it
when filling the route get reply.

Output after the patch:

 # ip link add name dummy1 up type dummy
 # ip address add 192.0.2.1/24 dev dummy1
 # ip route get 192.0.2.2 tos 0xfc
 192.0.2.2 dev dummy1 src 192.0.2.1 uid 0
     cache

Fixes: 1a00fee4ff ("ipv4: Remove rt_key_{src,dst,tos} from struct rtable.")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-18 11:11:02 +02:00
Pablo Neira Ayuso
120f1c857a net: flow_dissector: use DEBUG_NET_WARN_ON_ONCE
The following splat is easy to reproduce upstream as well as in -stable
kernels. Florian Westphal provided the following commit:

  d1dab4f71d ("net: add and use __skb_get_hash_symmetric_net")

but this complementary fix has been also suggested by Willem de Bruijn
and it can be easily backported to -stable kernel which consists in
using DEBUG_NET_WARN_ON_ONCE instead to silence the following splat
given __skb_get_hash() is used by the nftables tracing infrastructure to
to identify packets in traces.

[69133.561393] ------------[ cut here ]------------
[69133.561404] WARNING: CPU: 0 PID: 43576 at net/core/flow_dissector.c:1104 __skb_flow_dissect+0x134f/
[...]
[69133.561944] CPU: 0 PID: 43576 Comm: socat Not tainted 6.10.0-rc7+ #379
[69133.561959] RIP: 0010:__skb_flow_dissect+0x134f/0x2ad0
[69133.561970] Code: 83 f9 04 0f 84 b3 00 00 00 45 85 c9 0f 84 aa 00 00 00 41 83 f9 02 0f 84 81 fc ff
ff 44 0f b7 b4 24 80 00 00 00 e9 8b f9 ff ff <0f> 0b e9 20 f3 ff ff 41 f6 c6 20 0f 84 e4 ef ff ff 48 8d 7b 12 e8
[69133.561979] RSP: 0018:ffffc90000006fc0 EFLAGS: 00010246
[69133.561988] RAX: 0000000000000000 RBX: ffffffff82f33e20 RCX: ffffffff81ab7e19
[69133.561994] RDX: dffffc0000000000 RSI: ffffc90000007388 RDI: ffff888103a1b418
[69133.562001] RBP: ffffc90000007310 R08: 0000000000000000 R09: 0000000000000000
[69133.562007] R10: ffffc90000007388 R11: ffffffff810cface R12: ffff888103a1b400
[69133.562013] R13: 0000000000000000 R14: ffffffff82f33e2a R15: ffffffff82f33e28
[69133.562020] FS:  00007f40f7131740(0000) GS:ffff888390800000(0000) knlGS:0000000000000000
[69133.562027] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[69133.562033] CR2: 00007f40f7346ee0 CR3: 000000015d200001 CR4: 00000000001706f0
[69133.562040] Call Trace:
[69133.562044]  <IRQ>
[69133.562049]  ? __warn+0x9f/0x1a0
[ 1211.841384]  ? __skb_flow_dissect+0x107e/0x2860
[...]
[ 1211.841496]  ? bpf_flow_dissect+0x160/0x160
[ 1211.841753]  __skb_get_hash+0x97/0x280
[ 1211.841765]  ? __skb_get_hash_symmetric+0x230/0x230
[ 1211.841776]  ? mod_find+0xbf/0xe0
[ 1211.841786]  ? get_stack_info_noinstr+0x12/0xe0
[ 1211.841798]  ? bpf_ksym_find+0x56/0xe0
[ 1211.841807]  ? __rcu_read_unlock+0x2a/0x70
[ 1211.841819]  nft_trace_init+0x1b9/0x1c0 [nf_tables]
[ 1211.841895]  ? nft_trace_notify+0x830/0x830 [nf_tables]
[ 1211.841964]  ? get_stack_info+0x2b/0x80
[ 1211.841975]  ? nft_do_chain_arp+0x80/0x80 [nf_tables]
[ 1211.842044]  nft_do_chain+0x79c/0x850 [nf_tables]

Fixes: 9b52e3f267 ("flow_dissector: handle no-skb use case")
Suggested-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20240715141442.43775-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-18 10:52:17 +02:00