linux/include/net
Eric Dumazet 68822bdf76 net: generalize skb freeing deferral to per-cpu lists
Logic added in commit f35f821935 ("tcp: defer skb freeing after socket
lock is released") helped bulk TCP flows to move the cost of skbs
frees outside of critical section where socket lock was held.

But for RPC traffic, or hosts with RFS enabled, the solution is far from
being ideal.

For RPC traffic, recvmsg() has to return to user space right after
skb payload has been consumed, meaning that BH handler has no chance
to pick the skb before recvmsg() thread. This issue is more visible
with BIG TCP, as more RPC fit one skb.

For RFS, even if BH handler picks the skbs, they are still picked
from the cpu on which user thread is running.

Ideally, it is better to free the skbs (and associated page frags)
on the cpu that originally allocated them.

This patch removes the per socket anchor (sk->defer_list) and
instead uses a per-cpu list, which will hold more skbs per round.

This new per-cpu list is drained at the end of net_action_rx(),
after incoming packets have been processed, to lower latencies.

In normal conditions, skbs are added to the per-cpu list with
no further action. In the (unlikely) cases where the cpu does not
run net_action_rx() handler fast enough, we use an IPI to raise
NET_RX_SOFTIRQ on the remote cpu.

Also, we do not bother draining the per-cpu list from dev_cpu_dead()
This is because skbs in this list have no requirement on how fast
they should be freed.

Note that we can add in the future a small per-cpu cache
if we see any contention on sd->defer_lock.

Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
page recycling strategy used by NIC driver (its page pool capacity
being too small compared to number of skbs/pages held in sockets
receive queues)

Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test.
These conditions can happen in more general production workload.

10 runs of one TCP_STREAM flow

Before:
Average throughput: 49685 Mbit.

Kernel profiles on cpu running user thread recvmsg() show high cost for
skb freeing related functions (*)

    57.81%  [kernel]       [k] copy_user_enhanced_fast_string
(*) 12.87%  [kernel]       [k] skb_release_data
(*)  4.25%  [kernel]       [k] __free_one_page
(*)  3.57%  [kernel]       [k] __list_del_entry_valid
     1.85%  [kernel]       [k] __netif_receive_skb_core
     1.60%  [kernel]       [k] __skb_datagram_iter
(*)  1.59%  [kernel]       [k] free_unref_page_commit
(*)  1.16%  [kernel]       [k] __slab_free
     1.16%  [kernel]       [k] _copy_to_iter
(*)  1.01%  [kernel]       [k] kfree
(*)  0.88%  [kernel]       [k] free_unref_page
     0.57%  [kernel]       [k] ip6_rcv_core
     0.55%  [kernel]       [k] ip6t_do_table
     0.54%  [kernel]       [k] flush_smp_call_function_queue
(*)  0.54%  [kernel]       [k] free_pcppages_bulk
     0.51%  [kernel]       [k] llist_reverse_order
     0.38%  [kernel]       [k] process_backlog
(*)  0.38%  [kernel]       [k] free_pcp_prepare
     0.37%  [kernel]       [k] tcp_recvmsg_locked
(*)  0.37%  [kernel]       [k] __list_add_valid
     0.34%  [kernel]       [k] sock_rfree
     0.34%  [kernel]       [k] _raw_spin_lock_irq
(*)  0.33%  [kernel]       [k] __page_cache_release
     0.33%  [kernel]       [k] tcp_v6_rcv
(*)  0.33%  [kernel]       [k] __put_page
(*)  0.29%  [kernel]       [k] __mod_zone_page_state
     0.27%  [kernel]       [k] _raw_spin_lock

After patch:
Average throughput: 73076 Mbit.

Kernel profiles on cpu running user thread recvmsg() looks better:

    81.35%  [kernel]       [k] copy_user_enhanced_fast_string
     1.95%  [kernel]       [k] _copy_to_iter
     1.95%  [kernel]       [k] __skb_datagram_iter
     1.27%  [kernel]       [k] __netif_receive_skb_core
     1.03%  [kernel]       [k] ip6t_do_table
     0.60%  [kernel]       [k] sock_rfree
     0.50%  [kernel]       [k] tcp_v6_rcv
     0.47%  [kernel]       [k] ip6_rcv_core
     0.45%  [kernel]       [k] read_tsc
     0.44%  [kernel]       [k] _raw_spin_lock_irqsave
     0.37%  [kernel]       [k] _raw_spin_lock
     0.37%  [kernel]       [k] native_irq_return_iret
     0.33%  [kernel]       [k] __inet6_lookup_established
     0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
     0.29%  [kernel]       [k] tcp_rcv_established
     0.29%  [kernel]       [k] llist_reverse_order

v2: kdoc issue (kernel bots)
    do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
    replace the sk_buff_head with a single-linked list (Jakub)
    add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-26 17:05:59 -07:00
..
9p net/p9: load default transports 2022-01-10 10:00:09 +09:00
bluetooth Networking changes for 5.18. 2022-03-24 13:13:26 -07:00
caif net: remove the caif_hsi driver 2021-07-01 13:19:48 -07:00
iucv net/af_iucv: Use struct_group() to zero struct iucv_sock region 2021-11-19 11:52:25 +00:00
netfilter netfilter: ecache: move to separate structure 2022-04-08 12:08:58 +02:00
netns ipv6: make ip6_rt_gc_expire an atomic_t 2022-04-15 14:28:50 -07:00
nfc NFC: add NCI_UNREG flag to eliminate the race 2021-11-17 20:17:05 -08:00
phonet
sctp net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
tc_act net: sched: support hash selecting tx queue 2022-04-19 12:20:45 +02:00
6lowpan.h 6lowpan: Replace zero-length array with flexible-array member 2020-02-28 14:51:30 +01:00
Space.h wan: remove sbni/granch driver 2021-08-03 13:05:26 +01:00
act_api.h net/sched: act_api: Add extack to offload_act_setup() callback 2022-04-08 13:45:43 +01:00
addrconf.h net: Add new protocol attribute to IP addresses 2022-02-18 21:20:06 -08:00
af_ieee802154.h
af_rxrpc.h afs: Don't truncate iter during data fetch 2021-04-23 10:17:26 +01:00
af_unix.h af_unix: Replace the big lock with small locks. 2021-11-26 18:01:58 -08:00
af_vsock.h vsock: each transport cycles only on its own sockets 2022-03-11 23:14:19 -08:00
ah.h
amt.h amt: add mld report message handler 2021-11-01 13:36:09 +00:00
arp.h ipv4: Invalidate neighbour for broadcast address upon address addition 2022-02-21 11:44:30 +00:00
atmclip.h
ax25.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-02-03 17:36:16 -08:00
ax88796.h ax88796: export ax_NS8390_init() hook 2021-08-03 13:05:25 +01:00
bareudp.h bareudp: Move definition of struct bareudp_conf to bareudp.c 2021-12-13 12:34:09 +00:00
bond_3ad.h bonding: fix data-races around agg_select_timer 2022-02-15 14:35:18 +00:00
bond_alb.h bonding: make tx_rebalance_counter an atomic 2021-12-03 14:16:48 +00:00
bond_options.h bonding: add new option ns_ip6_target 2022-02-21 12:13:45 +00:00
bonding.h net: add per-cpu storage and net->core_stats 2022-03-11 23:17:24 -08:00
bpf_sk_storage.h bpf: struct sock is declared twice in bpf_sk_storage header 2021-03-26 17:43:55 +01:00
busy_poll.h tcp: fix another uninit-value (sk_rx_queue_mapping) 2021-12-03 14:15:49 +00:00
calipso.h
cfg80211-wext.h
cfg80211.h cfg80211: Support configuration of station EHT capabilities 2022-02-16 15:43:25 +01:00
cfg802154.h net: ieee802154: Provide a kdoc to the address structure 2022-02-01 21:03:48 +01:00
checksum.h powerpc/net: Implement powerpc specific csum_shift() to remove branch 2022-03-11 10:57:22 +00:00
cipso_ipv4.h cipso: Remove unused inline functions 2020-07-15 07:45:24 -07:00
cls_cgroup.h bpf: Allow to retrieve cgroup v1 classid from v2 hooks 2020-03-27 19:40:38 -07:00
codel.h codel: remove unnecessary pkt_sched.h include 2021-12-22 15:03:51 -08:00
codel_impl.h codel: remove unnecessary sock.h include 2021-12-22 15:03:47 -08:00
codel_qdisc.h codel: remove unnecessary pkt_sched.h include 2021-12-22 15:03:51 -08:00
compat.h net/ipv4/ipv6: Replace one-element arraya with flexible-array members 2021-08-05 11:46:42 +01:00
datalink.h llc/snap: constify dev_addr passing 2021-10-13 09:40:46 -07:00
dcbevent.h
dcbnl.h
devlink.h devlink: introduce line card device info infrastructure 2022-04-25 10:42:28 +01:00
dn.h decnet: constify dev_addr passing 2021-10-13 09:40:46 -07:00
dn_dev.h
dn_fib.h net: convert fib_treeref from int to refcount_t 2021-07-30 15:33:24 +02:00
dn_neigh.h
dn_nsp.h
dn_route.h
dsa.h net: dsa: pass extack to dsa_switch_ops :: port_mirror_add() 2022-03-17 17:42:47 -07:00
dsfield.h ipv6: Annotate bitwise IPv6 dsfield pointer cast 2019-12-16 16:09:44 -08:00
dst.h net: dst: add net device refcount tracking to dst_entry 2021-12-06 16:05:10 -08:00
dst_cache.h wireguard: device: reset peer src endpoint when netns exits 2021-11-29 19:50:45 -08:00
dst_metadata.h net: fix a memleak when uncloning an skb dst and its metadata 2022-02-09 11:41:47 +00:00
dst_ops.h net/dst: use a smaller percpu_counter batch for dst entries accounting 2020-05-08 21:33:33 -07:00
erspan.h erspan: Add type I version 0 support. 2020-05-05 13:23:29 -07:00
esp.h esp: limit skb_page_frag_refill use to a single page 2022-04-13 10:16:11 +02:00
espintcp.h xfrm: espintcp: save and call old ->sk_destruct 2020-04-20 07:34:16 +02:00
ethoc.h
failover.h net: failover: add net device refcount tracker 2021-12-06 16:06:02 -08:00
fib_notifier.h ipv6: Remove old route notifications and convert listeners 2019-12-24 22:37:30 -08:00
fib_rules.h fib: expand fib_rule_policy 2021-12-16 07:18:35 -08:00
firewire.h
flow.h net: Add l3mdev index to flow struct and avoid oif reset for port devices 2022-03-15 20:20:02 -07:00
flow_dissector.h flow_dissector: Add number of vlan tags dissector 2022-04-20 11:09:13 +01:00
flow_offload.h net/sched: add vlan push_eth and pop_eth action to the hardware IR 2022-03-16 19:59:36 -07:00
fou.h
fq.h net/fq_impl: do not maintain a backlog-sorted list of flows 2021-01-21 13:33:45 +01:00
fq_impl.h net/fq_impl: do not maintain a backlog-sorted list of flows 2021-01-21 13:33:45 +01:00
garp.h treewide: Use sizeof_field() macro 2019-12-09 10:36:44 -08:00
gen_stats.h net: sched: Remove Qdisc::running sequence counter 2021-10-18 12:54:41 +01:00
genetlink.h mptcp: avoid lock_fast usage in accept path 2021-02-12 16:31:46 -08:00
geneve.h
gre.h ip_gre: add csum offload support for gre header 2021-01-29 20:39:14 -08:00
gro.h net: gro: Fix a 'directive in macro's argument list' sparse warning 2022-02-18 11:00:25 +00:00
gro_cells.h
gtp.h gtp: Add support for checking GTP device type 2022-03-11 08:28:27 -08:00
gue.h GUE: Fix a typo 2020-06-22 21:12:44 -07:00
hwbm.h net: hwbm: if CONFIG_NET_HWBM unset, make stub functions static 2019-10-25 16:24:32 -07:00
icmp.h ipv6: ICMPV6: add response to ICMPV6 RFC 8335 PROBE messages 2021-06-28 14:29:45 -07:00
ieee80211_radiotap.h ieee80211: radiotap: fix -Wcast-qual warnings 2022-02-04 16:25:21 +01:00
ieee802154_netdev.h
if_inet6.h ipv6: fix locking issues with loops over idev->addr_list 2022-04-06 22:09:39 -07:00
ife.h
ila.h
inet6_connection_sock.h
inet6_hashtables.h net: Track socket refcounts in skb_steal_sock() 2020-03-30 13:45:04 -07:00
inet_common.h bpf: Allow rewriting to ports under ip_unprivileged_port_start 2021-01-27 18:18:15 -08:00
inet_connection_sock.h tcp: Use BPF timeout setting for SYN ACK RTO 2022-02-02 14:45:18 +00:00
inet_dscp.h ipv6: Define dscp_t and stop taking ECN bits into account in fib6-rules 2022-02-07 20:12:45 -08:00
inet_ecn.h net: add skb_get_dsfield() helper 2021-10-15 11:33:08 +01:00
inet_frag.h net: ip: Handle delivery_time in ip defrag 2022-03-03 14:38:48 +00:00
inet_hashtables.h tcp: seq_file: Replace listening_hash with lhash2 2021-07-23 16:44:57 -07:00
inet_sock.h ipv4/raw: support binding to nonlocal addresses 2021-11-17 20:21:52 -08:00
inet_timewait_sock.h tcp/dccp: get rid of inet_twsk_purge() 2022-01-25 11:25:21 +00:00
inetpeer.h
ioam6.h treewide: Replace zero-length arrays with flexible-array members 2022-02-17 07:00:39 -06:00
ip.h ipv4: Make ip_idents_reserve static 2022-01-31 11:33:10 +00:00
ip6_checksum.h net: move gro definitions to include/net/gro.h 2021-11-16 13:16:54 +00:00
ip6_fib.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-02-17 11:44:20 -08:00
ip6_route.h ipv6: ip6_skb_dst_mtu() cleanups 2021-11-19 20:09:55 -08:00
ip6_tunnel.h ipv6: add net device refcount tracker to struct ip6_tnl 2021-12-06 16:05:11 -08:00
ip_fib.h ipv4: Use dscp_t in struct fib_entry_notifier_info 2022-04-11 17:37:50 -07:00
ip_tunnels.h net: Handle l3mdev in ip_tunnel_init_flow 2022-04-15 14:27:30 -07:00
ip_vs.h ipvs: add sysctl_run_estimation to support disable estimation 2021-10-07 19:52:58 +02:00
ipcomp.h
ipconfig.h
ipv6.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-02-17 11:44:20 -08:00
ipv6_frag.h net: don't include ndisc.h from ipv6.h 2022-02-04 14:15:11 -08:00
ipv6_stubs.h net: ipv6: add fib6_nh_release_dsts stub 2021-11-22 15:44:49 +00:00
iw_handler.h
kcm.h
l3mdev.h l3mdev: add infrastructure for table to VRF mapping 2020-06-20 17:22:22 -07:00
lag.h
lapb.h net: lapb: Make "lapb_t1timer_running" able to detect an already running timer 2021-03-23 14:14:50 -07:00
lib80211.h
llc.h llc: fix out-of-bound array index in llc_sk_dev_hash() 2021-11-07 19:25:29 +00:00
llc_c_ac.h
llc_c_ev.h
llc_c_st.h
llc_conn.h llc: add net device refcount tracker 2021-12-07 20:44:59 -08:00
llc_if.h llc/snap: constify dev_addr passing 2021-10-13 09:40:46 -07:00
llc_pdu.h net: llc: fix skb_over_panic 2021-07-27 13:05:56 +01:00
llc_s_ac.h
llc_s_ev.h
llc_s_st.h
llc_sap.h
lwtunnel.h netfilter: add netfilter hooks to SRv6 data plane 2021-08-30 01:51:36 +02:00
mac80211.h mac80211: MBSSID beacon handling in AP mode 2022-03-15 11:36:26 +01:00
mac802154.h net: mac802154: Explain the use of ieee802154_wake/stop_queue() 2022-01-28 11:23:41 +01:00
macsec.h net: macsec: fix the length used to copy the key for offloading 2021-06-24 12:41:12 -07:00
mctp.h mctp: Use output netdev to allocate skb headroom 2022-04-01 12:04:15 +01:00
mctpdevice.h mctp: Pass flow data & flow release events to drivers 2021-10-29 13:23:51 +01:00
mip6.h net: mip6: Replace zero-length array with flexible-array member 2020-03-02 11:16:27 -08:00
mld.h mld: add new workqueues for process mld events 2021-03-26 15:14:56 -07:00
mpls.h net: Make mpls_entry_encode() available for generic users 2020-05-29 21:20:20 -07:00
mpls_iptunnel.h net: mpls: Replace zero-length array with flexible-array member 2020-02-28 12:08:37 -08:00
mptcp.h mptcp: infinite mapping sending 2022-04-23 11:51:05 +01:00
mrp.h treewide: Use sizeof_field() macro 2019-12-09 10:36:44 -08:00
ncsi.h
ndisc.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-03-03 11:55:12 -08:00
neighbour.h net, neigh: Do not trigger immediate probes on NUD_FAILED from neigh_managed_work 2022-02-02 20:30:18 -08:00
net_failover.h
net_namespace.h net: make net->dev_unreg_count atomic 2022-02-10 15:30:26 +00:00
net_ratelimit.h
net_trackers.h net: add networking namespace refcount tracker 2021-12-10 06:38:26 -08:00
netevent.h
netlabel.h
netlink.h net: netlink: add the case when nlh is NULL 2021-07-27 11:43:50 +01:00
netprio_cgroup.h netprio: use css ID instead of cgroup ID 2019-11-12 08:18:03 -08:00
netrom.h
nexthop.h net: ipv4: Fix rtnexthop len when RTA_FLOW is present 2021-09-24 14:07:10 +01:00
nl802154.h net: ieee802154: handle iftypes as u32 2021-11-16 18:02:46 +01:00
nsh.h
p8022.h
page_pool.h net: page_pool: introduce ethtool stats 2022-04-15 10:43:47 +01:00
pie.h pie: realign comment 2020-03-04 13:25:55 -08:00
ping.h net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
pkt_cls.h net/sched: act_api: Add extack to offload_act_setup() callback 2022-04-08 13:45:43 +01:00
pkt_sched.h net: sched: remove psched_tdiff_bounded() 2022-01-27 13:53:27 +00:00
pptp.h
protocol.h net: Remove the member netns_ok 2021-05-17 15:29:35 -07:00
psample.h psample: Add a fwd declaration for skbuff 2021-08-09 15:34:21 -07:00
psnap.h
raw.h
rawv6.h
red.h sch_red: fix off-by-one checks in red_check_params() 2021-03-25 17:40:43 -07:00
regulatory.h net/wireless: regulatory.h: drop duplicate word in comment 2020-07-31 09:24:23 +02:00
request_sock.h tcp: Use BPF timeout setting for SYN ACK RTO 2022-02-02 14:45:18 +00:00
rose.h rose: constify dev_addr passing 2021-10-13 09:40:45 -07:00
route.h ipv4: Avoid using RTO_ONLINK with ip_route_connect(). 2022-04-22 13:06:03 +01:00
rpl.h net: ipv6: Use struct_size() helper and kcalloc() 2020-06-23 20:27:09 -07:00
rsi_91x.h
rtnetlink.h net: rtnetlink: add bulk delete support flag 2022-04-13 12:46:26 +01:00
rtnh.h
sch_generic.h net: sched: remove qdisc_qlen_cpu() 2022-01-27 13:53:27 +00:00
scm.h fs: Move __scm_install_fd() to __receive_fd() 2020-07-13 11:03:44 -07:00
secure_seq.h
seg6.h udp6: Use Segment Routing Header for dest address if present 2022-01-04 12:17:35 +00:00
seg6_hmac.h
seg6_local.h
selftests.h net: selftest: fix build issue if INET is disabled 2021-04-28 14:06:45 -07:00
slhc_vj.h
smc.h net/smc: introduce CHID callback for ISM devices 2020-09-28 15:19:03 -07:00
snmp.h net/tls: add skeleton of MIB statistics 2019-10-05 16:29:00 -07:00
sock.h net: generalize skb freeing deferral to per-cpu lists 2022-04-26 17:05:59 -07:00
sock_reuseport.h tcp: Add reuseport_migrate_sock() to select a new listener. 2021-06-15 18:01:05 +02:00
stp.h
strparser.h tls: rx: don't store the decryption status in socket context 2022-04-08 11:49:08 +01:00
switchdev.h net: bridge: mst: Notify switchdev drivers of MST state changes 2022-03-17 16:49:58 -07:00
tcp.h net: generalize skb freeing deferral to per-cpu lists 2022-04-26 17:05:59 -07:00
tcp_states.h
timewait_sock.h
tipc.h
tls.h net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
tls_toe.h net/tls: rename tls_hw_* functions tls_toe_* 2019-10-04 14:07:07 -07:00
transp_v6.h tcp: move ipv4_specific to tcp include file 2020-06-23 20:10:15 -07:00
tso.h net: tso: cache transport header length 2020-06-18 20:46:23 -07:00
tun_proto.h
udp.h net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
udp_tunnel.h udp: call udp_encap_enable for v6 sockets when enabling encap 2021-02-04 18:37:14 -08:00
udplite.h udplite: remove udplite_csum_outgoing() 2022-01-27 13:53:27 +00:00
vsock_addr.h vsock: remove include/linux/vm_sockets.h file 2019-11-14 18:12:17 -08:00
vxlan.h drivers: vxlan: vnifilter: per vni stats 2022-03-01 08:38:02 +00:00
wext.h
x25.h net/x25: add new state X25_STATE_5 2019-12-09 10:28:43 -08:00
x25device.h
xdp.h net: veth: Account total xdp_frame len running ndo_xdp_xmit 2022-03-17 20:33:52 +01:00
xdp_priv.h xsk: Wipe out dead zero_copy_allocator declarations 2021-12-14 00:24:24 +01:00
xdp_sock.h net: Don't include filter.h from net/sock.h 2021-12-29 08:48:14 -08:00
xdp_sock_drv.h i40e: xsk: Move tmp desc array from driver to pool 2022-01-27 17:25:32 +01:00
xfrm.h Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ 2022-03-19 14:49:08 +00:00
xsk_buff_pool.h i40e: xsk: Move tmp desc array from driver to pool 2022-01-27 17:25:32 +01:00