linux/include/net
Eric Dumazet 68822bdf76 net: generalize skb freeing deferral to per-cpu lists
Logic added in commit f35f821935 ("tcp: defer skb freeing after socket
lock is released") helped bulk TCP flows to move the cost of skbs
frees outside of critical section where socket lock was held.

But for RPC traffic, or hosts with RFS enabled, the solution is far from
being ideal.

For RPC traffic, recvmsg() has to return to user space right after
skb payload has been consumed, meaning that BH handler has no chance
to pick the skb before recvmsg() thread. This issue is more visible
with BIG TCP, as more RPC fit one skb.

For RFS, even if BH handler picks the skbs, they are still picked
from the cpu on which user thread is running.

Ideally, it is better to free the skbs (and associated page frags)
on the cpu that originally allocated them.

This patch removes the per socket anchor (sk->defer_list) and
instead uses a per-cpu list, which will hold more skbs per round.

This new per-cpu list is drained at the end of net_action_rx(),
after incoming packets have been processed, to lower latencies.

In normal conditions, skbs are added to the per-cpu list with
no further action. In the (unlikely) cases where the cpu does not
run net_action_rx() handler fast enough, we use an IPI to raise
NET_RX_SOFTIRQ on the remote cpu.

Also, we do not bother draining the per-cpu list from dev_cpu_dead()
This is because skbs in this list have no requirement on how fast
they should be freed.

Note that we can add in the future a small per-cpu cache
if we see any contention on sd->defer_lock.

Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
page recycling strategy used by NIC driver (its page pool capacity
being too small compared to number of skbs/pages held in sockets
receive queues)

Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test.
These conditions can happen in more general production workload.

10 runs of one TCP_STREAM flow

Before:
Average throughput: 49685 Mbit.

Kernel profiles on cpu running user thread recvmsg() show high cost for
skb freeing related functions (*)

    57.81%  [kernel]       [k] copy_user_enhanced_fast_string
(*) 12.87%  [kernel]       [k] skb_release_data
(*)  4.25%  [kernel]       [k] __free_one_page
(*)  3.57%  [kernel]       [k] __list_del_entry_valid
     1.85%  [kernel]       [k] __netif_receive_skb_core
     1.60%  [kernel]       [k] __skb_datagram_iter
(*)  1.59%  [kernel]       [k] free_unref_page_commit
(*)  1.16%  [kernel]       [k] __slab_free
     1.16%  [kernel]       [k] _copy_to_iter
(*)  1.01%  [kernel]       [k] kfree
(*)  0.88%  [kernel]       [k] free_unref_page
     0.57%  [kernel]       [k] ip6_rcv_core
     0.55%  [kernel]       [k] ip6t_do_table
     0.54%  [kernel]       [k] flush_smp_call_function_queue
(*)  0.54%  [kernel]       [k] free_pcppages_bulk
     0.51%  [kernel]       [k] llist_reverse_order
     0.38%  [kernel]       [k] process_backlog
(*)  0.38%  [kernel]       [k] free_pcp_prepare
     0.37%  [kernel]       [k] tcp_recvmsg_locked
(*)  0.37%  [kernel]       [k] __list_add_valid
     0.34%  [kernel]       [k] sock_rfree
     0.34%  [kernel]       [k] _raw_spin_lock_irq
(*)  0.33%  [kernel]       [k] __page_cache_release
     0.33%  [kernel]       [k] tcp_v6_rcv
(*)  0.33%  [kernel]       [k] __put_page
(*)  0.29%  [kernel]       [k] __mod_zone_page_state
     0.27%  [kernel]       [k] _raw_spin_lock

After patch:
Average throughput: 73076 Mbit.

Kernel profiles on cpu running user thread recvmsg() looks better:

    81.35%  [kernel]       [k] copy_user_enhanced_fast_string
     1.95%  [kernel]       [k] _copy_to_iter
     1.95%  [kernel]       [k] __skb_datagram_iter
     1.27%  [kernel]       [k] __netif_receive_skb_core
     1.03%  [kernel]       [k] ip6t_do_table
     0.60%  [kernel]       [k] sock_rfree
     0.50%  [kernel]       [k] tcp_v6_rcv
     0.47%  [kernel]       [k] ip6_rcv_core
     0.45%  [kernel]       [k] read_tsc
     0.44%  [kernel]       [k] _raw_spin_lock_irqsave
     0.37%  [kernel]       [k] _raw_spin_lock
     0.37%  [kernel]       [k] native_irq_return_iret
     0.33%  [kernel]       [k] __inet6_lookup_established
     0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
     0.29%  [kernel]       [k] tcp_rcv_established
     0.29%  [kernel]       [k] llist_reverse_order

v2: kdoc issue (kernel bots)
    do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
    replace the sk_buff_head with a single-linked list (Jakub)
    add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-26 17:05:59 -07:00
..
9p net/p9: load default transports 2022-01-10 10:00:09 +09:00
bluetooth Networking changes for 5.18. 2022-03-24 13:13:26 -07:00
caif net: remove the caif_hsi driver 2021-07-01 13:19:48 -07:00
iucv net/af_iucv: Use struct_group() to zero struct iucv_sock region 2021-11-19 11:52:25 +00:00
netfilter netfilter: ecache: move to separate structure 2022-04-08 12:08:58 +02:00
netns ipv6: make ip6_rt_gc_expire an atomic_t 2022-04-15 14:28:50 -07:00
nfc NFC: add NCI_UNREG flag to eliminate the race 2021-11-17 20:17:05 -08:00
phonet
sctp net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
tc_act net: sched: support hash selecting tx queue 2022-04-19 12:20:45 +02:00
6lowpan.h
act_api.h net/sched: act_api: Add extack to offload_act_setup() callback 2022-04-08 13:45:43 +01:00
addrconf.h net: Add new protocol attribute to IP addresses 2022-02-18 21:20:06 -08:00
af_ieee802154.h
af_rxrpc.h afs: Don't truncate iter during data fetch 2021-04-23 10:17:26 +01:00
af_unix.h af_unix: Replace the big lock with small locks. 2021-11-26 18:01:58 -08:00
af_vsock.h vsock: each transport cycles only on its own sockets 2022-03-11 23:14:19 -08:00
ah.h
amt.h amt: add mld report message handler 2021-11-01 13:36:09 +00:00
arp.h ipv4: Invalidate neighbour for broadcast address upon address addition 2022-02-21 11:44:30 +00:00
atmclip.h
ax25.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-02-03 17:36:16 -08:00
ax88796.h ax88796: export ax_NS8390_init() hook 2021-08-03 13:05:25 +01:00
bareudp.h bareudp: Move definition of struct bareudp_conf to bareudp.c 2021-12-13 12:34:09 +00:00
bond_3ad.h bonding: fix data-races around agg_select_timer 2022-02-15 14:35:18 +00:00
bond_alb.h bonding: make tx_rebalance_counter an atomic 2021-12-03 14:16:48 +00:00
bond_options.h bonding: add new option ns_ip6_target 2022-02-21 12:13:45 +00:00
bonding.h net: add per-cpu storage and net->core_stats 2022-03-11 23:17:24 -08:00
bpf_sk_storage.h bpf: struct sock is declared twice in bpf_sk_storage header 2021-03-26 17:43:55 +01:00
busy_poll.h tcp: fix another uninit-value (sk_rx_queue_mapping) 2021-12-03 14:15:49 +00:00
calipso.h
cfg80211-wext.h
cfg80211.h cfg80211: Support configuration of station EHT capabilities 2022-02-16 15:43:25 +01:00
cfg802154.h net: ieee802154: Provide a kdoc to the address structure 2022-02-01 21:03:48 +01:00
checksum.h powerpc/net: Implement powerpc specific csum_shift() to remove branch 2022-03-11 10:57:22 +00:00
cipso_ipv4.h
cls_cgroup.h
codel_impl.h codel: remove unnecessary sock.h include 2021-12-22 15:03:47 -08:00
codel_qdisc.h codel: remove unnecessary pkt_sched.h include 2021-12-22 15:03:51 -08:00
codel.h codel: remove unnecessary pkt_sched.h include 2021-12-22 15:03:51 -08:00
compat.h net/ipv4/ipv6: Replace one-element arraya with flexible-array members 2021-08-05 11:46:42 +01:00
datalink.h llc/snap: constify dev_addr passing 2021-10-13 09:40:46 -07:00
dcbevent.h
dcbnl.h
devlink.h devlink: introduce line card device info infrastructure 2022-04-25 10:42:28 +01:00
dn_dev.h
dn_fib.h net: convert fib_treeref from int to refcount_t 2021-07-30 15:33:24 +02:00
dn_neigh.h
dn_nsp.h
dn_route.h
dn.h decnet: constify dev_addr passing 2021-10-13 09:40:46 -07:00
dsa.h net: dsa: pass extack to dsa_switch_ops :: port_mirror_add() 2022-03-17 17:42:47 -07:00
dsfield.h
dst_cache.h wireguard: device: reset peer src endpoint when netns exits 2021-11-29 19:50:45 -08:00
dst_metadata.h net: fix a memleak when uncloning an skb dst and its metadata 2022-02-09 11:41:47 +00:00
dst_ops.h
dst.h net: dst: add net device refcount tracking to dst_entry 2021-12-06 16:05:10 -08:00
erspan.h
esp.h esp: limit skb_page_frag_refill use to a single page 2022-04-13 10:16:11 +02:00
espintcp.h
ethoc.h
failover.h net: failover: add net device refcount tracker 2021-12-06 16:06:02 -08:00
fib_notifier.h
fib_rules.h fib: expand fib_rule_policy 2021-12-16 07:18:35 -08:00
firewire.h
flow_dissector.h flow_dissector: Add number of vlan tags dissector 2022-04-20 11:09:13 +01:00
flow_offload.h net/sched: add vlan push_eth and pop_eth action to the hardware IR 2022-03-16 19:59:36 -07:00
flow.h net: Add l3mdev index to flow struct and avoid oif reset for port devices 2022-03-15 20:20:02 -07:00
fou.h
fq_impl.h
fq.h
garp.h
gen_stats.h net: sched: Remove Qdisc::running sequence counter 2021-10-18 12:54:41 +01:00
genetlink.h
geneve.h
gre.h
gro_cells.h
gro.h net: gro: Fix a 'directive in macro's argument list' sparse warning 2022-02-18 11:00:25 +00:00
gtp.h gtp: Add support for checking GTP device type 2022-03-11 08:28:27 -08:00
gue.h
hwbm.h
icmp.h ipv6: ICMPV6: add response to ICMPV6 RFC 8335 PROBE messages 2021-06-28 14:29:45 -07:00
ieee80211_radiotap.h ieee80211: radiotap: fix -Wcast-qual warnings 2022-02-04 16:25:21 +01:00
ieee802154_netdev.h
if_inet6.h ipv6: fix locking issues with loops over idev->addr_list 2022-04-06 22:09:39 -07:00
ife.h
ila.h
inet6_connection_sock.h
inet6_hashtables.h
inet_common.h
inet_connection_sock.h tcp: Use BPF timeout setting for SYN ACK RTO 2022-02-02 14:45:18 +00:00
inet_dscp.h ipv6: Define dscp_t and stop taking ECN bits into account in fib6-rules 2022-02-07 20:12:45 -08:00
inet_ecn.h net: add skb_get_dsfield() helper 2021-10-15 11:33:08 +01:00
inet_frag.h net: ip: Handle delivery_time in ip defrag 2022-03-03 14:38:48 +00:00
inet_hashtables.h tcp: seq_file: Replace listening_hash with lhash2 2021-07-23 16:44:57 -07:00
inet_sock.h ipv4/raw: support binding to nonlocal addresses 2021-11-17 20:21:52 -08:00
inet_timewait_sock.h tcp/dccp: get rid of inet_twsk_purge() 2022-01-25 11:25:21 +00:00
inetpeer.h
ioam6.h treewide: Replace zero-length arrays with flexible-array members 2022-02-17 07:00:39 -06:00
ip6_checksum.h net: move gro definitions to include/net/gro.h 2021-11-16 13:16:54 +00:00
ip6_fib.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-02-17 11:44:20 -08:00
ip6_route.h ipv6: ip6_skb_dst_mtu() cleanups 2021-11-19 20:09:55 -08:00
ip6_tunnel.h ipv6: add net device refcount tracker to struct ip6_tnl 2021-12-06 16:05:11 -08:00
ip_fib.h ipv4: Use dscp_t in struct fib_entry_notifier_info 2022-04-11 17:37:50 -07:00
ip_tunnels.h net: Handle l3mdev in ip_tunnel_init_flow 2022-04-15 14:27:30 -07:00
ip_vs.h ipvs: add sysctl_run_estimation to support disable estimation 2021-10-07 19:52:58 +02:00
ip.h ipv4: Make ip_idents_reserve static 2022-01-31 11:33:10 +00:00
ipcomp.h
ipconfig.h
ipv6_frag.h net: don't include ndisc.h from ipv6.h 2022-02-04 14:15:11 -08:00
ipv6_stubs.h net: ipv6: add fib6_nh_release_dsts stub 2021-11-22 15:44:49 +00:00
ipv6.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-02-17 11:44:20 -08:00
iw_handler.h
kcm.h
l3mdev.h
lag.h
lapb.h net: lapb: Make "lapb_t1timer_running" able to detect an already running timer 2021-03-23 14:14:50 -07:00
lib80211.h
llc_c_ac.h
llc_c_ev.h
llc_c_st.h
llc_conn.h llc: add net device refcount tracker 2021-12-07 20:44:59 -08:00
llc_if.h llc/snap: constify dev_addr passing 2021-10-13 09:40:46 -07:00
llc_pdu.h net: llc: fix skb_over_panic 2021-07-27 13:05:56 +01:00
llc_s_ac.h
llc_s_ev.h
llc_s_st.h
llc_sap.h
llc.h llc: fix out-of-bound array index in llc_sk_dev_hash() 2021-11-07 19:25:29 +00:00
lwtunnel.h netfilter: add netfilter hooks to SRv6 data plane 2021-08-30 01:51:36 +02:00
mac80211.h mac80211: MBSSID beacon handling in AP mode 2022-03-15 11:36:26 +01:00
mac802154.h net: mac802154: Explain the use of ieee802154_wake/stop_queue() 2022-01-28 11:23:41 +01:00
macsec.h net: macsec: fix the length used to copy the key for offloading 2021-06-24 12:41:12 -07:00
mctp.h mctp: Use output netdev to allocate skb headroom 2022-04-01 12:04:15 +01:00
mctpdevice.h mctp: Pass flow data & flow release events to drivers 2021-10-29 13:23:51 +01:00
mip6.h
mld.h mld: add new workqueues for process mld events 2021-03-26 15:14:56 -07:00
mpls_iptunnel.h
mpls.h
mptcp.h mptcp: infinite mapping sending 2022-04-23 11:51:05 +01:00
mrp.h
ncsi.h
ndisc.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-03-03 11:55:12 -08:00
neighbour.h net, neigh: Do not trigger immediate probes on NUD_FAILED from neigh_managed_work 2022-02-02 20:30:18 -08:00
net_failover.h
net_namespace.h net: make net->dev_unreg_count atomic 2022-02-10 15:30:26 +00:00
net_ratelimit.h
net_trackers.h net: add networking namespace refcount tracker 2021-12-10 06:38:26 -08:00
netevent.h
netlabel.h
netlink.h net: netlink: add the case when nlh is NULL 2021-07-27 11:43:50 +01:00
netprio_cgroup.h
netrom.h
nexthop.h net: ipv4: Fix rtnexthop len when RTA_FLOW is present 2021-09-24 14:07:10 +01:00
nl802154.h net: ieee802154: handle iftypes as u32 2021-11-16 18:02:46 +01:00
nsh.h
p8022.h
page_pool.h net: page_pool: introduce ethtool stats 2022-04-15 10:43:47 +01:00
pie.h
ping.h net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
pkt_cls.h net/sched: act_api: Add extack to offload_act_setup() callback 2022-04-08 13:45:43 +01:00
pkt_sched.h net: sched: remove psched_tdiff_bounded() 2022-01-27 13:53:27 +00:00
pptp.h
protocol.h net: Remove the member netns_ok 2021-05-17 15:29:35 -07:00
psample.h psample: Add a fwd declaration for skbuff 2021-08-09 15:34:21 -07:00
psnap.h
raw.h
rawv6.h
red.h sch_red: fix off-by-one checks in red_check_params() 2021-03-25 17:40:43 -07:00
regulatory.h
request_sock.h tcp: Use BPF timeout setting for SYN ACK RTO 2022-02-02 14:45:18 +00:00
rose.h rose: constify dev_addr passing 2021-10-13 09:40:45 -07:00
route.h ipv4: Avoid using RTO_ONLINK with ip_route_connect(). 2022-04-22 13:06:03 +01:00
rpl.h
rsi_91x.h
rtnetlink.h net: rtnetlink: add bulk delete support flag 2022-04-13 12:46:26 +01:00
rtnh.h
sch_generic.h net: sched: remove qdisc_qlen_cpu() 2022-01-27 13:53:27 +00:00
scm.h
secure_seq.h
seg6_hmac.h
seg6_local.h
seg6.h udp6: Use Segment Routing Header for dest address if present 2022-01-04 12:17:35 +00:00
selftests.h net: selftest: fix build issue if INET is disabled 2021-04-28 14:06:45 -07:00
slhc_vj.h
smc.h
snmp.h
sock_reuseport.h tcp: Add reuseport_migrate_sock() to select a new listener. 2021-06-15 18:01:05 +02:00
sock.h net: generalize skb freeing deferral to per-cpu lists 2022-04-26 17:05:59 -07:00
Space.h wan: remove sbni/granch driver 2021-08-03 13:05:26 +01:00
stp.h
strparser.h tls: rx: don't store the decryption status in socket context 2022-04-08 11:49:08 +01:00
switchdev.h net: bridge: mst: Notify switchdev drivers of MST state changes 2022-03-17 16:49:58 -07:00
tcp_states.h
tcp.h net: generalize skb freeing deferral to per-cpu lists 2022-04-26 17:05:59 -07:00
timewait_sock.h
tipc.h
tls_toe.h
tls.h net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
transp_v6.h
tso.h
tun_proto.h
udp_tunnel.h
udp.h net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
udplite.h udplite: remove udplite_csum_outgoing() 2022-01-27 13:53:27 +00:00
vsock_addr.h
vxlan.h drivers: vxlan: vnifilter: per vni stats 2022-03-01 08:38:02 +00:00
wext.h
x25.h
x25device.h
xdp_priv.h xsk: Wipe out dead zero_copy_allocator declarations 2021-12-14 00:24:24 +01:00
xdp_sock_drv.h i40e: xsk: Move tmp desc array from driver to pool 2022-01-27 17:25:32 +01:00
xdp_sock.h net: Don't include filter.h from net/sock.h 2021-12-29 08:48:14 -08:00
xdp.h net: veth: Account total xdp_frame len running ndo_xdp_xmit 2022-03-17 20:33:52 +01:00
xfrm.h Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ 2022-03-19 14:49:08 +00:00
xsk_buff_pool.h i40e: xsk: Move tmp desc array from driver to pool 2022-01-27 17:25:32 +01:00