linux/net/ipv6
Eric Dumazet 68822bdf76 net: generalize skb freeing deferral to per-cpu lists
Logic added in commit f35f821935 ("tcp: defer skb freeing after socket
lock is released") helped bulk TCP flows to move the cost of skbs
frees outside of critical section where socket lock was held.

But for RPC traffic, or hosts with RFS enabled, the solution is far from
being ideal.

For RPC traffic, recvmsg() has to return to user space right after
skb payload has been consumed, meaning that BH handler has no chance
to pick the skb before recvmsg() thread. This issue is more visible
with BIG TCP, as more RPC fit one skb.

For RFS, even if BH handler picks the skbs, they are still picked
from the cpu on which user thread is running.

Ideally, it is better to free the skbs (and associated page frags)
on the cpu that originally allocated them.

This patch removes the per socket anchor (sk->defer_list) and
instead uses a per-cpu list, which will hold more skbs per round.

This new per-cpu list is drained at the end of net_action_rx(),
after incoming packets have been processed, to lower latencies.

In normal conditions, skbs are added to the per-cpu list with
no further action. In the (unlikely) cases where the cpu does not
run net_action_rx() handler fast enough, we use an IPI to raise
NET_RX_SOFTIRQ on the remote cpu.

Also, we do not bother draining the per-cpu list from dev_cpu_dead()
This is because skbs in this list have no requirement on how fast
they should be freed.

Note that we can add in the future a small per-cpu cache
if we see any contention on sd->defer_lock.

Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
page recycling strategy used by NIC driver (its page pool capacity
being too small compared to number of skbs/pages held in sockets
receive queues)

Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test.
These conditions can happen in more general production workload.

10 runs of one TCP_STREAM flow

Before:
Average throughput: 49685 Mbit.

Kernel profiles on cpu running user thread recvmsg() show high cost for
skb freeing related functions (*)

    57.81%  [kernel]       [k] copy_user_enhanced_fast_string
(*) 12.87%  [kernel]       [k] skb_release_data
(*)  4.25%  [kernel]       [k] __free_one_page
(*)  3.57%  [kernel]       [k] __list_del_entry_valid
     1.85%  [kernel]       [k] __netif_receive_skb_core
     1.60%  [kernel]       [k] __skb_datagram_iter
(*)  1.59%  [kernel]       [k] free_unref_page_commit
(*)  1.16%  [kernel]       [k] __slab_free
     1.16%  [kernel]       [k] _copy_to_iter
(*)  1.01%  [kernel]       [k] kfree
(*)  0.88%  [kernel]       [k] free_unref_page
     0.57%  [kernel]       [k] ip6_rcv_core
     0.55%  [kernel]       [k] ip6t_do_table
     0.54%  [kernel]       [k] flush_smp_call_function_queue
(*)  0.54%  [kernel]       [k] free_pcppages_bulk
     0.51%  [kernel]       [k] llist_reverse_order
     0.38%  [kernel]       [k] process_backlog
(*)  0.38%  [kernel]       [k] free_pcp_prepare
     0.37%  [kernel]       [k] tcp_recvmsg_locked
(*)  0.37%  [kernel]       [k] __list_add_valid
     0.34%  [kernel]       [k] sock_rfree
     0.34%  [kernel]       [k] _raw_spin_lock_irq
(*)  0.33%  [kernel]       [k] __page_cache_release
     0.33%  [kernel]       [k] tcp_v6_rcv
(*)  0.33%  [kernel]       [k] __put_page
(*)  0.29%  [kernel]       [k] __mod_zone_page_state
     0.27%  [kernel]       [k] _raw_spin_lock

After patch:
Average throughput: 73076 Mbit.

Kernel profiles on cpu running user thread recvmsg() looks better:

    81.35%  [kernel]       [k] copy_user_enhanced_fast_string
     1.95%  [kernel]       [k] _copy_to_iter
     1.95%  [kernel]       [k] __skb_datagram_iter
     1.27%  [kernel]       [k] __netif_receive_skb_core
     1.03%  [kernel]       [k] ip6t_do_table
     0.60%  [kernel]       [k] sock_rfree
     0.50%  [kernel]       [k] tcp_v6_rcv
     0.47%  [kernel]       [k] ip6_rcv_core
     0.45%  [kernel]       [k] read_tsc
     0.44%  [kernel]       [k] _raw_spin_lock_irqsave
     0.37%  [kernel]       [k] _raw_spin_lock
     0.37%  [kernel]       [k] native_irq_return_iret
     0.33%  [kernel]       [k] __inet6_lookup_established
     0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
     0.29%  [kernel]       [k] tcp_rcv_established
     0.29%  [kernel]       [k] llist_reverse_order

v2: kdoc issue (kernel bots)
    do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
    replace the sk_buff_head with a single-linked list (Jakub)
    add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-26 17:05:59 -07:00
..
ila net: ipv6: check return value of rhashtable_init 2021-09-28 12:59:24 +01:00
netfilter netfilter: nft_fib: reverse path filter for policy-based routing on iif 2022-04-11 12:10:09 +02:00
addrconf_core.c ipv6: add net device refcount tracker to struct inet6_dev 2021-12-06 16:05:11 -08:00
addrconf.c net/ipv6: Enforce limits for accept_unsolicited_na sysctl 2022-04-22 10:32:54 +01:00
addrlabel.c ipv6: addrlabel: fix possible memory leak in ip6addrlbl_net_init 2020-11-25 11:20:16 -08:00
af_inet6.c ipv6: Use ipv6_only_sock() helper in condition. 2022-04-22 12:47:50 +01:00
ah6.c ipv6: ah6: use swap() to make code cleaner 2021-11-18 12:00:15 +00:00
anycast.c
calipso.c cipso,calipso: resolve a number of problems with the DOI refcounts 2021-03-04 15:26:57 -08:00
datagram.c ipv6: Remove __ipv6_only_sock(). 2022-04-22 12:47:50 +01:00
esp6_offload.c net: Fix esp GSO on inter address family tunnels. 2022-03-07 13:14:04 +01:00
esp6.c esp: limit skb_page_frag_refill use to a single page 2022-04-13 10:16:11 +02:00
exthdrs_core.c
exthdrs_offload.c
exthdrs.c net: ipv6: add skb drop reasons to TLV parse 2022-04-13 13:09:57 +01:00
fib6_notifier.c
fib6_rules.c ipv6: change fib6_rules_net_exit() to batch mode 2022-02-08 20:41:34 -08:00
fou6.c
icmp.c net: icmp: introduce function icmpv6_param_prob_reason() 2022-04-13 13:09:57 +01:00
inet6_connection_sock.c lsm,selinux: pass flowi_common instead of flowi to the LSM hooks 2020-11-23 18:36:21 -05:00
inet6_hashtables.c tcp: Don't acquire inet_listen_hashbucket::lock with disabled BH. 2022-02-09 21:28:36 -08:00
ioam6_iptunnel.c ipv6: ioam: Insertion frequency in lwtunnel output 2022-02-04 20:24:45 -08:00
ioam6.c net: ipv6: Get rcv timestamp if needed when handling hop-by-hop IOAM option 2022-03-03 14:38:48 +00:00
ip6_checksum.c
ip6_fib.c ipv6: annotate accesses to fn->fn_sernum 2022-01-20 20:18:37 -08:00
ip6_flowlabel.c ipv6: per-netns exclusive flowlabel checks 2022-02-16 20:37:47 -08:00
ip6_gre.c ip6_gre: Fix skb_under_panic in __gre6_xmit() 2022-04-15 11:48:34 +01:00
ip6_icmp.c net: icmp: pass zeroed opts from icmp{,v6}_ndo_send before sending 2021-02-23 11:29:52 -08:00
ip6_input.c ipv6: fix NULL deref in ip6_rcv_core() 2022-04-15 14:05:18 -07:00
ip6_offload.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-02-24 17:54:25 -08:00
ip6_offload.h
ip6_output.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-04-15 09:26:00 +02:00
ip6_tunnel.c ip6_tunnel: Remove duplicate assignments 2022-04-06 15:31:15 +01:00
ip6_udp_tunnel.c
ip6_vti.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-12-30 12:12:12 -08:00
ip6mr.c net: ipv6mr: fix unused variable warning with CONFIG_IPV6_PIMSM_V2=n 2022-04-06 15:14:30 +01:00
ipcomp6.c xfrm: remove hdr_offset indirection 2021-06-11 14:48:50 +02:00
ipv6_sockglue.c ipv6: annotate some data-races around sk->sk_prot 2022-02-18 11:53:28 +00:00
Kconfig ipv6: ioam: Add support for the ip6ip6 encapsulation 2021-10-04 12:53:35 +01:00
Makefile net: ipv6: use ipv6-y directly instead of ipv6-objs 2021-09-28 13:13:40 +01:00
mcast_snoop.c net: bridge: mcast: fix broken length + header check for MRDv6 Adv. 2021-04-27 14:02:06 -07:00
mcast.c ipv6: fix skb drops in igmp6_event_query() and igmp6_event_report() 2022-03-03 09:47:06 -08:00
mip6.c xfrm: ipv6: move mip6_rthdr_offset into xfrm core 2021-06-11 14:48:50 +02:00
ndisc.c net/ipv6: Introduce accept_unsolicited_na knob to implement router-side changes for RFC9131 2022-04-17 13:23:49 +01:00
netfilter.c net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp 2022-03-03 14:38:48 +00:00
output_core.c ipv6: use prandom_u32() for ID generation 2021-05-31 22:12:08 -07:00
ping.c net: ping6: support setting basic SOL_IPV6 options via cmsg 2022-02-17 14:22:09 +00:00
proc.c
protocol.c
raw.c net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
reassembly.c net: ipv6: Handle delivery_time in ipv6 defrag 2022-03-03 14:38:48 +00:00
route.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-04-22 09:56:00 +02:00
rpl_iptunnel.c net: ipv6: rpl_iptunnel: simplify the return expression of rpl_do_srh() 2020-12-08 16:22:54 -08:00
rpl.c
seg6_hmac.c net: ipv6: check return value of rhashtable_init 2021-09-28 12:59:24 +01:00
seg6_iptunnel.c seg6: fix the iif in the IPv6 socket control block 2021-12-09 07:55:42 -08:00
seg6_local.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-01-05 14:36:10 -08:00
seg6.c icmp: ICMPV6: Examine invoking packet for Segment Route Headers. 2022-01-04 12:17:35 +00:00
sit.c sit: allow encapsulated IPv6 traffic to be delivered locally 2022-01-12 13:56:07 -08:00
syncookies.c net: align static siphash keys 2021-11-16 19:07:54 -08:00
sysctl_net_ipv6.c ipv6: ioam: Data plane support for Pre-allocated Trace 2021-07-21 08:14:33 -07:00
tcp_ipv6.c net: generalize skb freeing deferral to per-cpu lists 2022-04-26 17:05:59 -07:00
tcpv6_offload.c net: move gro definitions to include/net/gro.h 2021-11-16 13:16:54 +00:00
tunnel6.c
udp_impl.h net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
udp_offload.c gro: remove rcu_read_lock/rcu_read_unlock from gro_receive handlers 2021-11-24 17:21:42 -08:00
udp.c ipv6: Remove __ipv6_only_sock(). 2022-04-22 12:47:50 +01:00
udplite.c
xfrm6_input.c
xfrm6_output.c xfrm: fix tunnel model fragmentation behavior 2022-03-01 12:08:40 +01:00
xfrm6_policy.c net: Add l3mdev index to flow struct and avoid oif reset for port devices 2022-03-15 20:20:02 -07:00
xfrm6_protocol.c
xfrm6_state.c
xfrm6_tunnel.c xfrm: remove description from xfrm_type struct 2021-06-09 09:38:52 +02:00