linux/include/net
Kuniyuki Iwashima d1e5e6408b tcp: Introduce optional per-netns ehash.
The more sockets we have in the hash table, the longer we spend looking
up the socket.  While running a number of small workloads on the same
host, they penalise each other and cause performance degradation.

The root cause might be a single workload that consumes much more
resources than the others.  It often happens on a cloud service where
different workloads share the same computing resource.

On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
entries), after running iperf3 in different netns, creating 24Mi sockets
without data transfer in the root netns causes about 10% performance
regression for the iperf3's connection.

 thash_entries		sockets		length		Gbps
	524288		      1		     1		50.7
			   24Mi		    48		45.1

It is basically related to the length of the list of each hash bucket.
For testing purposes to see how performance drops along the length,
I set 131072 (1Mi / 8) to thash_entries, and here's the result.

 thash_entries		sockets		length		Gbps
        131072		      1		     1		50.7
			    1Mi		     8		49.9
			    2Mi		    16		48.9
			    4Mi		    32		47.3
			    8Mi		    64		44.6
			   16Mi		   128		40.6
			   24Mi		   192		36.3
			   32Mi		   256		32.5
			   40Mi		   320		27.0
			   48Mi		   384		25.0

To resolve the socket lookup degradation, we introduce an optional
per-netns hash table for TCP, but it's just ehash, and we still share
the global bhash, bhash2 and lhash2.

With a smaller ehash, we can look up non-listener sockets faster and
isolate such noisy neighbours.  In addition, we can reduce lock contention.

We can control the ehash size by a new sysctl knob.  However, depending
on workloads, it will require very sensitive tuning, so we disable the
feature by default (net.ipv4.tcp_child_ehash_entries == 0).  Moreover,
we can fall back to using the global ehash in case we fail to allocate
enough memory for a new ehash.  The maximum size is 16Mi, which is large
enough that even if we have 48Mi sockets, the average list length is 3,
and regression would be less than 1%.

We can check the current ehash size by another read-only sysctl knob,
net.ipv4.tcp_ehash_entries.  A negative value means the netns shares
the global ehash (per-netns ehash is disabled or failed to allocate
memory).

  # dmesg | cut -d ' ' -f 5- | grep "established hash"
  TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)

  # sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = 524288  # can be changed by thash_entries

  # sysctl net.ipv4.tcp_child_ehash_entries
  net.ipv4.tcp_child_ehash_entries = 0  # disabled by default

  # ip netns add test1
  # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = -524288  # share the global ehash

  # sysctl -w net.ipv4.tcp_child_ehash_entries=100
  net.ipv4.tcp_child_ehash_entries = 100

  # ip netns add test2
  # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = 128  # own a per-netns ehash with 2^n buckets

When more than two processes in the same netns create per-netns ehash
concurrently with different sizes, we need to guarantee the size in
one of the following ways:

  1) Share the global ehash and create per-netns ehash

  First, unshare() with tcp_child_ehash_entries==0.  It creates dedicated
  netns sysctl knobs where we can safely change tcp_child_ehash_entries
  and clone()/unshare() to create a per-netns ehash.

  2) Control write on sysctl by BPF

  We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on
  sysctl knobs.

Note that the global ehash allocated at the boot time is spread over
available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate
pages for each per-netns ehash depending on the current process's NUMA
policy.  By default, the allocation is done in the local node only, so
the per-netns hash table could fully reside on a random node.  Thus,
depending on the NUMA policy the netns is created with and the CPU the
current thread is running on, we could see some performance differences
for highly optimised networking applications.

Note also that the default values of two sysctl knobs depend on the ehash
size and should be tuned carefully:

  tcp_max_tw_buckets  : tcp_child_ehash_entries / 2
  tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)

As a bonus, we can dismantle netns faster.  Currently, while destroying
netns, we call inet_twsk_purge(), which walks through the global ehash.
It can be potentially big because it can have many sockets other than
TIME_WAIT in all netns.  Splitting ehash changes that situation, where
it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets
in each netns.

With regard to this, we do not free the per-netns ehash in inet_twsk_kill()
to avoid UAF while iterating the per-netns ehash in inet_twsk_purge().
Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to
keep it protocol-family-independent.

In the future, we could optimise ehash lookup/iteration further by removing
netns comparison for the per-netns ehash.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-09-20 10:21:50 -07:00
..
9p 9p: Add client parameter to p9_req_put() 2022-07-09 14:38:35 +09:00
bluetooth Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-07-28 18:21:16 -07:00
caif net: remove the caif_hsi driver 2021-07-01 13:19:48 -07:00
iucv net/af_iucv: Use struct_group() to zero struct iucv_sock region 2021-11-19 11:52:25 +00:00
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next 2022-09-09 08:08:51 +01:00
netns tcp: Introduce optional per-netns ehash. 2022-09-20 10:21:50 -07:00
nfc NFC: add NCI_UNREG flag to eliminate the race 2021-11-17 20:17:05 -08:00
phonet net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
sctp net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
tc_act Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-05-12 16:15:30 -07:00
6lowpan.h
act_api.h net: sched: act: move global static variable net_id to tc_action_ops 2022-09-09 08:24:41 +01:00
addrconf.h ipv6/addrconf: fix a null-ptr-deref bug for ip6_ptr 2022-07-28 10:42:44 -07:00
af_ieee802154.h
af_rxrpc.h rxrpc: Remove rxrpc_get_reply_time() which is no longer used 2022-09-01 11:44:13 +01:00
af_unix.h af_unix: Remove unix_table_locks. 2022-06-22 12:59:43 +01:00
af_vsock.h vsock: add API call for data ready 2022-08-23 10:43:11 +02:00
ah.h
amt.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
arp.h ipv4: Invalidate neighbour for broadcast address upon address addition 2022-02-21 11:44:30 +00:00
atmclip.h
ax25.h ax25: fix incorrect dev_tracker usage 2022-07-28 22:06:15 -07:00
ax88796.h ax88796: Fix some typo in a comment 2022-08-09 22:14:02 -07:00
bareudp.h bareudp: Move definition of struct bareudp_conf to bareudp.c 2021-12-13 12:34:09 +00:00
bond_3ad.h bonding: 3ad: make ad_ticks_per_sec a const 2022-08-22 18:30:24 -07:00
bond_alb.h bonding: make tx_rebalance_counter an atomic 2021-12-03 14:16:48 +00:00
bond_options.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
bonding.h net: bonding: replace dev_trans_start() with the jiffies of the last ARP/NS 2022-08-03 19:20:12 -07:00
bpf_sk_storage.h bpf: struct sock is declared twice in bpf_sk_storage header 2021-03-26 17:43:55 +01:00
busy_poll.h net: Fix a data-race around sysctl_net_busy_poll. 2022-08-24 13:46:58 +01:00
calipso.h
cfg80211-wext.h
cfg80211.h wifi: cfg80211: Add link_id to cfg80211_ch_switch_started_notify() 2022-08-25 11:07:26 +02:00
cfg802154.h net: wrap the wireless pointers in struct net_device in an ifdef 2022-05-22 21:51:54 +01:00
checksum.h powerpc/net: Implement powerpc specific csum_shift() to remove branch 2022-03-11 10:57:22 +00:00
cipso_ipv4.h
cls_cgroup.h
codel_impl.h codel: remove unnecessary sock.h include 2021-12-22 15:03:47 -08:00
codel_qdisc.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
codel.h codel: remove unnecessary pkt_sched.h include 2021-12-22 15:03:51 -08:00
compat.h net: copy from user before calling __get_compat_msghdr 2022-07-24 18:39:17 -06:00
datalink.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
dcbevent.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
dcbnl.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
devlink.h net: devlink: stub port params cmds for they are unused internally 2022-08-30 13:19:47 +02:00
dropreason.h net: skb: export skb drop reaons to user by TRACE_DEFINE_ENUM 2022-09-07 15:28:08 +01:00
dsa.h net: dsa: allow masters to join a LAG 2022-09-20 10:32:36 +02:00
dsfield.h
dst_cache.h wireguard: device: reset peer src endpoint when netns exits 2021-11-29 19:50:45 -08:00
dst_metadata.h net/macsec: Add MACsec skb_metadata_dst Tx Data path support 2022-09-07 14:02:08 +01:00
dst_ops.h
dst.h net: dst: add net device refcount tracking to dst_entry 2021-12-06 16:05:10 -08:00
erspan.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
esp.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
espintcp.h
ethoc.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
failover.h net: failover: add net device refcount tracker 2021-12-06 16:06:02 -08:00
fib_notifier.h
fib_rules.h fib: expand fib_rule_policy 2021-12-16 07:18:35 -08:00
firewire.h firewire: net: Make use of get_unaligned_be48(), put_unaligned_be48() 2022-07-28 22:21:54 -07:00
flow_dissector.h flow_dissector: Add L2TPv3 dissectors 2022-09-20 09:13:38 +02:00
flow_offload.h flow_offload: Introduce flow_match_l2tpv3 2022-09-20 09:13:38 +02:00
flow.h net: Add l3mdev index to flow struct and avoid oif reset for port devices 2022-03-15 20:20:02 -07:00
fou.h
fq_impl.h net/fq_impl: Use the bitmap API to allocate bitmaps 2022-07-11 19:49:38 -07:00
fq.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
garp.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
gen_stats.h net: sched: Remove Qdisc::running sequence counter 2021-10-18 12:54:41 +01:00
genetlink.h netlink: add helpers for extack attr presence checking 2022-08-30 12:20:43 +02:00
geneve.h
gre.h ip_gre: add csum offload support for gre header 2021-01-29 20:39:14 -08:00
gro_cells.h
gro.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-08-25 16:07:42 -07:00
gtp.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
gue.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
hwbm.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
icmp.h ipv6: ICMPV6: add response to ICMPV6 RFC 8335 PROBE messages 2021-06-28 14:29:45 -07:00
ieee80211_radiotap.h ieee80211: radiotap: fix -Wcast-qual warnings 2022-02-04 16:25:21 +01:00
ieee802154_netdev.h
if_inet6.h ipv6: fix locking issues with loops over idev->addr_list 2022-04-06 22:09:39 -07:00
ife.h
ila.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
inet6_connection_sock.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
inet6_hashtables.h net: allow unbound socket for packets in VRF when tcp_l3mdev_accept set 2022-07-29 11:58:54 +01:00
inet_common.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
inet_connection_sock.h net: Add a bhash2 table hashed by port and address 2022-08-24 19:30:07 -07:00
inet_dscp.h ipv6: Define dscp_t and stop taking ECN bits into account in fib6-rules 2022-02-07 20:12:45 -08:00
inet_ecn.h net: add skb_get_dsfield() helper 2021-10-15 11:33:08 +01:00
inet_frag.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
inet_hashtables.h tcp: Introduce optional per-netns ehash. 2022-09-20 10:21:50 -07:00
inet_sock.h net: allow unbound socket for packets in VRF when tcp_l3mdev_accept set 2022-07-29 11:58:54 +01:00
inet_timewait_sock.h Revert "tcp/dccp: get rid of inet_twsk_purge()" 2022-05-13 12:24:12 +01:00
inetpeer.h
ioam6.h treewide: Replace zero-length arrays with flexible-array members 2022-02-17 07:00:39 -06:00
ip6_checksum.h net: move gro definitions to include/net/gro.h 2021-11-16 13:16:54 +00:00
ip6_fib.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-02-17 11:44:20 -08:00
ip6_route.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
ip6_tunnel.h ip_gre, ip6_gre: Fix race condition on o_seqno in collect_md mode 2022-04-25 11:40:45 +01:00
ip_fib.h ipv4: Use dscp_t in struct fib_entry_notifier_info 2022-04-11 17:37:50 -07:00
ip_tunnels.h ip_tunnel: Respect tunnel key's "flow_flags" in IP tunnels 2022-08-18 21:18:28 +02:00
ip_vs.h ipvs: add sysctl_run_estimation to support disable estimation 2021-10-07 19:52:58 +02:00
ip.h bpf: Change bpf_getsockopt(SOL_IP) to reuse do_ip_getsockopt() 2022-09-02 20:34:32 -07:00
ipcomp.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
ipconfig.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
ipv6_frag.h net: don't include ndisc.h from ipv6.h 2022-02-04 14:15:11 -08:00
ipv6_stubs.h bpf: Change bpf_getsockopt(SOL_IPV6) to reuse do_ipv6_getsockopt() 2022-09-02 20:34:32 -07:00
ipv6.h bpf: Change bpf_getsockopt(SOL_IPV6) to reuse do_ipv6_getsockopt() 2022-09-02 20:34:32 -07:00
iw_handler.h
kcm.h
l3mdev.h
lag.h
lapb.h net: lapb: Make "lapb_t1timer_running" able to detect an already running timer 2021-03-23 14:14:50 -07:00
lib80211.h
llc_c_ac.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
llc_c_ev.h
llc_c_st.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
llc_conn.h llc: add net device refcount tracker 2021-12-07 20:44:59 -08:00
llc_if.h llc/snap: constify dev_addr passing 2021-10-13 09:40:46 -07:00
llc_pdu.h net: llc: fix skb_over_panic 2021-07-27 13:05:56 +01:00
llc_s_ac.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
llc_s_ev.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
llc_s_st.h add missing includes and forward declarations to networking includes under linux/ 2022-07-28 11:29:36 +02:00
llc_sap.h
llc.h llc: fix out-of-bound array index in llc_sk_dev_hash() 2021-11-07 19:25:29 +00:00
lwtunnel.h netfilter: add netfilter hooks to SRv6 data plane 2021-08-30 01:51:36 +02:00
mac80211.h wifi: mac80211: maintain link_id in link_sta 2022-08-25 10:41:25 +02:00
mac802154.h net: mac802154: Create an error helper for asynchronous offloading errors 2022-04-25 20:51:12 +02:00
macsec.h net/macsec: Move some code for sharing with various drivers that implements offload 2022-09-07 14:02:08 +01:00
mctp.h mctp: Use output netdev to allocate skb headroom 2022-04-01 12:04:15 +01:00
mctpdevice.h mctp: Pass flow data & flow release events to drivers 2021-10-29 13:23:51 +01:00
mip6.h
mld.h mld: add new workqueues for process mld events 2021-03-26 15:14:56 -07:00
mpls_iptunnel.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
mpls.h
mptcp.h mptcp, btf: Add struct mptcp_sock definition when CONFIG_MPTCP is disabled 2022-08-08 15:30:45 +02:00
mrp.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
ncsi.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
ndisc.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-03-03 11:55:12 -08:00
neighbour.h neighbour: make proxy_queue.qlen limit per-device 2022-08-15 11:25:09 +01:00
net_debug.h net: add CONFIG_DEBUG_NET 2022-05-11 12:43:10 +01:00
net_failover.h
net_namespace.h netfilter: nf_flow_table: count pending offload workqueue tasks 2022-07-11 16:25:14 +02:00
net_ratelimit.h
net_trackers.h net: add networking namespace refcount tracker 2021-12-10 06:38:26 -08:00
netevent.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
netlabel.h
netlink.h netlink: introduce NLA_POLICY_MAX_BE 2022-09-07 12:33:43 +01:00
netprio_cgroup.h
netrom.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
nexthop.h net: ipv4: Fix rtnexthop len when RTA_FLOW is present 2021-09-24 14:07:10 +01:00
nl802154.h net: ieee802154: Fix compilation error when CONFIG_IEEE802154_NL802154_EXPERIMENTAL is disabled 2022-09-02 19:59:08 -07:00
nsh.h
p8022.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
page_pool.h net: page_pool: introduce ethtool stats 2022-04-15 10:43:47 +01:00
pie.h
ping.h net: remove noblock parameter from recvmsg() entities 2022-04-12 15:00:25 +02:00
pkt_cls.h net/sched: remove return value of unregister_tcf_proto_ops 2022-07-13 14:46:59 +01:00
pkt_sched.h net: sched: remove the unused return value of unregister_qdisc 2022-08-16 19:37:06 -07:00
pptp.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
protocol.h tcp/udp: Make early_demux back namespacified. 2022-07-15 18:50:35 -07:00
psample.h psample: Add a fwd declaration for skbuff 2021-08-09 15:34:21 -07:00
psnap.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
raw.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-07-14 15:27:35 -07:00
rawv6.h raw: convert raw sockets to RCU 2022-06-19 10:00:02 +01:00
red.h net: sched: gred/red: remove unused variables in struct red_stats 2022-08-31 19:39:53 -07:00
regulatory.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
request_sock.h tcp: Use BPF timeout setting for SYN ACK RTO 2022-02-02 14:45:18 +00:00
rose.h net: rose: add netdev ref tracker to 'struct rose_sock' 2022-08-01 11:59:23 -07:00
route.h Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2022-07-25 13:25:39 +01:00
rpl.h
rsi_91x.h
rtnetlink.h net: rtnetlink: add bulk delete support flag 2022-04-13 12:46:26 +01:00
rtnh.h
sch_generic.h net: sched: remove unnecessary init of qdisc skb head 2022-08-26 12:03:04 +01:00
scm.h
secure_seq.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
seg6_hmac.h
seg6_local.h
seg6.h udp6: Use Segment Routing Header for dest address if present 2022-01-04 12:17:35 +00:00
selftests.h net: selftest: fix build issue if INET is disabled 2021-04-28 14:06:45 -07:00
slhc_vj.h
smc.h net/smc: Pass on DMBE bit mask in IRQ handler 2022-07-27 13:24:42 +01:00
snmp.h
sock_reuseport.h tcp: Add reuseport_migrate_sock() to select a new listener. 2021-06-15 18:01:05 +02:00
sock.h Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2022-09-06 23:21:18 +02:00
Space.h wan: remove sbni/granch driver 2021-08-03 13:05:26 +01:00
stp.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
strparser.h tls: rx: remove the message decrypted tracking 2022-07-18 11:24:10 +01:00
switchdev.h net: switchdev: add reminder near struct switchdev_notifier_fdb_info 2022-06-29 20:37:36 -07:00
tcp_states.h
tcp.h tcp: Save unnecessary inet_twsk_purge() calls. 2022-09-20 10:21:50 -07:00
timewait_sock.h
tipc.h
tls_toe.h
tls.h net/tls: Use RCU API to access tls_ctx->netdev 2022-08-10 22:58:43 -07:00
transp_v6.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
tso.h
tun_proto.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
udp_tunnel.h rxrpc: Fix ICMP/ICMP6 error handling 2022-09-01 11:42:12 +01:00
udp.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-07-21 13:03:39 -07:00
udplite.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
vsock_addr.h
vxlan.h drivers: vxlan: vnifilter: per vni stats 2022-03-01 08:38:02 +00:00
wext.h
x25.h
x25device.h
xdp_priv.h net: add missing includes and forward declarations under net/ 2022-07-22 12:53:22 +01:00
xdp_sock_drv.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-08-03 09:04:55 +02:00
xdp_sock.h net: Don't include filter.h from net/sock.h 2021-12-29 08:48:14 -08:00
xdp.h net: veth: Account total xdp_frame len running ndo_xdp_xmit 2022-03-17 20:33:52 +01:00
xfrm.h Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2022-07-25 13:25:39 +01:00
xsk_buff_pool.h xsk: Fix possible crash when multiple sockets are created 2022-04-26 16:19:54 +02:00