linux


Kuniyuki Iwashima d1e5e6408b tcp: Introduce optional per-netns ehash.

The more sockets we have in the hash table, the longer we spend looking up the socket. While running a number of small workloads on the same host, they penalise each other and cause performance degradation.

The root cause might be a single workload that consumes much more resources than the others. It often happens on a cloud service where different workloads share the same computing resource.

On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash entries), after running iperf3 in different netns, creating 24Mi sockets without data transfer in the root netns causes about 10% performance regression for the iperf3's connection.

    thash_entries   sockets   length   Gbps
           524288         1        1   50.7
                       24Mi       48   45.1

It is basically related to the length of the list of each hash bucket. For testing purposes to see how performance drops along the length, I set 131072 (1Mi / 8) to thash_entries, and here's the result.

    thash_entries   sockets   length   Gbps
           131072         1        1   50.7
                        1Mi        8   49.9
                        2Mi       16   48.9
                        4Mi       32   47.3
                        8Mi       64   44.6
                       16Mi      128   40.6
                       24Mi      192   36.3
                       32Mi      256   32.5
                       40Mi      320   27.0
                       48Mi      384   25.0

To resolve the socket lookup degradation, we introduce an optional per-netns hash table for TCP, but it's just ehash, and we still share the global bhash, bhash2 and lhash2.

With a smaller ehash, we can look up non-listener sockets faster and isolate such noisy neighbours. In addition, we can reduce lock contention.

We can control the ehash size by a new sysctl knob. However, depending on workloads, it will require very sensitive tuning, so we disable the feature by default (net.ipv4.tcp_child_ehash_entries == 0). Moreover, we can fall back to using the global ehash in case we fail to allocate enough memory for a new ehash. The maximum size is 16Mi, which is large enough that even if we have 48Mi sockets, the average list length is 3, and regression would be less than 1%.

We can check the current ehash size by another read-only sysctl knob, net.ipv4.tcp_ehash_entries. A negative value means the netns shares the global ehash (per-netns ehash is disabled or failed to allocate memory).

    # dmesg | cut -d ' ' -f 5- | grep "established hash"
    TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)

    # sysctl net.ipv4.tcp_ehash_entries
    net.ipv4.tcp_ehash_entries = 524288  # can be changed by thash_entries

    # sysctl net.ipv4.tcp_child_ehash_entries
    net.ipv4.tcp_child_ehash_entries = 0  # disabled by default

    # ip netns add test1
    # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
    net.ipv4.tcp_ehash_entries = -524288  # share the global ehash

    # sysctl -w net.ipv4.tcp_child_ehash_entries=100
    net.ipv4.tcp_child_ehash_entries = 100

    # ip netns add test2
    # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
    net.ipv4.tcp_ehash_entries = 128  # own a per-netns ehash with 2^n buckets

When more than two processes in the same netns create per-netns ehash concurrently with different sizes, we need to guarantee the size in one of the following ways:

  1) Share the global ehash and create per-netns ehash

     First, unshare() with tcp_child_ehash_entries==0. It creates dedicated netns sysctl knobs where we can safely change tcp_child_ehash_entries and clone()/unshare() to create a per-netns ehash.

  2) Control write on sysctl by BPF

     We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on sysctl knobs.

Note that the global ehash allocated at the boot time is spread over available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate pages for each per-netns ehash depending on the current process's NUMA policy. By default, the allocation is done in the local node only, so the per-netns hash table could fully reside on a random node. Thus, depending on the NUMA policy the netns is created with and the CPU the current thread is running on, we could see some performance differences for highly optimised networking applications.

Note also that the default values of two sysctl knobs depend on the ehash size and should be tuned carefully:

    tcp_max_tw_buckets  : tcp_child_ehash_entries / 2
    tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)

As a bonus, we can dismantle netns faster. Currently, while destroying netns, we call inet_twsk_purge(), which walks through the global ehash. It can be potentially big because it can have many sockets other than TIME_WAIT in all netns. Splitting ehash changes that situation, where it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets in each netns.

With regard to this, we do not free the per-netns ehash in inet_twsk_kill() to avoid UAF while iterating the per-netns ehash in inet_twsk_purge(). Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to keep it protocol-family-independent.

In the future, we could optimise ehash lookup/iteration further by removing netns comparison for the per-netns ehash.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2022-09-20 10:21:50 -07:00
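
The sizing arithmetic quoted in the commit message can be checked with a short sketch. This is illustrative Python, not kernel code: the helper names (`roundup_pow_of_two`, `average_chain_length`, `default_child_limits`) are hypothetical, the power-of-two rounding is inferred from the `100 -> 128` example in the message, and the two default formulas are applied to whatever entry count is passed in, which is an assumption about which value the kernel uses.

```python
MI = 1 << 20  # "Mi" as used in the commit message tables (2^20)


def roundup_pow_of_two(n: int) -> int:
    """Smallest power of two >= n, for n >= 1 (assumed rounding rule)."""
    return 1 << (n - 1).bit_length()


def average_chain_length(sockets: int, ehash_entries: int) -> float:
    """Average number of sockets per hash bucket."""
    return sockets / ehash_entries


def default_child_limits(ehash_entries: int) -> dict:
    """Defaults quoted in the commit message, applied to the given size."""
    return {
        "tcp_max_tw_buckets": ehash_entries // 2,
        "tcp_max_syn_backlog": max(128, ehash_entries // 128),
    }


if __name__ == "__main__":
    # tcp_child_ehash_entries=100 is reported as 128 buckets in the child netns.
    print(roundup_pow_of_two(100))
    # 24Mi sockets in the 524288-entry global ehash.
    print(average_chain_length(24 * MI, 524288))
    # 48Mi sockets in a maximum-size (16Mi) per-netns ehash.
    print(average_chain_length(48 * MI, 16 * MI))
    # Derived defaults for a 128-bucket per-netns ehash.
    print(default_child_limits(128))
```

The chain lengths computed this way match the `length` column in the tables above (for example, 1Mi sockets over 131072 entries gives 8), which is why a small per-netns table only helps when the netns holds correspondingly few sockets.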
..
caif	tty: cumulate and document tty_struct::flow* members	2021-05-13 16:57:16 +02:00
device_drivers	docs: networking: device drivers: flexcan: fix invalid email	2022-09-06 08:32:12 +02:00
devlink	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	2022-09-01 12:58:02 -07:00
dsa	docs: net: dsa: update information about multiple CPU ports	2022-09-20 10:32:36 +02:00
mac80211_hwsim
6lowpan.rst
6pack.rst
af_xdp.rst	doc, af_xdp: Fix bind flags option typo	2021-07-12 16:55:01 +02:00
alias.rst
arcnet-hardware.rst
arcnet.rst	Documentation: networking: arcnet: drop doubled word	2020-07-04 17:46:21 -07:00
atm.rst
ax25.rst	Documentation: networking: ax25: drop doubled word	2020-07-04 17:46:21 -07:00
bareudp.rst	Documentation: bareudp: Corrected description of bareudp module.	2020-07-28 17:53:03 -07:00
batman-adv.rst	batman-adv: Move IRC channel to hackint.org	2021-08-08 20:05:46 +02:00
bonding.rst	Documentation: bonding: clarify supported modes for tlb_dynamic_lb	2022-08-30 23:17:54 -07:00
bridge.rst
can_ucan_protocol.rst	Documentation: networking: can_ucan_protocol: drop doubled words	2020-07-04 17:46:21 -07:00
can.rst	can: Break loopback loop on loopback documentation	2022-06-11 22:40:13 +02:00
cdc_mbim.rst
checksum-offloads.rst
dccp.rst	net: dccp: Add SIOCOUTQ IOCTL support (send buffer fill)	2020-07-22 17:00:37 -07:00
dctcp.rst
dns_resolver.rst
driver.rst	Documentation: networking: correct possessive "its"	2022-08-31 12:36:08 -07:00
eql.rst
ethtool-netlink.rst	net: ethtool: extend ringparam set/get APIs for tx_push	2022-04-15 11:41:35 -07:00
failover.rst
fib_trie.rst
filter.rst	bpf, docs: Split general purpose eBPF documentation out of filter.rst	2021-11-30 10:52:11 -08:00
gen_stats.rst
generic_netlink.rst
generic-hdlc.rst
gtp.rst
ieee802154.rst	docs: net: ieee802154.rst: fix C expressions	2020-10-15 07:49:41 +02:00
ila.rst
index.rst	Remove DECnet support from kernel	2022-08-22 14:26:30 +01:00
ioam6-sysctl.rst	ipv6: ioam: Documentation for new IOAM sysctls	2021-07-21 08:14:33 -07:00
ip_dynaddr.rst
ip-sysctl.rst	tcp: Introduce optional per-netns ehash.	2022-09-20 10:21:50 -07:00
ipddp.rst
ipsec.rst
ipv6.rst
ipvlan.rst	Documentation: networking: correct possessive "its"	2022-08-31 12:36:08 -07:00
ipvs-sysctl.rst	netfilter: ipvs: Fix reuse connection if RS weight is 0	2021-11-08 11:42:47 +01:00
j1939.rst	can: j1939: add tables for the CAN identifier and its fields	2020-11-20 09:43:29 +01:00
kapi.rst	wimax: move out to staging	2020-10-29 19:27:45 +01:00
kcm.rst
l2tp.rst	Documentation: networking: correct possessive "its"	2022-08-31 12:36:08 -07:00
lapb-module.rst
mac80211-auth-assoc-deauth.txt
mac80211-injection.rst	doc: networking: wireless: fix wiki website url	2020-06-08 10:05:53 +02:00
mctp.rst	mctp: Add SIOCMCTP{ALLOC,DROP}TAG ioctls for tag control	2022-02-09 12:00:11 +00:00
mpls-sysctl.rst
mptcp-sysctl.rst	mptcp: Add a per-namespace sysctl to set the default path manager type	2022-04-29 17:25:14 -07:00
msg_zerocopy.rst	docs: use the lore redirector everywhere	2021-10-12 13:58:19 -06:00
multiqueue.rst
net_dim.rst
net_failover.rst	Documentation: networking: net_failover: Fix documentation	2021-11-17 13:59:49 +00:00
netconsole.rst
netdev-features.rst	net: hsr: add offloading support	2021-02-11 13:24:44 -08:00
netdevices.rst	net: bonding: move ioctl handling to private ndo operation	2021-07-27 20:11:45 +01:00
netfilter-sysctl.rst
netif-msg.rst
nexthop-group-resilient.rst	Documentation: net: Document resilient next-hop groups	2021-03-29 13:51:38 -07:00
nf_conntrack-sysctl.rst	netfilter: conntrack: add nf_conntrack_events autodetect mode	2022-05-13 18:56:28 +02:00
nf_flowtable.rst	docs: nf_flowtable: fix compilation and warnings	2021-03-25 17:42:02 -07:00
nfc.rst
openvswitch.rst
operstates.rst	docs: operstates: document IF_OPER_TESTING	2021-08-02 15:16:04 +01:00
packet_mmap.rst	docs: networking: Replace strncpy() with strscpy()	2021-06-04 11:21:43 -06:00
page_pool.rst	Documentation: update networking/page_pool.rst	2022-03-03 09:55:28 +00:00
phonet.rst
phy.rst	net: phy: Add 1000BASE-KX interface mode	2022-09-05 14:30:42 +01:00
pktgen.rst	pktgen: document the latest pktgen usage options	2021-08-25 13:44:30 +01:00
plip.rst
ppp_generic.rst	docs: update ppp_generic.rst to document new ioctls	2020-12-10 13:57:36 -08:00
proc_net_tcp.rst
radiotap-headers.rst
rds.rst	Doc: networking: Fix the title's Sphinx overline in rds.rst	2021-11-29 15:18:21 -07:00
regulatory.rst	doc: networking: wireless: fix wiki website url	2020-06-08 10:05:53 +02:00
rxrpc.rst	rxrpc: Remove rxrpc_get_reply_time() which is no longer used	2022-09-01 11:44:13 +01:00
scaling.rst	docs: networking: update XPS to account for netif_set_xps_queue	2020-10-13 16:21:54 -07:00
sctp.rst
secid.rst
seg6-sysctl.rst	doc: move seg6_flowlabel to seg6-sysctl.rst	2021-04-14 13:13:15 -07:00
segmentation-offloads.rst
sfp-phylink.rst	doc: sfp-phylink: Fix a broken reference	2022-08-02 21:45:07 -07:00
skbuff.rst	skbuff: render the checksum comment to documentation	2022-05-10 17:48:37 -07:00
smc-sysctl.rst	net/smc: Introduce a sysctl for setting SMC-R buffer type	2022-07-18 11:19:17 +01:00
snmp_counter.rst	net-next: docs: Fix typos in snmp_counter.rst	2021-01-05 17:07:38 -08:00
statistics.rst	docs: networking: extend the statistics documentation	2021-04-16 16:59:20 -07:00
strparser.rst
switchdev.rst	Documentation: networking: correct possessive "its"	2022-08-31 12:36:08 -07:00
sysfs-tagging.rst	Documentation: better locations for sysfs-pci, sysfs-tagging	2020-10-09 09:33:23 -06:00
tc-actions-env-rules.rst
tcp-thin.rst
team.rst
timestamping.rst	docs: networking: Use netif_rx().	2022-03-04 12:02:19 +00:00
tipc.rst	Documentation: add more details in tipc.rst	2021-07-01 13:18:18 -07:00
tls-offload-layers.svg
tls-offload-reorder-bad.svg
tls-offload-reorder-good.svg
tls-offload.rst	net: Disable NETIF_F_HW_TLS_RX when RXCSUM is disabled	2021-01-19 15:58:05 -08:00
tls.rst	tls: rx: add counter for NoPad violations	2022-07-11 19:48:33 -07:00
tproxy.rst
tuntap.rst	docs: networking: Replace strncpy() with strscpy()	2021-06-04 11:21:43 -06:00
udplite.rst
vrf.rst	doc: Document unexpected tcp_l3mdev_accept=1 behavior	2021-08-23 11:53:24 +01:00
vxlan.rst	docs: vxlan: add info about device features	2020-09-28 12:50:12 -07:00
x25-iface.rst	net: x25: Queue received packets in the drivers instead of per-CPU queues	2021-04-05 11:42:12 -07:00
x25.rst	net: x25: Remove unimplemented X.25-over-LLC code stubs	2020-12-12 17:15:33 -08:00
xfrm_device.rst	docs: networking: Fix a typo	2021-03-20 19:02:42 -07:00
xfrm_proc.rst
xfrm_sync.rst
xfrm_sysctl.rst