linux/net
Eric Dumazet a74f0fa082 tcp: reduce POLLOUT events caused by TCP_NOTSENT_LOWAT
TCP_NOTSENT_LOWAT socket option or sysctl was added in linux-3.12
as a step to enable bigger tcp sndbuf limits.

It works reasonably well, but the following happens :

Once the limit is reached, TCP stack generates
an [E]POLLOUT event for every incoming ACK packet.

This causes a high number of context switches.

This patch implements the strategy David Miller added
in sock_def_write_space() :

 - If TCP socket has a notsent_lowat constraint of X bytes,
   allow sendmsg() to fill up to X bytes, but send [E]POLLOUT
   only if number of notsent bytes is below X/2

This considerably reduces TCP_NOTSENT_LOWAT overhead,
while allowing to keep the pipe full.

Tested:
 100 ms RTT netem testbed between A and B, 100 concurrent TCP_STREAM

A:/# cat /proc/sys/net/ipv4/tcp_wmem
4096	262144	64000000
A:/# super_netperf 100 -H B -l 1000 -- -K bbr &

A:/# grep TCP /proc/net/sockstat
TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 1364904 # This is about 54 MB of memory per flow :/

A:/# vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 256220672  13532 694976    0    0    10     0   28   14  0  1 99  0  0
 2  0      0 256320016  13532 698480    0    0   512     0 715901 5927  0 10 90  0  0
 0  0      0 256197232  13532 700992    0    0   735    13 771161 5849  0 11 89  0  0
 1  0      0 256233824  13532 703320    0    0   512    23 719650 6635  0 11 89  0  0
 2  0      0 256226880  13532 705780    0    0   642     4 775650 6009  0 12 88  0  0

A:/# echo 2097152 >/proc/sys/net/ipv4/tcp_notsent_lowat

A:/# grep TCP /proc/net/sockstat
TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 86411 # 3.5 MB per flow

A:/# vmstat 5 5  # check that context switches have not inflated too much.
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 260386512  13592 662148    0    0    10     0   17   14  0  1 99  0  0
 0  0      0 260519680  13592 604184    0    0   512    13 726843 12424  0 10 90  0  0
 1  1      0 260435424  13592 598360    0    0   512    25 764645 12925  0 10 90  0  0
 1  0      0 260855392  13592 578380    0    0   512     7 722943 13624  0 11 88  0  0
 1  0      0 260445008  13592 601176    0    0   614    34 772288 14317  0 10 90  0  0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-04 21:21:18 -08:00
..
6lowpan
9p Merge branch 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-11-03 10:35:52 -07:00
802
8021q net: 8021q: move vlan offload registrations into vlan_core 2018-11-16 19:51:08 -08:00
appletalk
atm Revert "net: simplify sock_poll_wait" 2018-10-23 10:57:06 -07:00
ax25 ax25: remove blank line at EOF 2018-07-24 14:10:42 -07:00
batman-adv Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-11-19 10:55:00 -08:00
bluetooth Merge branch 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-11-01 19:58:52 -07:00
bpf bpf: add tests for direct packet access from CGROUP_SKB 2018-10-19 13:49:34 -07:00
bpfilter net: bpfilter: Set user mode helper's command line 2018-10-22 19:37:36 -07:00
bridge net: bridge: Extend br_vlan_get_pvid() for bridge ports 2018-11-30 17:06:28 -08:00
caif Revert "net: simplify sock_poll_wait" 2018-10-23 10:57:06 -07:00
can can: raw: check for CAN FD capable netdev in raw_sendmsg() 2018-11-09 17:19:34 +01:00
ceph libceph: fall back to sendmsg for slab pages 2018-11-19 17:59:47 +01:00
core tcp: reduce POLLOUT events caused by TCP_NOTSENT_LOWAT 2018-12-04 21:21:18 -08:00
dcb net: dcb: Add priority-to-DSCP map getters 2018-07-27 13:17:50 -07:00
dccp net: Convert protocol error handlers from void to int 2018-11-08 17:13:08 -08:00
decnet net/decnet: add missing indentation 2018-11-16 19:42:49 -08:00
dns_resolver dns: Allow the dns resolver to retrieve a server set 2018-10-04 09:40:52 -07:00
dsa rocker, dsa, ethsw: Don't filter VLAN events on bridge itself 2018-11-23 18:02:24 -08:00
ethernet net: ethernet: provide nvmem_get_mac_address() 2018-12-03 15:40:30 -08:00
hsr
ieee802154 net/ipfrag: let ip[6]frag_high_thresh in ns be higher than in init_net 2018-09-21 19:45:52 -07:00
ife
ipv4 net: Do not route unicast IP packets twice 2018-12-04 08:36:36 -08:00
ipv6 net: Do not route unicast IP packets twice 2018-12-04 08:36:36 -08:00
iucv iucv: Remove SKB list assumptions. 2018-11-10 16:55:11 -08:00
kcm Revert "kcm: remove any offset before parsing messages" 2018-09-17 18:43:42 -07:00
key Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2018-07-27 09:33:37 -07:00
l2tp l2tp: fix a sock refcnt leak in l2tp_tunnel_register 2018-11-14 22:49:31 -08:00
l3mdev l3mdev: add function to retreive upper master 2018-12-03 14:15:26 -08:00
lapb
llc llc: do not use sk_eat_skb() 2018-10-22 19:59:20 -07:00
mac80211 mac80211: implement ieee80211_tx_rate_update to update rate 2018-10-12 13:05:40 +02:00
mac802154 mac802154: Remove VLA usage of skcipher 2018-09-28 12:46:07 +08:00
mpls net/mpls: Handle kernel side filtering of route dumps 2018-10-16 00:14:07 -07:00
ncsi net/ncsi: Add NCSI Mellanox OEM command 2018-11-27 16:37:20 -08:00
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-11-28 22:10:54 -08:00
netlabel netlabel: check for IPV4MASK in addrinfo_get 2018-09-21 18:58:34 -07:00
netlink netlink: Add answer_flags to netlink_callback 2018-10-16 00:13:12 -07:00
netrom
nfc Merge branch 'work.tty-ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2018-10-24 14:43:41 +01:00
nsh
openvswitch OVS: remove VLAN_TAG_PRESENT - fixup 2018-11-10 13:42:16 -08:00
packet packet: copy user buffers before orphan or clone 2018-11-23 11:08:03 -08:00
phonet
psample
qrtr
rds Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-10-12 21:38:46 -07:00
rfkill Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-09-04 21:33:03 -07:00
rose
rxrpc rxrpc: Fix life check 2018-11-15 11:35:40 -08:00
sched net/sched: act_tunnel_key: Don't dump dst port if it wasn't set 2018-12-04 20:53:37 -08:00
sctp Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-11-28 22:10:54 -08:00
smc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-11-24 17:01:43 -08:00
strparser bpf, sockmap: convert to generic sk_msg interface 2018-10-15 12:23:19 -07:00
sunrpc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-11-19 10:55:00 -08:00
switchdev switchdev: Replace port obj add/del SDO with a notification 2018-11-23 18:02:24 -08:00
tipc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-11-28 22:10:54 -08:00
tls bpf: helper to pop data from messages 2018-11-28 22:07:57 +01:00
unix Revert "net: simplify sock_poll_wait" 2018-10-23 10:57:06 -07:00
vmw_vsock vsock: split dwork to avoid reinitializations 2018-08-07 12:39:13 -07:00
wimax wimax: remove blank lines at EOF 2018-07-24 14:10:42 -07:00
wireless nl80211: Add per peer statistics to compute FCS error rate 2018-10-12 12:56:34 +02:00
x25 x25: remove blank lines at EOF 2018-07-24 14:10:42 -07:00
xdp Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-10-19 11:03:06 -07:00
xfrm Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2018-11-03 18:25:17 -07:00
compat.c y2038: socket: Change recvmmsg to use __kernel_timespec 2018-08-29 15:42:24 +02:00
Kconfig bpf, sockmap: convert to generic sk_msg interface 2018-10-15 12:23:19 -07:00
Makefile
socket.c socket: do a generic_file_splice_read when proto_ops has no splice_read 2018-11-17 21:34:11 -08:00
sysctl_net.c