linux/include/net
Eric Dumazet c9bee3b7fd tcp: TCP_NOTSENT_LOWAT socket option
Idea of this patch is to add optional limitation of number of
unsent bytes in TCP sockets, to reduce usage of kernel memory.

TCP receiver might announce a big window, and TCP sender autotuning
might allow a large amount of bytes in write queue, but this has little
performance impact if a large part of this buffering is wasted :

Write queue needs to be large only to deal with large BDP, not
necessarily to cope with scheduling delays (incoming ACKS make room
for the application to queue more bytes)

For most workloads, using a value of 128 KB or less is OK to give
applications enough time to react to POLLOUT events in time
(or being awaken in a blocking sendmsg())

This patch adds two ways to set the limit :

1) Per socket option TCP_NOTSENT_LOWAT

2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.

This changes poll()/select()/epoll() to report POLLOUT
only if number of unsent bytes is below tp->nosent_lowat

Note this might increase number of sendmsg()/sendfile() calls
when using non blocking sockets,
and increase number of context switches for blocking sockets.

Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
defined as :
 Specify the minimum number of bytes in the buffer until
 the socket layer will pass the data to the protocol)

Tested:

netperf sessions, and watching /proc/net/protocols "memory" column for TCP

With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

Using 128KB has no bad effect on the throughput or cpu usage
of a single flow, although there is an increase of context switches.

A bonus is that we hold socket lock for a shorter amount
of time and should improve latencies of ACK processing.

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

           412,514 context-switches

     200.034645535 seconds time elapsed

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

         2,675,818 context-switches

     200.029651391 seconds time elapsed

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 17:54:48 -07:00
..
9p 9P: Add cancelled() to the transport functions. 2013-07-07 22:18:18 -05:00
bluetooth Bluetooth: Remove unneeded flag 2013-06-23 00:23:53 +01:00
caif caif: Remove my bouncing email address. 2013-04-23 13:25:51 -04:00
irda irda: small read past the end of array in debug code 2013-04-19 17:32:31 -04:00
iucv af_iucv: fix recvmsg by replacing skb_pull() function 2013-04-08 17:16:57 -04:00
netfilter net_sched: add 64bit rate estimators 2013-06-11 02:51:03 -07:00
netns netfilter: {ipt,ebt}_ULOG: rise warning on deprecation 2013-05-23 14:23:16 +02:00
nfc NFC: Add secure elements addition and removal API 2013-06-14 13:44:58 +02:00
phonet net: remove my future former mail address 2012-06-17 16:29:38 -07:00
sctp net: sctp: trivial: update mailing list address 2013-07-24 17:53:38 -07:00
tc_act
act_api.h net_sched: add 64bit rate estimators 2013-06-11 02:51:03 -07:00
addrconf.h ipv6,mcast: always hold idev->lock before mca_lock 2013-07-01 23:39:21 -07:00
af_ieee802154.h
af_rxrpc.h
af_unix.h af_unix: fix a fatal race with bit fields 2013-05-01 15:13:49 -04:00
ah.h
arp.h net: Dont use ifindices in hash fns 2012-08-09 16:18:06 -07:00
atmclip.h atm: clip: Use device neigh support on top of "arp_tbl". 2011-11-30 18:51:03 -05:00
ax25.h hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
ax88796.h
busy_poll.h net: rename busy poll socket op and globals 2013-07-10 17:08:27 -07:00
cfg80211-wext.h
cfg80211.h cfg80211: require passing BSS struct back to cfg80211_assoc_timeout 2013-06-19 18:55:39 +02:00
checksum.h net: core: add function for incremental IPv6 pseudo header checksum updates 2012-08-30 03:00:16 +02:00
cipso_ipv4.h cipso: handle CIPSO options correctly when NetLabel is disabled 2012-06-01 14:18:29 -04:00
cls_cgroup.h cls_cgroup: remove task_struct parameter from sock_update_classid() 2013-04-09 13:19:35 -04:00
codel.h codel: refine one condition to avoid a nul rec_inv_sqrt 2012-08-10 16:52:54 -07:00
compat.h net: cleanup unsigned to unsigned int 2012-04-15 12:44:40 -04:00
datalink.h
dcbevent.h dcb: Add stub routines for !CONFIG_DCB 2011-10-06 15:49:51 -04:00
dcbnl.h net/dcb: Add an optional max rate attribute 2012-04-05 05:08:04 -04:00
dn_dev.h
dn_fib.h decnet: Parse netlink attributes on our own 2013-03-22 10:31:16 -04:00
dn_neigh.h
dn_nsp.h
dn_route.h decnet: use correct RCU API to deref sk_dst_cache field 2013-01-28 00:15:27 -05:00
dn.h net: cleanup unsigned to unsigned int 2012-04-15 12:44:40 -04:00
dsa.h dsa: Include linux/if_ether.h to fix build error 2011-12-01 11:41:06 -05:00
dsfield.h ipv6: Optimize ipv6_change_dsfield(). 2013-01-09 23:59:53 -08:00
dst_ops.h net: Fix warnings in dst_ops.h 2012-07-19 10:43:03 -07:00
dst.h Fix dst_neigh_lookup/dst_neigh_lookup_skb return value handling bug 2013-03-15 09:06:58 -04:00
esp.h
ethoc.h
fib_rules.h ipv4: Elide fib_validate_source() completely when possible. 2012-06-29 01:36:36 -07:00
firewire.h firewire net, ipv4 arp: Extend hardware address and remove driver-level packet inspection. 2013-03-26 12:32:13 -04:00
flow_keys.h flow_keys: include thoff into flow_keys for later usage 2013-03-20 12:14:36 -04:00
flow.h ipv4: Add FLOWI_FLAG_KNOWN_NH 2012-10-08 17:42:36 -04:00
garp.h
gen_stats.h net_sched: add 64bit rate estimators 2013-06-11 02:51:03 -07:00
genetlink.h genl: Allow concurrent genl callbacks. 2013-04-25 01:43:15 -04:00
gre.h net: gre: move GSO functions to gre_offload 2013-07-03 14:37:39 -07:00
gro_cells.h gro: Fix kcalloc argument order 2013-01-27 22:46:33 -05:00
icmp.h ipv4: fix error handling in icmp_protocol. 2013-02-22 15:10:18 -05:00
ieee80211_radiotap.h mac80211: add STBC flag for radiotap 2013-05-24 12:07:25 +02:00
ieee802154_netdev.h ieee802154/nl-mac.c: make some MLME operations optional 2013-04-08 12:00:16 -04:00
ieee802154.h 6LoWPAN: add fragmentation support 2011-11-14 00:19:42 -05:00
if_inet6.h ipv6: resend MLD report if a link-local address completes DAD 2013-06-28 21:19:17 -07:00
inet6_connection_sock.h ipv6: Add helper inet6_csk_update_pmtu(). 2012-07-16 03:44:56 -07:00
inet6_hashtables.h ipv6: use a stronger hash for tcp 2013-02-21 18:15:58 -05:00
inet_common.h net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN) 2012-07-19 11:02:03 -07:00
inet_connection_sock.h tcp: Tail loss probe (TLP) 2013-03-12 08:30:34 -04:00
inet_ecn.h net: Correct comparisons and calculations using skb->tail and skb-transport_header 2013-05-28 23:49:07 -07:00
inet_frag.h net: frag, fix race conditions in LRU list maintenance 2013-05-06 11:06:51 -04:00
inet_hashtables.h hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
inet_sock.h ipv4: remove is_data also from ip_options documentation. 2013-06-12 03:13:50 -07:00
inet_timewait_sock.h hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
inetpeer.h ipv4: Maintain redirect and PMTU info in struct rtable again. 2012-07-10 22:40:14 -07:00
ip6_checksum.h ipv6: move csum_ipv6_magic() and udp6_csum_init() into static library 2013-01-08 17:56:10 -08:00
ip6_fib.h ipv6: fix race condition regarding dst->expires and dst->from. 2013-02-20 15:11:45 -05:00
ip6_route.h ipv6: Remove unused neigh argument for icmp6_dst_alloc() and its callers. 2013-01-18 14:41:13 -05:00
ip6_tunnel.h GRE: Refactor GRE tunneling code. 2013-03-26 12:27:18 -04:00
ip_fib.h ipv4: remove fib_update_nh_saddrs() declaration. 2013-07-02 00:33:52 -07:00
ip_tunnels.h sit: add support of x-netns 2013-06-27 22:30:47 -07:00
ip_vs.h ipvs: add sync_persist_mode flag 2013-06-26 18:01:46 +09:00
ip.h ipv4: Add a socket release callback for datagram sockets 2013-01-21 14:17:05 -05:00
ipcomp.h
ipconfig.h
ipv6.h ipv6: Convert use of typedef ctl_table to struct ctl_table 2013-06-19 23:18:07 -07:00
ipx.h
iw_handler.h
lapb.h lapb: Neaten debugging 2012-05-17 18:45:20 -04:00
lib80211.h hostap: Don't use create_proc_read_entry() 2013-04-29 15:41:56 -04:00
llc_c_ac.h
llc_c_ev.h net: cleanup unsigned to unsigned int 2012-04-15 12:44:40 -04:00
llc_c_st.h
llc_conn.h
llc_if.h
llc_pdu.h net: delete all instances of special processing for token ring 2012-05-15 20:14:35 -04:00
llc_s_ac.h
llc_s_ev.h
llc_s_st.h
llc_sap.h
llc.h llc: Remove stray reference to sysctl_llc_station_ack_timeout. 2012-09-17 13:13:24 -04:00
mac80211.h mac80211: track AP's beacon rate and give it to the driver 2013-06-13 11:58:47 +02:00
mac802154.h mac802154: add wpan device-class support 2012-06-26 21:06:11 -07:00
mip6.h
mld.h
mrp.h net/802: Implement Multiple Registration Protocol (MRP) 2013-02-10 20:37:22 -05:00
ndisc.h ndisc: Convert use of typedef ctl_table to struct ctl_table 2013-06-19 23:18:07 -07:00
neighbour.h net neighbour, decnet: Ensure to align device private data on preferred alignment. 2013-02-11 00:21:44 -05:00
net_namespace.h netns: exclude ipvs from struct net when IPVS disabled 2013-06-26 18:01:46 +09:00
net_ratelimit.h
netdma.h
netevent.h ipv6 netevent: Remove old_neigh from netevent_redirect. 2013-01-14 15:04:59 -05:00
netlabel.h userns: Convert the audit loginuid to be a kuid 2012-09-17 18:08:54 -07:00
netlink.h netlink: Rename pid to portid to avoid confusion 2012-09-10 15:30:41 -04:00
netprio_cgroup.h netprio_cgroup: remove task_struct parameter from sock_update_netprio() 2013-04-09 13:19:37 -04:00
netrom.h hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
nexthop.h
nl802154.h
p8022.h
ping.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-06-05 16:37:30 -07:00
pkt_cls.h pkt_sched: namespace aware act_mirred 2013-01-14 15:09:36 -05:00
pkt_sched.h sch_api: introduce qdisc_watchdog_schedule_ns() 2013-02-12 18:59:45 -05:00
protocol.h net: Remove code duplication between offload structures 2012-11-15 17:39:51 -05:00
psnap.h
raw.h
rawv6.h ipv6: bool/const conversions phase2 2012-05-19 01:08:16 -04:00
red.h net_sched: red: Make minor corrections to comments 2012-04-16 23:53:11 -04:00
regulatory.h regulatory: use RCU to protect last_request 2013-01-03 13:01:30 +01:00
request_sock.h net: remove a stale comment for dl_next 2013-04-22 15:55:48 -04:00
rose.h
route.h ipv4: avoid a test in ip_rt_put() 2012-11-03 14:59:04 -04:00
rtnetlink.h rtnetlink: Remove passing of attributes into rtnl_doit functions 2013-03-22 10:31:16 -04:00
sch_generic.h net_sched: psched_ratecfg_precompute() improvements 2013-06-11 22:39:47 -07:00
scm.h Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-04-22 20:32:51 -04:00
secure_seq.h net: defer net_secret[] initialization 2013-04-29 15:14:02 -04:00
slhc_vj.h
snmp.h net: avoid reloads in SNMP_UPD_PO_STATS 2012-08-06 13:40:47 -07:00
sock.h tcp: TCP_NOTSENT_LOWAT socket option 2013-07-24 17:54:48 -07:00
stp.h
tcp_memcontrol.h cgroup: pass struct mem_cgroup instead of struct cgroup to socket memcg 2012-04-10 10:04:07 -07:00
tcp_states.h
tcp.h tcp: TCP_NOTSENT_LOWAT socket option 2013-07-24 17:54:48 -07:00
timewait_sock.h [PATCH] tcp: Cache inetpeer in timewait socket, and only when necessary. 2012-06-09 14:56:12 -07:00
transp_v6.h transp_v6.h: style neatening 2013-06-04 16:43:42 -07:00
udp.h ipv6: call udp_push_pending_frames when uncorking a socket with AF_INET pending data 2013-07-02 12:44:18 -07:00
udplite.h net: ipv4: Standardize prefixes for message logging 2012-03-12 17:05:21 -07:00
wext.h
wimax.h net: cleanup unsigned to unsigned int 2012-04-15 12:44:40 -04:00
wpan-phy.h mac802154: monitor device support 2012-05-16 15:17:08 -04:00
x25.h net: cleanup unsigned to unsigned int 2012-04-15 12:44:40 -04:00
x25device.h
xfrm.h xfrm: force a garbage collection after deleting a policy 2013-05-31 17:30:07 -07:00