linux/Documentation
Eric Dumazet c9bee3b7fd tcp: TCP_NOTSENT_LOWAT socket option
The idea of this patch is to add an optional limit on the number of
unsent bytes in TCP sockets, to reduce kernel memory usage.

A TCP receiver might announce a big window, and TCP sender autotuning
might allow a large number of bytes in the write queue, but this brings
little performance benefit if a large part of this buffering is wasted:

The write queue needs to be large only to deal with a large BDP
(bandwidth-delay product), not necessarily to cope with scheduling
delays (incoming ACKs make room for the application to queue more
bytes).

For most workloads, a value of 128 KB or less gives applications
enough time to react to POLLOUT events (or to be woken up in a
blocking sendmsg()).

This patch adds two ways to set the limit :

1) A per-socket option, TCP_NOTSENT_LOWAT

2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using the TCP_NOTSENT_LOWAT socket option (or setting it to zero).
The default value is UINT_MAX (0xFFFFFFFF), meaning the limit has no
effect.
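
As an illustration of option 1, here is a minimal userspace sketch in C.
The helper names set_notsent_lowat() and get_notsent_lowat() are
hypothetical and not part of the patch. The fallback #define assumes the
Linux value of the option (25) for libc headers that predate it, and
131072 in the usage note is simply the 128 KB figure discussed above.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25	/* Linux value, for older libc headers */
#endif

/*
 * Hypothetical helper: limit the number of unsent bytes queued on one
 * TCP socket. Returns 0 on success, -1 on error with errno set.
 */
int set_notsent_lowat(int fd, unsigned int bytes)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
			  &bytes, sizeof(bytes));
}

/*
 * Hypothetical helper: read back the per-socket limit.
 * Returns 0 on success, -1 on error with errno set.
 */
int get_notsent_lowat(int fd, unsigned int *bytes)
{
	socklen_t len = sizeof(*bytes);

	return getsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, bytes, &len);
}
```

A caller would typically run set_notsent_lowat(fd, 131072) right after
socket() or accept(). Sockets that never set the option (or set it to
zero) keep following the sysctl.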

This changes poll()/select()/epoll() to report POLLOUT
only if the number of unsent bytes is below tp->notsent_lowat.
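
From the application side, the usual non-blocking write loop is
unchanged. Only the moment at which POLLOUT fires moves. A minimal
sketch in C, assuming a connected socket on which the limit is already
set; send_when_writable() is a hypothetical helper name, not something
the patch adds:

```c
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: wait until poll() reports POLLOUT, then issue a
 * single non-blocking send(). Returns the number of bytes queued,
 * 0 on timeout, or -1 on error.
 */
ssize_t send_when_writable(int fd, const void *buf, size_t len,
			   int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLOUT };
	int ready = poll(&pfd, 1, timeout_ms);

	if (ready < 0)
		return -1;	/* poll() failed, errno is set */
	if (ready == 0)
		return 0;	/* timed out, socket not writable yet */
	if (!(pfd.revents & POLLOUT))
		return -1;	/* POLLERR or POLLHUP without POLLOUT */

	/* May queue fewer than len bytes; the caller loops as usual. */
	return send(fd, buf, len, MSG_DONTWAIT | MSG_NOSIGNAL);
}
```

With the limit in place, each wakeup tends to queue a smaller amount of
data, so a loop around this helper makes more calls for the same volume
of traffic.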

Note this might increase the number of sendmsg()/sendfile() calls
when using non-blocking sockets,
and increase the number of context switches for blocking sockets.

Note this is not related to SO_SNDLOWAT, which is defined as:
 Specify the minimum number of bytes in the buffer until
 the socket layer will pass the data to the protocol.

Tested:

netperf sessions, and watching /proc/net/protocols "memory" column for TCP

With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

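Using 128 KB has no significant effect on the throughput or CPU usage
of a single flow (17321 vs 17448 10^6bits/s, 3.35 % vs 3.13 % local CPU
in the runs below), although context switches increase about 6.5x
(2,675,818 vs 412,514).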

A bonus is that we hold the socket lock for a shorter amount
of time, which should improve the latency of ACK processing.

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

           412,514 context-switches

     200.034645535 seconds time elapsed

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

         2,675,818 context-switches

     200.029651391 seconds time elapsed

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 17:54:48 -07:00
ABI Merge branch 'for_linus' of git://cavan.codon.org.uk/platform-drivers-x86 2013-07-13 18:08:23 -07:00
accounting Documentation/accounting/getdelays.c: avoid strncpy in accounting tool 2013-07-03 16:08:06 -07:00
acpi Power management and ACPI updates for 3.11-rc1 2013-07-03 14:35:40 -07:00
aoe aoe: allow user to disable target failure timeout 2012-12-17 17:15:25 -08:00
arm Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-07-04 11:40:58 -07:00
arm64 arm64: KVM: document kernel object mappings in HYP 2013-06-12 16:42:20 +01:00
auxdisplay
backlight backlight: lp855x: remove duplicate platform data 2013-04-29 18:28:19 -07:00
blackfin
block doc: fix misspellings with 'codespell' tool 2013-05-28 12:02:12 +02:00
blockdev nbd: update documentation and link to mailinglist 2013-02-27 19:10:22 -08:00
bus-devices ARM: OMAP2+: gpmc: generic timing calculation 2012-11-09 18:07:11 +05:30
cdrom
cgroups Merge branch 'for-3.11/core' of git://git.kernel.dk/linux-block 2013-07-11 13:03:24 -07:00
connector connector: Move cn_test.c away from NLMSG_PUT(). 2012-06-26 21:19:02 -07:00
console TTY:console: update document console.txt 2013-05-21 10:21:57 -07:00
cpu-freq cpufreq: rename index as driver_data in cpufreq_frequency_table 2013-06-04 14:25:59 +02:00
cpuidle cpuidle: make a single register function for all 2013-04-23 13:45:22 +02:00
cris
crypto drivers/dma: remove unused support for MEMSET operations 2013-07-03 16:07:42 -07:00
development-process
device-mapper Add a device-mapper target called dm-switch to provide a multipath 2013-07-11 13:05:40 -07:00
devicetree Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2013-07-13 18:05:13 -07:00
DocBook Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media 2013-07-13 12:09:57 -07:00
driver-model lib: devres: Introduce devm_ioremap_resource() 2013-01-22 09:41:43 -08:00
dvb [media] get_dvb_firmware: Fix the location of firmware for Terratec HTC 2013-01-01 11:18:26 -02:00
early-userspace Documentation/early-userspace/README fix a typo 2013-06-05 16:24:58 +02:00
EDID drm: Add 1600x1200 (UXGA) screen resolution to the built-in EDIDs 2013-04-12 14:06:16 +10:00
extcon
fault-injection doc: fix quite a few typos within Documentation 2012-11-19 14:28:24 +01:00
fb Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux 2013-07-09 16:04:31 -07:00
filesystems xfs: update (#2) for 3.11-rc1 2013-07-13 11:40:24 -07:00
firmware_class firmware loader: document firmware cache mechanism 2012-11-14 15:07:18 -08:00
fmc FMC: add a char-device mezzanine driver 2013-06-18 15:42:15 -07:00
frv
hid HID: remove x bit from sensor doc 2012-12-14 08:48:59 +01:00
hwmon New driver to support GMT G762/G763 pwm fan controllers 2013-07-03 19:56:35 -07:00
i2c Merge branch 'i2c/for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux 2013-07-04 14:02:09 -07:00
i2o
ia64 Fix example error_injection_tool 2013-04-02 09:39:55 -07:00
ide
infiniband IB/ipoib: Add rtnl_link_ops support 2012-09-20 16:49:17 -04:00
input Input: MT - Specify that ABS_MT_SLOT must have a minimum of 0 2013-06-13 21:32:00 +02:00
ioctl s390/sclp: Add SCLP character device driver 2013-06-26 21:10:13 +02:00
isdn
ja_JP
kbuild Merge branch 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild 2013-07-10 16:06:46 -07:00
kdump Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-07-04 11:40:58 -07:00
ko_KR
laptops doc: fix misspellings with 'codespell' tool 2013-05-28 12:02:12 +02:00
leds leds: lp55xx: configure the clock detection 2013-04-01 11:04:53 -07:00
m68k block: remove refs to XD disks from documentation 2013-05-17 15:17:12 +02:00
make
memory-devices
metag doc: fix misspellings with 'codespell' tool 2013-05-28 12:02:12 +02:00
mips
misc-devices doc: fix misspellings with 'codespell' tool 2013-05-28 12:02:12 +02:00
mmc mmc: core: Add in support to expose PRV for v4 MMCs 2013-03-22 12:10:42 -04:00
mn10300
mtd
namespaces userns: Recommend use of memory control groups. 2013-01-26 22:20:06 -08:00
netlabel
networking tcp: TCP_NOTSENT_LOWAT socket option 2013-07-24 17:54:48 -07:00
nfc NFC: Update pn544 documentation 2013-01-10 01:27:46 +01:00
parisc parisc: document the shadow registers 2013-07-09 22:09:19 +02:00
PCI PCI/MSI: Enable multiple MSIs with pci_enable_msi_block_auto() 2013-01-24 17:25:13 +01:00
pcmcia
power Merge branch 'pm-assorted' 2013-06-28 13:01:40 +02:00
powerpc powerpc/perf: Core EBB support for 64-bit book3s 2013-07-01 11:50:10 +10:00
pps
prctl seccomp: Make syscall skipping and nr changes more consistent 2012-10-02 21:14:29 +10:00
pti
ptp
rapidio rapidio: documentation update 2013-07-03 16:08:05 -07:00
RCU Merge branches 'cbnum.2013.06.10a', 'doc.2013.06.10a', 'fixes.2013.06.10a', 'srcu.2013.06.10a' and 'tiny.2013.06.10a' into HEAD 2013-06-10 13:46:44 -07:00
s390 s390/cio: add condev keyword to cio_ignore 2013-05-02 15:50:20 +02:00
scheduler sched: Rename sched.c as sched/core.c in comments and Documentation 2013-06-19 12:58:42 +02:00
scsi [SCSI] megaraid_sas: Changelog and driver version update 2013-06-24 17:52:10 -07:00
security Smack: add support for modification of existing rules 2013-03-19 14:16:42 -07:00
serial stallion: final cleanup 2013-06-03 14:31:39 -07:00
sh
sound ALSA: hda - Update HD-Audio-Models.txt 2013-06-17 11:16:33 +02:00
spi Documentation: remove __dev* attributes. 2013-01-03 15:57:16 -08:00
sysctl net: rename busy poll socket op and globals 2013-07-10 17:08:27 -07:00
target target: Simplify fabric sense data length handling 2012-09-17 17:12:58 -07:00
thermal Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux 2013-07-11 12:26:08 -07:00
timers nohz_full: Document additional restrictions 2013-06-10 13:42:38 -07:00
trace The majority of the changes here are cleanups for the large changes that 2013-07-11 09:02:09 -07:00
usb Driver core patches for 3.11-rc1 2013-07-02 11:44:19 -07:00
vDSO
video4linux Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media 2013-07-13 12:09:57 -07:00
virtual Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-07-04 11:40:58 -07:00
vm zswap: add documentation 2013-07-10 18:11:34 -07:00
w1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-07-04 11:40:58 -07:00
watchdog watchdog: delete mpcore_wdt driver 2013-07-11 21:47:58 +02:00
wimax
x86 arm: add support for LZ4-compressed kernel 2013-07-09 10:33:30 -07:00
xtensa xtensa: document MMUv3 setup sequence 2013-05-09 01:07:09 -07:00
zh_CN [media] v4l2-framework: replace g_chip_ident by g_std in the examples 2013-06-17 09:18:29 -03:00
.gitignore
00-INDEX FMC: add documentation for the core 2013-06-18 15:41:03 -07:00
applying-patches.txt
atomic_ops.txt Documentation: Memory barrier semantics of atomic_xchg() 2013-01-08 14:14:55 -08:00
bad_memory.txt
basic_profiling.txt
bcache.txt Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-07-04 11:40:58 -07:00
binfmt_misc.txt
braille-console.txt
bt8xxgpio.txt
btmrvl.txt
BUG-HUNTING
bus-virt-phys-mapping.txt
cachetlb.txt
Changes
circular-buffers.txt
clk.txt doc: clk: Fix incorrect wording 2013-06-10 23:54:14 +02:00
coccinelle.txt Coccinelle: Update information about the minimal version required 2013-07-03 22:58:20 +02:00
CodingStyle Documentation/CodingStyle: allow multiple return statements per function 2013-07-03 16:08:01 -07:00
cpu-hotplug.txt kernel: delete __cpuinit usage from all core kernel files 2013-07-14 19:36:59 -04:00
cpu-load.txt
cputopology.txt
crc32.txt
dcdbas.txt
debugging-modules.txt
debugging-via-ohci1394.txt
dell_rbu.txt
devices.txt /dev/oldmem: Remove the interface 2013-07-03 16:08:03 -07:00
digsig.txt
DMA-API-HOWTO.txt Documentation/DMA-API-HOWTO.txt: fix typo 2013-02-27 19:10:23 -08:00
DMA-API.txt dma-debug: New interfaces to debug dma mapping errors 2012-10-24 17:06:43 +02:00
DMA-attributes.txt common: DMA-mapping: add DMA_ATTR_FORCE_CONTIGUOUS attribute 2012-11-29 03:30:34 -08:00
dma-buf-sharing.txt dma-buf: replace dma_buf_export() with dma_buf_export_named() 2013-05-01 16:35:36 +05:30
DMA-ISA-LPC.txt
dmaengine.txt
dmatest.txt dmatest: do not allow to interrupt ongoing tests 2013-06-08 02:13:44 +05:30
dontdiff x86: remove offsets.h from .gitignore and dontdiff 2012-11-19 14:10:53 +01:00
dynamic-debug-howto.txt doc: fix misspellings with 'codespell' tool 2013-05-28 12:02:12 +02:00
edac.txt Merge branch 'devel' 2012-07-29 21:11:05 -03:00
eisa.txt
email-clients.txt
flexible-arrays.txt
futex-requeue-pi.txt
gcov.txt
gpio.txt Remove GENERIC_GPIO config option 2013-04-16 18:47:19 +09:00
highuid.txt
HOWTO Documentation: Updated broken link in HOWTO 2013-06-03 14:22:57 -07:00
hw_random.txt hwrng: Fix a wrong comment in Documentation/hw_random.txt 2013-03-10 18:16:36 +08:00
hwspinlock.txt
init.txt
initrd.txt Documentation/initrd.txt: Change the location of util-linux 2012-05-25 16:18:34 +02:00
intel_txt.txt Documentation: remove depends on CONFIG_EXPERIMENTAL 2013-01-11 11:38:03 -08:00
Intel-IOMMU.txt
io_ordering.txt
io-mapping.txt
iostats.txt iostats.txt: add easy-to-find description for field 6 2013-04-29 15:18:50 +02:00
IPMI.txt ipmi: add options to disable openfirmware and PCI scanning 2013-02-27 19:10:21 -08:00
IRQ-affinity.txt
IRQ-domain.txt irqdomain: update documentation 2012-12-05 23:52:10 +00:00
IRQ.txt
irqflags-tracing.txt
isapnp.txt
java.txt
kernel-doc-nano-HOWTO.txt kernel-doc: Update references to SGML to refs to XML instead. 2013-05-28 12:02:11 +02:00
kernel-docs.txt
kernel-parameters.txt The majority of the changes here are cleanups for the large changes that 2013-07-11 09:02:09 -07:00
kernel-per-CPU-kthreads.txt Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-07-04 11:40:58 -07:00
kmemcheck.txt
kmemleak.txt
kobject.txt Documentation: Fix "struct kobj_type" to include newer members. 2012-09-04 16:06:34 -07:00
kprobes.txt
kref.txt kref: Add kref_get_unless_zero documentation 2012-11-28 18:36:06 +10:00
ldm.txt
local_ops.txt
lockdep-design.txt
lockstat.txt locking/stat: Fix a typo 2013-02-19 08:42:37 +01:00
lockup-watchdogs.txt
logo.gif
logo.txt
magic-number.txt wanrouter: completely decouple obsolete code from kernel. 2013-01-31 19:20:33 -05:00
Makefile
ManagementStyle Documentation: ManagementStyle: fixed typo 2012-06-28 12:03:15 +02:00
md.txt md: remove doubled description for sync_max, merging it within sync_min/sync_max 2013-07-03 09:43:28 +10:00
media-framework.txt Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media 2013-07-13 12:09:57 -07:00
memory-barriers.txt Documentation: Memory barrier semantics of atomic_xchg() 2013-01-08 14:14:55 -08:00
memory-hotplug.txt hotplug: update nodemasks management 2012-12-12 17:38:33 -08:00
mono.txt
mutex-design.txt
nommu-mmap.txt
numastat.txt
oops-tracing.txt
padata.txt
parport-lowlevel.txt
parport.txt
percpu-rw-semaphore.txt percpu-rw-semaphore: fix documentation typos 2012-09-26 19:56:15 +02:00
pi-futex.txt
pinctrl.txt Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-07-04 11:40:58 -07:00
pnp.txt
preempt-locking.txt
printk-formats.txt lib: vsprintf: add IPv4/v6 generic %p[Ii]S[pfs] format specifier 2013-07-01 23:22:13 -07:00
pwm.txt pwm: Add sysfs interface 2013-06-21 11:32:51 +02:00
ramoops.txt pstore/ftrace: Convert to its own enable/disable debugfs knob 2012-09-06 22:16:58 -07:00
rbtree.txt rbtree: move augmented rbtree functionality to rbtree_augmented.h 2012-10-09 16:22:40 +09:00
remoteproc.txt remoteproc: add rproc_report_crash function to notify rproc crashes 2012-09-18 12:53:22 +03:00
rfkill.txt
robust-futex-ABI.txt
robust-futexes.txt
rpmsg.txt Documentation: remove __dev* attributes. 2013-01-03 15:57:16 -08:00
rt-mutex-design.txt sched: Rename sched.c as sched/core.c in comments and Documentation 2013-06-19 12:58:42 +02:00
rt-mutex.txt
rtc.txt rtc: add ability to push out an existing wakealarm using sysfs 2013-07-03 16:07:54 -07:00
SAK.txt
SecurityBugs
serial-console.txt
sgi-ioc4.txt
sgi-visws.txt
SM501.txt
smsc_ece1099.txt mfd: smsc: Add support for smsc gpio io/keypad driver 2012-10-01 15:27:48 +02:00
sparse.txt Documentation/sparse.txt: document context annotations for lock checking 2012-12-17 17:15:23 -08:00
spinlocks.txt sched: Rename sched.c as sched/core.c in comments and Documentation 2013-06-19 12:58:42 +02:00
stable_api_nonsense.txt
stable_kernel_rules.txt stable: Allow merging of backports for serious user-visible performance issues 2012-06-25 12:11:58 -07:00
static-keys.txt
SubmitChecklist Finally eradicate CONFIG_HOTPLUG 2013-06-03 14:20:18 -07:00
SubmittingDrivers
SubmittingPatches checkpatch: add Suggested-by as a standard signature 2013-04-29 18:28:20 -07:00
svga.txt
sysfs-rules.txt
sysrq.txt Documentation/sysrq: fix inconstistent help message of sysrq key 2013-04-30 17:04:10 -07:00
this_cpu_ops.txt percpu: add documentation on this_cpu operations 2013-04-04 10:24:53 -07:00
unaligned-memory-access.txt
unicode.txt
unshare.txt
vfio.txt vfio Updates for v3.11 2013-07-10 14:50:08 -07:00
VGA-softcursor.txt
vgaarbiter.txt
video-output.txt
vme_api.txt
volatile-considered-harmful.txt
workqueue.txt workqueue: reimplement WQ_HIGHPRI using a separate worker_pool 2012-07-13 22:24:45 -07:00
ww-mutex-design.txt mutex: Add support for wound/wait style locks 2013-06-26 12:10:56 +02:00
xz.txt
zorro.txt