linux/net
mfreemon@cloudflare.com b650d953cd tcp: enforce receive buffer memory limits by allowing the tcp window to shrink
Under certain circumstances, the tcp receive buffer memory limit
set by autotuning (sk_rcvbuf) is increased due to incoming data
packets as a result of the window not closing when it should be.
This can result in the receive buffer growing all the way up to
tcp_rmem[2], even for tcp sessions with a low BDP.

To reproduce:  Connect a TCP session with the receiver doing
nothing and the sender sending small packets (an infinite loop
of socket send() with 4 bytes of payload with a sleep of 1 ms
in between each send()).  This will cause the tcp receive buffer
to grow all the way up to tcp_rmem[2].

As a result, a host can have individual tcp sessions with receive
buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
limits, causing the host to go into tcp memory pressure mode.

The fundamental issue is the relationship between the granularity
of the window scaling factor and the number of byte ACKed back
to the sender.  This problem has previously been identified in
RFC 7323, appendix F [1].

The Linux kernel currently adheres to never shrinking the window.

In addition to the overallocation of memory mentioned above, the
current behavior is functionally incorrect, because once tcp_rmem[2]
is reached when no remediations remain (i.e. tcp collapse fails to
free up any more memory and there are no packets to prune from the
out-of-order queue), the receiver will drop in-window packets
resulting in retransmissions and an eventual timeout of the tcp
session.  A receive buffer full condition should instead result
in a zero window and an indefinite wait.

In practice, this problem is largely hidden for most flows.  It
is not applicable to mice flows.  Elephant flows can send data
fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
triggering a zero window.

But this problem does show up for other types of flows.  Examples
are websockets and other type of flows that send small amounts of
data spaced apart slightly in time.  In these cases, we directly
encounter the problem described in [1].

RFC 7323, section 2.4 [2], says there are instances when a retracted
window can be offered, and that TCP implementations MUST ensure
that they handle a shrinking window, as specified in RFC 1122,
section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
management have made clear that sender must accept a shrunk window
from the receiver, including RFC 793 [4] and RFC 1323 [5].

This patch implements the functionality to shrink the tcp window
when necessary to keep the right edge within the memory limit by
autotuning (sk_rcvbuf).  This new functionality is enabled with
the new sysctl: net.ipv4.tcp_shrink_window

Additional information can be found at:
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

[1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
[2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
[3] https://www.rfc-editor.org/rfc/rfc1122#page-91
[4] https://www.rfc-editor.org/rfc/rfc793
[5] https://www.rfc-editor.org/rfc/rfc1323

Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-17 09:53:53 +01:00
..
6lowpan 6lowpan: Remove redundant initialisation. 2023-03-29 08:22:52 +01:00
9p Including fixes from netfilter. 2023-05-05 19:12:01 -07:00
802
8021q vlan: fix a potential uninit-value in vlan_dev_hard_start_xmit() 2023-05-17 12:55:39 +01:00
appletalk
atm atm: hide unused procfs functions 2023-05-17 21:27:30 -07:00
ax25
batman-adv batman-adv: Broken sync while rescheduling delayed work 2023-05-26 23:14:49 +02:00
bluetooth Bluetooth: L2CAP: Add missing checks for invalid DCID 2023-06-05 17:24:14 -07:00
bpf bpf: Move kernel test kfuncs to bpf_testmod 2023-05-16 22:09:24 -07:00
bpfilter
bridge skbuff: bridge: Add layer 2 miss indication 2023-05-30 23:37:00 -07:00
caif net: caif: Fix use-after-free in cfusbl_device_notify() 2023-03-02 22:22:07 -08:00
can can: j1939: avoid possible use-after-free when j1939_can_rx_register fails 2023-06-05 08:26:40 +02:00
ceph Networking changes for 6.3. 2023-02-21 18:24:12 -08:00
core net: add check for current MAC address in dev_set_mac_address 2023-06-15 22:54:54 -07:00
dcb
dccp net: ioctl: Use kernel memory on protocol ioctl callbacks 2023-06-15 22:33:26 -07:00
devlink devlink: report devlink_port_type_warn source device 2023-06-17 00:31:14 -07:00
dns_resolver
dsa net: dsa: add support for mac_prepare() and mac_finish() calls 2023-05-26 10:39:40 +01:00
ethernet
ethtool net: create device lookup API with reference tracking 2023-06-15 08:21:11 +01:00
handshake Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2023-06-15 22:19:41 -07:00
hsr hsr: ratelimit only when errors are printed 2023-03-16 21:11:03 -07:00
ieee802154 net: ioctl: Use kernel memory on protocol ioctl callbacks 2023-06-15 22:33:26 -07:00
ife
ipv4 tcp: enforce receive buffer memory limits by allowing the tcp window to shrink 2023-06-17 09:53:53 +01:00
ipv6 ip, ip6: Fix splice to raw and ping sockets 2023-06-16 11:45:16 -07:00
iucv net/iucv: Fix size of interrupt data 2023-03-16 17:34:40 -07:00
kcm kcm: Fix unnecessary psock unreservation. 2023-06-17 00:08:27 -07:00
key af_key: Reject optional tunnel/BEET mode templates in outbound policies 2023-05-10 07:04:51 +02:00
l2tp net: ioctl: Use kernel memory on protocol ioctl callbacks 2023-06-15 22:33:26 -07:00
l3mdev
lapb
llc net: deal with most data-races in sk_wait_event() 2023-05-10 10:03:32 +01:00
mac80211 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2023-06-15 22:19:41 -07:00
mac802154 mac802154: Rename kfree_rcu() to kvfree_rcu_mightsleep() 2023-04-05 13:48:04 +00:00
mctp net: mctp: remove redundant RTN_UNICAST check 2023-06-17 00:25:24 -07:00
mpls net: move gso declarations and functions to their own files 2023-06-10 00:11:41 -07:00
mptcp net: ioctl: Use kernel memory on protocol ioctl callbacks 2023-06-15 22:33:26 -07:00
ncsi net/ncsi: change from ndo_set_mac_address to dev_set_mac_address 2023-06-09 10:32:51 +01:00
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2023-06-15 22:19:41 -07:00
netlabel netlabel: fix shift wrapping bug in netlbl_catmap_setlong() 2023-06-10 19:54:06 +01:00
netlink netlink: support extack in dump ->start() 2023-06-12 11:32:44 +01:00
netrom netrom: fix info-leak in nr_write_internal() 2023-05-25 21:02:29 -07:00
nfc nfc: llcp: fix possible use of uninitialized variable in nfc_llcp_send_connect() 2023-05-15 13:03:34 +01:00
nsh net: move gso declarations and functions to their own files 2023-06-10 00:11:41 -07:00
openvswitch net: openvswitch: add support for l4 symmetric hashing 2023-06-12 09:46:30 +01:00
packet af_packet: do not use READ_ONCE() in packet_bind() 2023-05-29 22:03:48 -07:00
phonet net: ioctl: Use kernel memory on protocol ioctl callbacks 2023-06-15 22:33:26 -07:00
psample
qrtr net: qrtr: Fix an uninit variable access bug in qrtr_tx_resume() 2023-04-13 09:35:30 +02:00
rds rds: rds_rm_zerocopy_callback() correct order for list_add_tail() 2023-02-13 09:33:39 +00:00
rfkill net: rfkill-gpio: Add explicit include for of.h 2023-04-06 20:36:27 +02:00
rose net/rose: Fix to not accept on connected socket 2023-01-28 00:19:57 -08:00
rxrpc rxrpc: Truncate UTS_RELEASE for rxrpc version 2023-05-30 10:01:06 +02:00
sched Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2023-06-15 22:19:41 -07:00
sctp net: ioctl: Use kernel memory on protocol ioctl callbacks 2023-06-15 22:33:26 -07:00
smc net/smc: Avoid to access invalid RMBs' MRs in SMCRv1 ADD LINK CONT 2023-06-03 20:51:04 +01:00
strparser
sunrpc sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage 2023-06-12 21:13:23 -07:00
switchdev
tipc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2023-06-15 22:19:41 -07:00
tls net: tls: make the offload check helper take skb not socket 2023-06-15 09:01:05 +01:00
unix af_unix: Kconfig: make CONFIG_UNIX bool 2023-06-12 10:45:50 +01:00
vmw_vsock bpf, sockmap: Pass skb ownership through read_skb 2023-05-23 16:09:47 +02:00
wireless Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2023-06-15 22:19:41 -07:00
x25 net/x25: Fix to not accept on connected socket 2023-01-25 09:51:04 +00:00
xdp xsk: Use pool->dma_pages to check for DMA 2023-04-27 22:24:51 +02:00
xfrm net: move gso declarations and functions to their own files 2023-06-10 00:11:41 -07:00
compat.c net/compat: Update msg_control_is_user when setting a kernel pointer 2023-04-14 11:09:27 +01:00
devres.c
Kconfig net/core: Enable socket busy polling on -RT 2023-05-26 08:51:26 +01:00
Kconfig.debug
Makefile net/handshake: Create a NETLINK service for handling handshake requests 2023-04-19 18:48:48 -07:00
socket.c splice, net: Add a splice_eof op to file-ops and socket-ops 2023-06-08 19:40:30 -07:00
sysctl_net.c