linux/net/ipv4
Eric Dumazet 75c119afe1 tcp: implement rb-tree based retransmit queue
Using a linear list to store all skbs in write queue has been okay
for quite a while : O(N) is not too bad when N < 500.

Things get messy when N is the order of 100,000 : Modern TCP stacks
want 10Gbit+ of throughput even with 200 ms RTT flows.

40 ns per cache line miss means a full scan can use 4 ms,
blowing away CPU caches.

SACK processing often can use various hints to avoid parsing
whole retransmit queue. But with high packet losses and/or high
reordering, hints no longer work.

Sender has to process thousands of unfriendly SACK, accumulating
a huge socket backlog, burning a cpu and massively dropping packets.

Using an rb-tree for retransmit queue has been avoided for years
because it added complexity and overhead, but now is the time
to be more resistant and say no to quadratic behavior.

1) RTX queue is no longer part of the write queue : already sent skbs
are stored in one rb-tree.

2) Since reaching the head of write queue no longer needs
sk->sk_send_head, we added an union of sk_send_head and tcp_rtx_queue

Tested:

 On receiver :
 netem on ingress : delay 150ms 200us loss 1
 GRO disabled to force stress and SACK storms.

for f in `seq 1 10`
do
 ./netperf -H lpaa6 -l30 -- -K bbr -o THROUGHPUT|tail -1
done | awk '{print $0} {sum += $0} END {printf "%7u\n",sum}'

Before patch :

323.87
351.48
339.59
338.62
306.72
204.07
304.93
291.88
202.47
176.88
   2840

After patch:

1700.83
2207.98
2070.17
1544.26
2114.76
2124.89
1693.14
1080.91
2216.82
1299.94
  18053

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-07 00:28:54 +01:00
..
netfilter netfilter: xtables: add scheduling opportunity in get_counters 2017-09-08 18:55:27 +02:00
af_inet.c ipv4: Namespaceify tcp_fastopen_key knob 2017-10-01 17:55:54 -07:00
ah4.c Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2017-06-23 14:17:31 -04:00
arp.c neigh: increase queue_len_bytes to match wmem_default 2017-08-29 16:10:50 -07:00
cipso_ipv4.c Cipso: cipso_v4_optptr enter infinite loop 2017-08-01 15:31:23 -07:00
datagram.c
devinet.c net: avoid a full fib lookup when rp_filter is disabled. 2017-09-21 15:15:22 -07:00
esp4_offload.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-09-01 17:42:05 -07:00
esp4.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-09-01 17:42:05 -07:00
fib_frontend.c net: avoid a full fib lookup when rp_filter is disabled. 2017-09-21 15:15:22 -07:00
fib_lookup.h ipv4: do metrics match when looking up and deleting a route 2017-08-23 20:37:10 -07:00
fib_notifier.c net: Add module reference to FIB notifiers 2017-09-01 20:33:42 -07:00
fib_rules.c net: fib_rules: Implement notification logic in core 2017-08-03 15:35:59 -07:00
fib_semantics.c net: ipv4: remove fib_info arg to fib_check_nh 2017-09-29 06:19:32 +01:00
fib_trie.c ipv4: do metrics match when looking up and deleting a route 2017-08-23 20:37:10 -07:00
fou.c gue: fix remcsum when GRO on and CHECKSUM_PARTIAL boundary is outer UDP 2017-08-01 16:09:14 -07:00
gre_demux.c
gre_offload.c net: Remove all references to SKB_GSO_UDP. 2017-07-17 09:52:58 -07:00
icmp.c ip/options: explicitly provide net ns to __ip_options_echo() 2017-08-06 20:51:12 -07:00
igmp.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-08-21 17:06:42 -07:00
inet_connection_sock.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-09-23 10:16:53 -07:00
inet_diag.c inet_diag: allow protocols to provide additional data 2017-09-01 18:38:09 -07:00
inet_fragment.c Revert "net: use lib/percpu_counter API for fragmentation mem accounting" 2017-09-03 11:01:05 -07:00
inet_hashtables.c net: ipv4: add second dif to inet socket lookups 2017-08-07 11:39:21 -07:00
inet_timewait_sock.c net: convert sock.sk_refcnt from atomic_t to refcount_t 2017-07-01 07:39:08 -07:00
inetpeer.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-10-05 18:19:22 -07:00
ip_forward.c
ip_fragment.c Revert "net: fix percpu memory leaks" 2017-09-03 11:01:05 -07:00
ip_gre.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-10-05 18:19:22 -07:00
ip_input.c IPv4: early demux can return an error code 2017-10-01 03:55:47 +01:00
ip_options.c ip/options: explicitly provide net ns to __ip_options_echo() 2017-08-06 20:51:12 -07:00
ip_output.c udp: remove unreachable ufo branches 2017-08-22 14:27:18 -07:00
ip_sockglue.c net: ipv4: fix l3slave check for index returned in IP_PKTINFO 2017-09-15 14:27:51 -07:00
ip_tunnel_core.c net: store port/representator id in metadata_dst 2017-06-25 11:42:01 -04:00
ip_tunnel.c ipv4: speedup ipv6 tunnels dismantle 2017-09-19 16:32:24 -07:00
ip_vti.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-10-05 18:19:22 -07:00
ipcomp.c
ipconfig.c networking: convert many more places to skb_put_zero() 2017-06-16 11:48:35 -04:00
ipip.c ipv4: speedup ipv6 tunnels dismantle 2017-09-19 16:32:24 -07:00
ipmr.c ipv4: ipmr: Don't forward packets already forwarded by hardware 2017-10-03 10:06:30 -07:00
Kconfig Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2017-02-16 21:25:49 -05:00
Makefile tcp: ULP infrastructure 2017-06-15 12:12:40 -04:00
netfilter.c netfilter: use skb_to_full_sk in ip_route_me_harder 2017-02-28 12:49:36 +01:00
ping.c net: convert sock.sk_refcnt from atomic_t to refcount_t 2017-07-01 07:39:08 -07:00
proc.c tcp: Revert "tcp: remove header prediction" 2017-08-30 11:20:09 -07:00
protocol.c net: Add sysctl to toggle early demux for tcp and udp 2017-03-24 13:17:07 -07:00
raw_diag.c net: ipv6: add second dif to raw socket lookups 2017-08-07 11:39:22 -07:00
raw.c net: ipv4: add second dif to multicast source filter 2017-08-07 11:39:22 -07:00
route.c net/ipv4: Remove unused variable in route.c 2017-10-05 21:16:06 -07:00
syncookies.c ip/options: explicitly provide net ns to __ip_options_echo() 2017-08-06 20:51:12 -07:00
sysctl_net_ipv4.c ipv4: Namespaceify tcp_fastopen_blackhole_timeout knob 2017-10-01 17:55:54 -07:00
tcp_bbr.c tcp_bbr: init pacing rate on first RTT sample 2017-07-15 14:43:29 -07:00
tcp_bic.c tcp: consolidate congestion control undo functions 2017-08-06 21:25:10 -07:00
tcp_cdg.c tcp: consolidate congestion control undo functions 2017-08-06 21:25:10 -07:00
tcp_cong.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-09-01 17:42:05 -07:00
tcp_cubic.c tcp: consolidate congestion control undo functions 2017-08-06 21:25:10 -07:00
tcp_dctcp.c
tcp_diag.c tcp_diag: report TCP MD5 signing keys and addresses 2017-09-01 18:38:09 -07:00
tcp_fastopen.c net: add rb_to_skb() and other rb tree helpers 2017-10-07 00:28:53 +01:00
tcp_highspeed.c tcp: consolidate congestion control undo functions 2017-08-06 21:25:10 -07:00
tcp_htcp.c tcp: fix cwnd undo in Reno and HTCP congestion controls 2017-08-06 21:25:10 -07:00
tcp_hybla.c
tcp_illinois.c tcp: consolidate congestion control undo functions 2017-08-06 21:25:10 -07:00
tcp_input.c tcp: implement rb-tree based retransmit queue 2017-10-07 00:28:54 +01:00
tcp_ipv4.c tcp: implement rb-tree based retransmit queue 2017-10-07 00:28:54 +01:00
tcp_lp.c tcp: switch TCP TS option (RFC 7323) to 1ms clock 2017-05-17 16:06:01 -04:00
tcp_metrics.c tcp: batch tcp_net_metrics_exit 2017-09-19 16:32:23 -07:00
tcp_minisocks.c tcp: new list for sent but unacked skbs for RACK recovery 2017-10-05 21:24:47 -07:00
tcp_nv.c tcp: consolidate congestion control undo functions 2017-08-06 21:25:10 -07:00
tcp_offload.c net: convert sock.sk_wmem_alloc from atomic_t to refcount_t 2017-07-01 07:39:08 -07:00
tcp_output.c tcp: implement rb-tree based retransmit queue 2017-10-07 00:28:54 +01:00
tcp_probe.c tcp: remove redundant argument from tcp_rcv_established() 2017-07-24 17:28:12 -07:00
tcp_rate.c tcp: export do_tcp_sendpages and tcp_rate_check_app_limited functions 2017-06-15 12:12:40 -04:00
tcp_recovery.c tcp: a small refactor of RACK loss detection 2017-10-05 21:24:47 -07:00
tcp_scalable.c tcp: consolidate congestion control undo functions 2017-08-06 21:25:10 -07:00
tcp_timer.c tcp: implement rb-tree based retransmit queue 2017-10-07 00:28:54 +01:00
tcp_ulp.c tcp: ulp: avoid module refcnt leak in tcp_set_ulp 2017-08-14 22:17:05 -07:00
tcp_vegas.c tcp: fix under-evaluated ssthresh in TCP Vegas 2017-09-29 06:07:00 +01:00
tcp_vegas.h
tcp_veno.c tcp: consolidate congestion control undo functions 2017-08-06 21:25:10 -07:00
tcp_westwood.c tcp: Revert "tcp: remove CA_ACK_SLOWPATH" 2017-08-30 11:20:08 -07:00
tcp_yeah.c tcp: consolidate congestion control undo functions 2017-08-06 21:25:10 -07:00
tcp.c tcp: implement rb-tree based retransmit queue 2017-10-07 00:28:54 +01:00
tunnel4.c
udp_diag.c net: ipv6: add second dif to udp socket lookups 2017-08-07 11:39:22 -07:00
udp_impl.h udp: make *udp*_queue_rcv_skb() functions static 2017-05-18 10:23:33 -04:00
udp_offload.c net: avoid skb_warn_bad_offload false positives on UFO 2017-08-08 21:39:01 -07:00
udp_tunnel.c net: add infrastructure to un-offload UDP tunnel port 2017-07-24 13:52:59 -07:00
udp.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-10-05 18:19:22 -07:00
udplite.c
xfrm4_input.c esp: Add a software GRO codepath 2017-02-15 11:04:11 +01:00
xfrm4_mode_beet.c networking: make skb_pull & friends return void pointers 2017-06-16 11:48:39 -04:00
xfrm4_mode_transport.c xfrm: Add encapsulation header offsets while SKB is not encrypted 2017-04-14 10:07:39 +02:00
xfrm4_mode_tunnel.c xfrm: Add encapsulation header offsets while SKB is not encrypted 2017-04-14 10:07:39 +02:00
xfrm4_output.c xfrm: Add an IPsec hardware offloading API 2017-04-14 10:06:10 +02:00
xfrm4_policy.c net: xfrm: support setting an output mark. 2017-08-11 07:03:00 +02:00
xfrm4_protocol.c xfrm: input: constify xfrm_input_afinfo 2017-02-09 10:22:17 +01:00
xfrm4_state.c xfrm: remove unused function 2017-01-10 10:57:12 +01:00
xfrm4_tunnel.c