linux/net/ipv4
Neal Cardwell 0f8782ea14 tcp_bbr: add BBR congestion control
This commit implements a new TCP congestion control algorithm: BBR
(Bottleneck Bandwidth and RTT). A detailed description of BBR will be
published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
"BBR: Congestion-Based Congestion Control".

BBR has significantly increased throughput and reduced latency for
connections on Google's internal backbone networks and google.com and
YouTube Web servers.

BBR requires only changes on the sender side, not in the network or
the receiver side. Thus it can be incrementally deployed on today's
Internet, or in datacenters.

The Internet has predominantly used loss-based congestion control
(largely Reno or CUBIC) since the 1980s, relying on packet loss as the
signal to slow down. While this worked well for many years, loss-based
congestion control is unfortunately out-dated in today's networks. On
today's Internet, loss-based congestion control causes the infamous
bufferbloat problem, often causing seconds of needless queuing delay,
since it fills the bloated buffers in many last-mile links. On today's
high-speed long-haul links using commodity switches with shallow
buffers, loss-based congestion control has abysmal throughput because
it over-reacts to losses caused by transient traffic bursts.

In 1981 Kleinrock and Gale showed that the optimal operating point for
a network maximizes delivered bandwidth while minimizing delay and
loss, not only for single connections but for the network as a
whole. Finding that optimal operating point has been elusive, since
any single network measurement is ambiguous: network measurements are
the result of both bandwidth and propagation delay, and those two
cannot be measured simultaneously.

While it is impossible to disambiguate any single bandwidth or RTT
measurement, a connection's behavior over time tells a clearer
story. BBR uses a measurement strategy designed to resolve this
ambiguity. It combines these measurements with a robust servo loop
using recent control systems advances to implement a distributed
congestion control algorithm that reacts to actual congestion, not
packet loss or transient queue delay, and is designed to converge with
high probability to a point near the optimal operating point.

In a nutshell, BBR creates an explicit model of the network pipe by
sequentially probing the bottleneck bandwidth and RTT. On the arrival
of each ACK, BBR derives the current delivery rate of the last round
trip, and feeds it through a windowed max-filter to estimate the
bottleneck bandwidth. Conversely it uses a windowed min-filter to
estimate the round trip propagation delay. The max-filtered bandwidth
and min-filtered RTT estimates form BBR's model of the network pipe.

Using its model, BBR sets control parameters to govern sending
behavior. The primary control is the pacing rate: BBR applies a gain
multiplier to transmit faster or slower than the observed bottleneck
bandwidth. The conventional congestion window (cwnd) is now the
secondary control; the cwnd is set to a small multiple of the
estimated BDP (bandwidth-delay product) in order to allow full
utilization and bandwidth probing while bounding the potential amount
of queue at the bottleneck.

When a BBR connection starts, it enters STARTUP mode and applies a
high gain to perform an exponential search to quickly probe the
bottleneck bandwidth (doubling its sending rate each round trip, like
slow start). However, instead of continuing until it fills up the
buffer (i.e. a loss), or until delay or ACK spacing reaches some
threshold (like Hystart), it uses its model of the pipe to estimate
when that pipe is full: it estimates the pipe is full when it notices
the estimated bandwidth has stopped growing. At that point it exits
STARTUP and enters DRAIN mode, where it reduces its pacing rate to
drain the queue it estimates it has created.

Then BBR enters steady state. In steady state, PROBE_BW mode cycles
between first pacing faster to probe for more bandwidth, then pacing
slower to drain any queue that created if no more bandwidth was
available, and then cruising at the estimated bandwidth to utilize the
pipe without creating excess queue. Occasionally, on an as-needed
basis, it sends significantly slower to probe for RTT (PROBE_RTT
mode).

BBR has been fully deployed on Google's wide-area backbone networks
and we're experimenting with BBR on Google.com and YouTube on a global
scale.  Replacing CUBIC with BBR has resulted in significant
improvements in network latency and application (RPC, browser, and
video) metrics. For more details please refer to our upcoming ACM
Queue publication.

Example performance results, to illustrate the difference between BBR
and CUBIC:

Resilience to random loss (e.g. from shallow buffers):
  Consider a netperf TCP_STREAM test lasting 30 secs on an emulated
  path with a 10Gbps bottleneck, 100ms RTT, and 1% packet loss
  rate. CUBIC gets 3.27 Mbps, and BBR gets 9150 Mbps (2798x higher).

Low latency with the bloated buffers common in today's last-mile links:
  Consider a netperf TCP_STREAM test lasting 120 secs on an emulated
  path with a 10Mbps bottleneck, 40ms RTT, and 1000-packet bottleneck
  buffer. Both fully utilize the bottleneck bandwidth, but BBR
  achieves this with a median RTT 25x lower (43 ms instead of 1.09
  secs).

Our long-term goal is to improve the congestion control algorithms
used on the Internet. We are hopeful that BBR can help advance the
efforts toward this goal, and motivate the community to do further
research.

Test results, performance evaluations, feedback, and BBR-related
discussions are very welcome in the public e-mail list for BBR:

  https://groups.google.com/forum/#!forum/bbr-dev

NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
enabled, since pacing is integral to the BBR design and
implementation. BBR without pacing would not function properly, and
may incur unnecessary high packet loss rates.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-21 00:23:01 -04:00
..
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-09-12 15:52:44 -07:00
af_inet.c gso: Support partial splitting at the frag_list pointer 2016-09-19 20:59:34 -04:00
ah4.c
arp.c net: rename NET_{ADD|INC}_STATS_BH() 2016-04-27 22:48:24 -04:00
cipso_ipv4.c Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/selinux into next 2016-07-07 10:15:34 +10:00
datagram.c
devinet.c netconf: add a notif when settings are created 2016-09-01 15:18:08 -07:00
esp4.c esp: Fix ESN generation under UDP encapsulation 2016-06-23 11:52:00 -04:00
fib_frontend.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-09-12 15:52:44 -07:00
fib_lookup.h
fib_rules.c net: flow: Add l3mdev flow update 2016-09-10 23:12:51 -07:00
fib_semantics.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-09-12 15:52:44 -07:00
fib_trie.c ipv4: fix value of ->nlmsg_flags reported in RTM_NEWROUTE events 2016-09-09 16:50:23 -07:00
fou.c fou: make nla_policy const 2016-09-01 14:09:00 -07:00
gre_demux.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-06-30 05:03:36 -04:00
gre_offload.c gso: Support partial splitting at the frag_list pointer 2016-09-19 20:59:34 -04:00
icmp.c net: icmp: rename ICMPMSGIN_INC_STATS_BH() 2016-04-27 22:48:23 -04:00
igmp.c net/multicast: should not send source list records when have filter mode change 2016-08-08 16:04:39 -07:00
inet_connection_sock.c timers, net/ipv4/inet: Initialize connection request timers as pinned 2016-07-07 10:35:06 +02:00
inet_diag.c net: inet: diag: expose the socket mark to privileged processes. 2016-09-08 16:13:09 -07:00
inet_fragment.c net: disable fragment reassembly if high_thresh is zero 2016-06-05 22:56:42 -04:00
inet_hashtables.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-05-04 00:52:29 -04:00
inet_timewait_sock.c timers, net/ipv4/inet: Initialize connection request timers as pinned 2016-07-07 10:35:06 +02:00
inetpeer.c
ip_forward.c net/ipv4: Introduce IPSKB_FRAG_SEGS bit to inet_skb_parm.flags 2016-07-19 16:40:22 -07:00
ip_fragment.c net: rename IP_INC_STATS_BH() 2016-04-27 22:48:23 -04:00
ip_gre.c net/ip_tunnels: Introduce tunnel_id_to_key32() and key32_to_tunnel_id() 2016-09-10 20:53:55 -07:00
ip_input.c net: original ingress device index in PKTINFO 2016-05-11 19:31:40 -04:00
ip_options.c net: ipv4: Convert IP network timestamps to be y2038 safe 2016-03-01 17:18:44 -05:00
ip_output.c net: l3mdev: remove redundant calls 2016-09-10 23:12:52 -07:00
ip_sockglue.c ipv4: accept u8 in IP_TOS ancillary data 2016-09-08 17:45:57 -07:00
ip_tunnel_core.c ip_tunnel: do not clear l4 hashes 2016-09-09 19:33:11 -07:00
ip_tunnel.c ip_tunnel: add collect_md mode to IPIP tunnel 2016-09-17 10:13:07 -04:00
ip_vti.c vti: flush x-netns xfrm cache when vti interface is removed 2016-08-09 12:57:49 -07:00
ipcomp.c
ipconfig.c net: ipconfig: Fix NULL pointer dereference on RARP/BOOTP/DHCP timeout 2016-08-22 21:04:41 -07:00
ipip.c ip_tunnel: add collect_md mode to IPIP tunnel 2016-09-17 10:13:07 -04:00
ipmr.c net: ipmr/ip6mr: update lastuse on entry change 2016-07-26 15:18:31 -07:00
Kconfig tcp_bbr: add BBR congestion control 2016-09-21 00:23:01 -04:00
Makefile tcp_bbr: add BBR congestion control 2016-09-21 00:23:01 -04:00
netfilter.c
ping.c sock: enable timestamping using control messages 2016-04-04 15:50:30 -04:00
proc.c tcp: md5: add LINUX_MIB_TCPMD5FAILURE counter 2016-08-25 16:43:11 -07:00
protocol.c
raw.c net: ipv4: Remove l3mdev_get_saddr 2016-09-10 23:12:53 -07:00
route.c net: l3mdev: remove redundant calls 2016-09-10 23:12:52 -07:00
syncookies.c net: rename NET_{ADD|INC}_STATS_BH() 2016-04-27 22:48:24 -04:00
sysctl_net_ipv4.c ipv4: Fix non-initialized TTL when CONFIG_SYSCTL=n 2016-05-23 14:32:06 -07:00
tcp_bbr.c tcp_bbr: add BBR congestion control 2016-09-21 00:23:01 -04:00
tcp_bic.c tcp: replace cnt & rtt with struct in pkts_acked() 2016-05-11 14:43:19 -04:00
tcp_cdg.c tcp: cdg: rename struct minmax in tcp_cdg.c to avoid a naming conflict 2016-09-21 00:22:59 -04:00
tcp_cong.c tcp: new CC hook to set sending rate with rate_sample in any CA state 2016-09-21 00:23:01 -04:00
tcp_cubic.c tcp: replace cnt & rtt with struct in pkts_acked() 2016-05-11 14:43:19 -04:00
tcp_dctcp.c tcp: return sizeof tcp_dctcp_info in dctcp_get_info() 2016-06-14 23:46:30 -07:00
tcp_diag.c net: diag: Fix refcnt leak in error path destroying socket 2016-08-23 23:11:36 -07:00
tcp_fastopen.c tcp: fastopen: avoid negative sk_forward_alloc 2016-09-08 16:08:10 -07:00
tcp_highspeed.c
tcp_htcp.c tcp: replace cnt & rtt with struct in pkts_acked() 2016-05-11 14:43:19 -04:00
tcp_hybla.c
tcp_illinois.c tcp: replace cnt & rtt with struct in pkts_acked() 2016-05-11 14:43:19 -04:00
tcp_input.c tcp: new CC hook to set sending rate with rate_sample in any CA state 2016-09-21 00:23:01 -04:00
tcp_ipv4.c tcp: use an RB tree for ooo receive queue 2016-09-08 17:25:58 -07:00
tcp_lp.c tcp: replace cnt & rtt with struct in pkts_acked() 2016-05-11 14:43:19 -04:00
tcp_metrics.c tcp: make nla_policy const 2016-09-01 14:09:01 -07:00
tcp_minisocks.c tcp: track application-limited rate samples 2016-09-21 00:23:00 -04:00
tcp_nv.c tcp: add NV congestion control 2016-06-10 23:07:49 -07:00
tcp_offload.c gso: Support partial splitting at the frag_list pointer 2016-09-19 20:59:34 -04:00
tcp_output.c tcp: export tcp_mss_to_mtu() for congestion control modules 2016-09-21 00:23:01 -04:00
tcp_probe.c net: ipv4: tcp_probe: Replace timespec with timespec64 2016-03-01 17:18:44 -05:00
tcp_rate.c tcp: export data delivery rate 2016-09-21 00:23:00 -04:00
tcp_recovery.c tcp: do not assume TCP code is non preemptible 2016-05-02 17:02:25 -04:00
tcp_scalable.c
tcp_timer.c tcp_timer.c: Add kernel-doc function descriptions 2016-07-15 23:18:14 -07:00
tcp_vegas.c tcp: replace cnt & rtt with struct in pkts_acked() 2016-05-11 14:43:19 -04:00
tcp_vegas.h tcp: replace cnt & rtt with struct in pkts_acked() 2016-05-11 14:43:19 -04:00
tcp_veno.c tcp: replace cnt & rtt with struct in pkts_acked() 2016-05-11 14:43:19 -04:00
tcp_westwood.c tcp: replace cnt & rtt with struct in pkts_acked() 2016-05-11 14:43:19 -04:00
tcp_yeah.c tcp: cwnd does not increase in TCP YeAH 2016-09-08 17:16:12 -07:00
tcp.c tcp: export data delivery rate 2016-09-21 00:23:00 -04:00
tunnel4.c tunnels: correct conditional build of MPLS and IPv6 2016-07-11 13:27:06 -07:00
udp_diag.c net: inet: diag: expose the socket mark to privileged processes. 2016-09-08 16:13:09 -07:00
udp_impl.h
udp_offload.c gso: Support partial splitting at the frag_list pointer 2016-09-19 20:59:34 -04:00
udp_tunnel.c net: Remove deprecated tunnel specific UDP offload functions 2016-06-17 20:23:32 -07:00
udp.c net: ipv4: Remove l3mdev_get_saddr 2016-09-10 23:12:53 -07:00
udplite.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-08-30 00:54:02 -04:00
xfrm4_input.c
xfrm4_mode_beet.c
xfrm4_mode_transport.c
xfrm4_mode_tunnel.c
xfrm4_output.c
xfrm4_policy.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-09-12 15:52:44 -07:00
xfrm4_protocol.c
xfrm4_state.c
xfrm4_tunnel.c