linux/net
Eric Dumazet 65466904b0 tcp: adjust TSO packet sizes based on min_rtt
Back when tcp_tso_autosize() and TCP pacing were introduced,
our focus was really to reduce burst sizes for long distance
flows.

The simple heuristic of using sk_pacing_rate/1024 has worked
well, but can lead to too small packets for hosts in the same
rack/cluster, when thousands of flows compete for the bottleneck.

Neal Cardwell had the idea of making the TSO burst size
a function of both sk_pacing_rate and tcp_min_rtt()

Indeed, for local flows, sending bigger bursts is better
to reduce cpu costs, as occasional losses can be repaired
quite fast.

This patch is based on Neal Cardwell implementation
done more than two years ago.
bbr is adjusting max_pacing_rate based on measured bandwidth,
while cubic would over estimate max_pacing_rate.

/proc/sys/net/ipv4/tcp_tso_rtt_log can be used to tune or disable
this new feature, in logarithmic steps.

Tested:

100Gbit NIC, two hosts in the same rack, 4K MTU.
600 flows rate-limited to 20000000 bytes per second.

Before patch: (TSO sizes would be limited to 20000000/1024/4096 -> 4 segments per TSO)

~# echo 0 >/proc/sys/net/ipv4/tcp_tso_rtt_log
~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
  96005

 Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':

         65,945.29 msec task-clock                #    2.845 CPUs utilized
         1,314,632      context-switches          # 19935.279 M/sec
             5,292      cpu-migrations            #   80.249 M/sec
           940,641      page-faults               # 14264.023 M/sec
   201,117,030,926      cycles                    # 3049769.216 GHz                   (83.45%)
    17,699,435,405      stalled-cycles-frontend   #    8.80% frontend cycles idle     (83.48%)
   136,584,015,071      stalled-cycles-backend    #   67.91% backend cycles idle      (83.44%)
    53,809,530,436      instructions              #    0.27  insn per cycle
                                                  #    2.54  stalled cycles per insn  (83.36%)
     9,062,315,523      branches                  # 137422329.563 M/sec               (83.22%)
       153,008,621      branch-misses             #    1.69% of all branches          (83.32%)

      23.182970846 seconds time elapsed

TcpInSegs                       15648792           0.0
TcpOutSegs                      58659110           0.0  # Average of 3.7 4K segments per TSO packet
TcpExtTCPDelivered              58654791           0.0
TcpExtTCPDeliveredCE            19                 0.0

After patch:

~# echo 9 >/proc/sys/net/ipv4/tcp_tso_rtt_log
~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
  96046

 Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':

         48,982.58 msec task-clock                #    2.104 CPUs utilized
           186,014      context-switches          # 3797.599 M/sec
             3,109      cpu-migrations            #   63.472 M/sec
           941,180      page-faults               # 19214.814 M/sec
   153,459,763,868      cycles                    # 3132982.807 GHz                   (83.56%)
    12,069,861,356      stalled-cycles-frontend   #    7.87% frontend cycles idle     (83.32%)
   120,485,917,953      stalled-cycles-backend    #   78.51% backend cycles idle      (83.24%)
    36,803,672,106      instructions              #    0.24  insn per cycle
                                                  #    3.27  stalled cycles per insn  (83.18%)
     5,947,266,275      branches                  # 121417383.427 M/sec               (83.64%)
        87,984,616      branch-misses             #    1.48% of all branches          (83.43%)

      23.281200256 seconds time elapsed

TcpInSegs                       1434706            0.0
TcpOutSegs                      58883378           0.0  # Average of 41 4K segments per TSO packet
TcpExtTCPDelivered              58878971           0.0
TcpExtTCPDeliveredCE            9664               0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20220309015757.2532973-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-03-09 20:05:44 -08:00
..
6lowpan net: don't include ndisc.h from ipv6.h 2022-02-04 14:15:11 -08:00
9p virtio,vdpa,qemu_fw_cfg: features, cleanups, fixes 2022-01-18 10:05:48 +02:00
802 net: 802: Use memset_startat() to clear struct fields 2021-11-19 11:23:23 +00:00
8021q Revert "vlan: move dev_put into vlan_dev_uninit" 2022-02-23 15:21:13 -08:00
appletalk
atm proc: remove PDE_DATA() completely 2022-01-22 08:33:37 +02:00
ax25 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-02-10 17:29:56 -08:00
batman-adv batman-adv: Use netif_rx(). 2022-03-07 11:40:41 +00:00
bluetooth bluetooth: Use netif_rx(). 2022-03-07 11:40:41 +00:00
bpf bpf, test_run: Fix overflow in XDP frags bpf_test_finish 2022-03-02 01:09:15 +01:00
bpfilter
bridge net: bridge: Use netif_rx(). 2022-03-04 12:02:19 +00:00
caif net: caif: Use netif_rx(). 2022-03-04 12:02:19 +00:00
can can: Use netif_rx(). 2022-03-06 11:05:31 +00:00
ceph libceph: optionally use bounce buffer on recv path in crc mode 2022-02-02 18:50:36 +01:00
core SO_ZEROCOPY should return -EOPNOTSUPP rather than -ENOTSUPP 2022-03-08 22:14:49 -08:00
dcb net: dcb: disable softirqs in dcbnl_flush_dev() 2022-03-03 08:01:55 -08:00
dccp dccp: remove max48() 2022-01-27 13:53:27 +00:00
decnet net: decnet: use time_is_before_jiffies() instead of open coding it 2022-02-28 13:21:32 +00:00
dns_resolver
dsa net: dsa: felix: avoid early deletion of host FDB entries 2022-03-09 11:12:10 +00:00
ethernet gro: remove rcu_read_lock/rcu_read_unlock from gro_complete handlers 2021-11-24 17:21:42 -08:00
ethtool ethtool: add support to set/get completion queue event size 2022-02-23 20:33:05 -08:00
hsr flow_dissector: Add support for HSR 2022-03-02 22:44:49 -08:00
ieee802154 net: ipv6: Handle delivery_time in ipv6 defrag 2022-03-03 14:38:48 +00:00
ife
ipv4 tcp: adjust TSO packet sizes based on min_rtt 2022-03-09 20:05:44 -08:00
ipv6 skb: make drop reason booleanable 2022-03-09 11:22:58 +00:00
iucv s390/iucv: sort out physical vs virtual pointers usage 2022-02-22 16:09:13 -08:00
kcm net: Don't include filter.h from net/sock.h 2021-12-29 08:48:14 -08:00
key xfrm: Check if_id in xfrm_migrate 2022-01-26 07:44:01 +01:00
l2tp l2tp: add netns refcount tracker to l2tp_dfs_seq_data 2021-12-10 06:38:27 -08:00
l3mdev
lapb
llc sock: Use sock_owned_by_user_nocheck() instead of sk_lock.owned. 2021-12-10 19:43:00 -08:00
mac80211 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-03-03 11:55:12 -08:00
mac802154 mac802154: use dev_addr_set() - manual 2021-10-20 14:27:40 +01:00
mctp mctp: Avoid warning if unregister notifies twice 2022-02-25 22:23:23 -08:00
mpls net: mpls: Fix GCC 12 warning 2022-02-10 15:29:39 +00:00
mptcp mptcp: add fullmesh flag check for adding address 2022-03-08 22:06:12 -08:00
ncsi all: replace find_next{,_zero}_bit with find_first{,_zero}_bit where appropriate 2022-01-15 08:47:31 -08:00
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-03-03 11:55:12 -08:00
netlabel lsm: security_task_getsecid_subj() -> security_current_getsecid_subj() 2021-11-22 17:52:47 -05:00
netlink net: Don't include filter.h from net/sock.h 2021-12-29 08:48:14 -08:00
netrom netrom: fix api breakage in nr_setsockopt() 2022-01-07 14:11:05 +00:00
nfc nfc: llcp: Revert "NFC: Keep socket alive until the DISC PDU is actually sent" 2022-03-03 10:43:37 +00:00
nsh
openvswitch net: Add skb_clear_tstamp() to keep the mono delivery_time 2022-03-03 14:38:48 +00:00
packet net: Handle delivery_time in skb->tstamp during network tapping with af_packet 2022-03-03 14:38:48 +00:00
phonet phonet: Use netif_rx(). 2022-03-07 11:40:41 +00:00
psample
qrtr bus: mhi: core: Add an API for auto queueing buffers for DL channel 2021-12-17 17:17:14 +01:00
rds Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-12-16 16:13:19 -08:00
rfkill rfkill: allow to get the software rfkill state 2021-12-20 11:02:38 +01:00
rose net: Don't include filter.h from net/sock.h 2021-12-29 08:48:14 -08:00
rxrpc rxrpc: Adjust retransmission backoff 2022-01-22 02:03:24 +00:00
sched Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-03-03 11:55:12 -08:00
sctp Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-01-05 14:36:10 -08:00
smc net/smc: fix -Wmissing-prototypes warning when CONFIG_SYSCTL not set 2022-03-09 20:02:35 -08:00
strparser bpf: sockmap, strparser, and tls are reusing qdisc_skb_cb and colliding 2021-11-09 01:05:28 +01:00
sunrpc Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-02-10 17:29:56 -08:00
switchdev net: switchdev: remove lag_mod_cb from switchdev_handle_fdb_event_to_device 2022-02-24 21:31:43 -08:00
tipc tipc: Use netif_rx(). 2022-03-07 11:40:41 +00:00
tls tls: cap the output scatter list to something reasonable 2022-02-04 10:14:07 +00:00
unix Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2022-01-24 15:42:29 -08:00
vmw_vsock vsock: remove vsock from connected table when connect is interrupted by a signal 2022-02-17 08:56:02 -08:00
wireless wireless: Use netif_rx(). 2022-03-06 11:05:31 +00:00
x25 net: x25: drop harmless check of !more 2021-12-09 18:35:11 -08:00
xdp i40e: xsk: Move tmp desc array from driver to pool 2022-01-27 17:25:32 +01:00
xfrm Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-03-03 11:55:12 -08:00
compat.c
devres.c
Kconfig page_pool: Add allocation stats 2022-03-03 09:55:28 +00:00
Kconfig.debug net: add networking namespace refcount tracker 2021-12-10 06:38:26 -08:00
Makefile
socket.c net: fix documentation for kernel_getsockname 2022-02-14 14:01:19 +00:00
sysctl_net.c sections: move and rename core_kernel_data() to is_kernel_core_data() 2021-11-09 10:02:50 -08:00