linux/net/ipv4
Eric Dumazet d4589926d7 tcp: refine TSO splits
While investigating performance problems on small RPC workloads,
I noticed linux TCP stack was always splitting the last TSO skb
into two parts (skbs). One being a multiple of MSS, and a small one
with the Push flag. This split is done even if TCP_NODELAY is set,
or if no small packet is in flight.

Example with request/response of 4K/4K

IP A > B: . ack 68432 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: . 65537:68433(2896) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: P 68433:69633(1200) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
IP B > A: . ack 68433 win 2768 <nop,nop,timestamp 6525001 6524593>
IP B > A: . 69632:72528(2896) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
IP B > A: P 72528:73728(1200) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
IP A > B: . ack 72528 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: . 69633:72529(2896) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: P 72529:73729(1200) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>

We can avoid this split by including the Nagle tests at the right place.

Note : If some NIC had trouble sending TSO packets with a partial
last segment, we would have hit the problem in GRO/forwarding workload already.

tcp_minshall_update() is moved to tcp_output.c and is updated as we might
feed a TSO packet with a partial last segment.

This patch tremendously improves performance, as the traffic now looks
like :

IP A > B: . ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
IP A > B: P 94209:98305(4096) ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
IP B > A: . ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
IP B > A: P 98304:102400(4096) ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
IP A > B: . ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
IP A > B: P 98305:102401(4096) ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
IP B > A: . ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
IP B > A: P 102400:106496(4096) ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
IP A > B: . ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
IP A > B: P 102401:106497(4096) ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
IP B > A: . ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
IP B > A: P 106496:110592(4096) ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>

Before :

lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
280774

 Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':

     205719.049006 task-clock                #    9.278 CPUs utilized
         8,449,968 context-switches          #    0.041 M/sec
         1,935,997 CPU-migrations            #    0.009 M/sec
           160,541 page-faults               #    0.780 K/sec
   548,478,722,290 cycles                    #    2.666 GHz                     [83.20%]
   455,240,670,857 stalled-cycles-frontend   #   83.00% frontend cycles idle    [83.48%]
   272,881,454,275 stalled-cycles-backend    #   49.75% backend  cycles idle    [66.73%]
   166,091,460,030 instructions              #    0.30  insns per cycle
                                             #    2.74  stalled cycles per insn [83.39%]
    29,150,229,399 branches                  #  141.699 M/sec                   [83.30%]
     1,943,814,026 branch-misses             #    6.67% of all branches         [83.32%]

      22.173517844 seconds time elapsed

lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
IpOutRequests                   16851063           0.0
IpExtOutOctets                  23878580777        0.0

After patch :

lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
280877

 Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':

     107496.071918 task-clock                #    4.847 CPUs utilized
         5,635,458 context-switches          #    0.052 M/sec
         1,374,707 CPU-migrations            #    0.013 M/sec
           160,920 page-faults               #    0.001 M/sec
   281,500,010,924 cycles                    #    2.619 GHz                     [83.28%]
   228,865,069,307 stalled-cycles-frontend   #   81.30% frontend cycles idle    [83.38%]
   142,462,742,658 stalled-cycles-backend    #   50.61% backend  cycles idle    [66.81%]
    95,227,712,566 instructions              #    0.34  insns per cycle
                                             #    2.40  stalled cycles per insn [83.43%]
    16,209,868,171 branches                  #  150.795 M/sec                   [83.20%]
       874,252,952 branch-misses             #    5.39% of all branches         [83.37%]

      22.175821286 seconds time elapsed

lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
IpOutRequests                   11239428           0.0
IpExtOutOctets                  23595191035        0.0

Indeed, the occupancy of tx skbs (IpExtOutOctets/IpOutRequests) is higher :
2099 instead of 1417, thus helping GRO to be more efficient when using FQ packet
scheduler.

Many thanks to Neal for review and ideas.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Van Jacobson <vanj@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17 15:15:25 -05:00
..
netfilter ipv4/ipv6: Fix FSF address in file headers 2013-12-06 12:37:56 -05:00
af_inet.c net-gro: Prepare GRO stack for the upcoming tunneling support 2013-12-12 13:47:53 -05:00
ah4.c ipv4: properly refresh rtable entries on pmtu/redirect events 2013-06-03 00:07:42 -07:00
arp.c ipv4: fix wildcard search with inet_confirm_addr() 2013-12-11 14:47:40 -05:00
cipso_ipv4.c ipv4/ipv6: Fix FSF address in file headers 2013-12-06 12:37:56 -05:00
datagram.c ipv4: fix possible seqlock deadlock 2013-11-14 17:31:14 -05:00
devinet.c netconf: add proxy-arp support 2013-12-14 00:58:22 -05:00
esp4.c net: esp{4,6}: get rid of struct esp_data 2013-10-29 06:39:42 +01:00
fib_frontend.c fib_trie: remove duplicated rcu lock 2013-10-18 13:53:59 -04:00
fib_lookup.h net: ipv4/ipv6: Remove extern from function prototypes 2013-10-19 19:12:11 -04:00
fib_rules.c fib_rules: fix suppressor names and default values 2013-08-03 10:40:23 -07:00
fib_semantics.c fib: Use const struct nl_info * in rtmsg_fib 2013-10-18 14:42:15 -04:00
fib_trie.c seq_file: remove "%n" usage from seq_file users 2013-11-15 09:32:20 +09:00
gre_demux.c ipv4: generalize gre_handle_offloads 2013-10-19 19:36:18 -04:00
gre_offload.c ipip: add GSO/TSO support 2013-10-19 19:36:19 -04:00
icmp.c ipv4: processing ancillary IP_TOS or IP_TTL 2013-09-28 15:21:52 -07:00
igmp.c ipv4 igmp: use in_dev_put in timer handlers instead of __in_dev_put 2013-09-30 22:28:56 -07:00
inet_connection_sock.c inet: rename ir_loc_port to ir_num 2013-10-10 14:37:35 -04:00
inet_diag.c inet_diag: use sock_gen_put() 2013-10-17 15:02:02 -04:00
inet_fragment.c inet: remove old fragmentation hash initializing 2013-10-23 17:01:41 -04:00
inet_hashtables.c inet: convert inet_ehash_secret and ipv6_hash_secret to net_get_random_once 2013-10-19 19:45:35 -04:00
inet_lro.c ipv4: replace ip_fast_csum with csum_replace2 2013-03-15 09:12:25 -04:00
inet_timewait_sock.c tcp/dccp: remove twchain 2013-10-08 23:19:24 -04:00
inetpeer.c ip: generate unique IP identificator if local fragmentation is allowed 2013-09-19 14:11:15 -04:00
ip_forward.c ipv4: introduce rt_uses_gateway 2012-10-08 17:42:36 -04:00
ip_fragment.c ipv4: initialize ip4_frags hash secret as late as possible 2013-10-23 17:01:40 -04:00
ip_gre.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-08-16 15:37:26 -07:00
ip_input.c net: add SNMP counters tracking incoming ECN bits 2013-08-08 22:24:59 -07:00
ip_options.c net/ipv4: Ensure that location of timestamp option is stored 2013-03-12 05:35:39 -04:00
ip_output.c ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE 2013-11-05 21:52:27 -05:00
ip_sockglue.c net: more spelling fixes 2013-12-10 21:57:11 -05:00
ip_tunnel_core.c ipv4: generalize gre_handle_offloads 2013-10-19 19:36:18 -04:00
ip_tunnel.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-11-19 15:50:47 -08:00
ip_vti.c xfrm: Release dst if this dst is improper for vti tunnel 2013-11-19 15:50:57 -05:00
ipcomp.c ipv4: properly refresh rtable entries on pmtu/redirect events 2013-06-03 00:07:42 -07:00
ipconfig.c ipconfig: add informative timeout messages while waiting for carrier 2013-04-02 14:35:33 -04:00
ipip.c ipip: add GSO/TSO support 2013-10-19 19:36:19 -04:00
ipmr.c neigh: restore old behaviour of default parms values 2013-12-09 20:56:12 -05:00
Kconfig net: neighbour: Remove CONFIG_ARPD 2013-09-03 21:41:43 -04:00
Makefile net: gre: move GSO functions to gre_offload 2013-07-03 14:37:39 -07:00
netfilter.c netfilter: add my copyright statements 2013-04-18 20:27:55 +02:00
ping.c inet: fix possible seqlock deadlocks 2013-11-29 16:37:36 -05:00
proc.c tcp: auto corking 2013-12-06 12:51:41 -05:00
protocol.c net: remove outdated comment for ipv4 and ipv6 protocol handler 2013-11-28 18:47:51 -05:00
raw.c inet: fix addr_len/msg->msg_namelen assignment in recv_error and rxpmtu functions 2013-11-23 14:46:23 -08:00
route.c ipv4: fix race in concurrent ip_route_input_slow() 2013-11-20 15:28:44 -05:00
syncookies.c inet: split syncookie keys for ipv4 and ipv6 and initialize with net_get_random_once 2013-10-19 19:45:35 -04:00
sysctl_net_ipv4.c tcp: auto corking 2013-12-06 12:51:41 -05:00
tcp_bic.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp_cong.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp_cubic.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp_diag.c
tcp_fastopen.c tcp: enable sockets to use MSG_FASTOPEN by default 2013-11-04 19:57:47 -05:00
tcp_highspeed.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp_htcp.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp_hybla.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp_illinois.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp_input.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp_ipv4.c inet: fix possible seqlock deadlocks 2013-11-29 16:37:36 -05:00
tcp_lp.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp_memcontrol.c tcp_memcontrol: Cleanup/fix cg_proto->memory_pressure handling. 2013-12-05 21:01:01 -05:00
tcp_metrics.c genetlink: only pass array to genl_register_family_with_ops() 2013-11-19 16:39:05 -05:00
tcp_minisocks.c ipv6: make lookups simpler and faster 2013-10-09 00:01:25 -04:00
tcp_offload.c net-gro: Prepare GRO stack for the upcoming tunneling support 2013-12-12 13:47:53 -05:00
tcp_output.c tcp: refine TSO splits 2013-12-17 15:15:25 -05:00
tcp_probe.c ipv6: make lookups simpler and faster 2013-10-09 00:01:25 -04:00
tcp_scalable.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp_timer.c tcp: temporarily disable Fast Open on SYN timeout 2013-10-29 22:50:41 -04:00
tcp_vegas.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp_vegas.h net: ipv4/ipv6: Remove extern from function prototypes 2013-10-19 19:12:11 -04:00
tcp_veno.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp_westwood.c tcp: refactor F-RTO 2013-03-21 11:47:50 -04:00
tcp_yeah.c tcp: properly handle stretch acks in slow start 2013-11-04 19:57:59 -05:00
tcp.c tcp: auto corking 2013-12-06 12:51:41 -05:00
tunnel4.c
udp_diag.c netlink: rename ssk to sk in struct netlink_skb_params 2013-04-19 14:57:56 -04:00
udp_impl.h net: ipv4/ipv6: Remove extern from function prototypes 2013-10-19 19:12:11 -04:00
udp_offload.c ipip: add GSO/TSO support 2013-10-19 19:36:19 -04:00
udp.c inet: fix possible seqlock deadlocks 2013-11-29 16:37:36 -05:00
udplite.c
xfrm4_input.c net: Add skb_unclone() helper function. 2013-02-15 15:10:37 -05:00
xfrm4_mode_beet.c
xfrm4_mode_transport.c
xfrm4_mode_tunnel.c Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2013-09-30 15:24:57 -04:00
xfrm4_output.c xfrm: revert ipv4 mtu determination to dst_mtu 2013-08-26 12:40:53 +02:00
xfrm4_policy.c xfrm: Fix null pointer dereference when decoding sessions 2013-11-01 07:08:46 +01:00
xfrm4_state.c xfrm: make local error reporting more robust 2013-08-14 13:07:12 +02:00
xfrm4_tunnel.c sit: add IPv4 over IPv4 support 2013-05-31 17:19:05 -07:00