linux/drivers/net/ethernet
Francois Romieu 509708310c r8169: Add support for interrupt coalesce tuning (ethtool -C)
Kirr: In particular with

	ethtool -C <ifname> rx-usecs 0 rx-frames 0

now it is possible to disable RX delays when NIC usage requires low-latency.

See this thread for context:

	https://www.spinics.net/lists/netdev/msg217665.html

My specific case is that:

We have many computers with gigabit Realtek NICs. For 2 such computers
connected to a gigabit store-and-forward switch the minimum round-trip
time for small pings (`ping -i 0 -w 3 -s 56 -q peer`) is ~ 30μs.

However it turned out that when Ethernet frame length transitions 127 ->
128 bytes (`ping -i 0 -w 3 -s {81 -> 82} -q peer`) the lowest RTT
transitions step-wise to ~ 270μs.

As David Laight said, this latency is created by RX interrupt mitigation
done by the NIC. For workloads where low latency is required, with e.g.
Intel, BCM etc. NIC drivers one just uses `ethtool -C rx-usecs ...` to
reduce the time the NIC delays before interrupting the CPU, but it turned
out `ethtool -C` is not supported by the r8169 driver.

Like Stéphane ANCELOT I've traced the problem down to IntrMitigate being
hardcoded to != 0 for our chips (we have 8168 based NICs):

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/realtek/r8169.c#n5460
static void rtl_hw_start_8169(struct net_device *dev) {
        ...
        /*
         * Undocumented corner. Supposedly:
         * (TxTimer << 12) | (TxPackets << 8) | (RxTimer << 4) | RxPackets
         */
        RTL_W16(IntrMitigate, 0x0000);

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/realtek/r8169.c#n6346
static void rtl_hw_start_8168(struct net_device *dev) {
        ...
        RTL_W16(IntrMitigate, 0x5151);

and then I've also found

	https://www.spinics.net/lists/netdev/msg217665.html

and Francois' original patch:

	https://www.spinics.net/lists/netdev/msg217984.html
	https://www.spinics.net/lists/netdev/msg218207.html

So could we please finally get support for tuning r8169 interrupt
coalescing in tree? (so that the next poor soul who hits the problem does
not need to dig all the way through the driver sources and the internet
and finally patch locally

        -RTL_W16(IntrMitigate, 0x5151);
        +RTL_W16(IntrMitigate, 0x5100);

guessing whether it is right or not and also having to care to deploy
the patch everywhere it needs to be used, etc...).
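
To make the two hardcoded values and the local patch above easier to read, here is a minimal sketch that splits an IntrMitigate value into four 4-bit fields. It assumes the layout from the driver comment, which the driver itself marks as "Supposedly"; the names `IntrMitigateFields`, `decode` and `encode` are illustrative and are not part of the driver, and the time unit of the timer fields is not stated here.

```python
from typing import NamedTuple


class IntrMitigateFields(NamedTuple):
    """Four 4-bit fields, assuming the layout from the driver comment:
    (TxTimer << 12) | (TxPackets << 8) | (RxTimer << 4) | RxPackets
    """
    tx_timer: int
    tx_packets: int
    rx_timer: int
    rx_packets: int


def decode(value: int) -> IntrMitigateFields:
    """Split a 16-bit IntrMitigate value into its four nibbles."""
    return IntrMitigateFields(
        tx_timer=(value >> 12) & 0xF,
        tx_packets=(value >> 8) & 0xF,
        rx_timer=(value >> 4) & 0xF,
        rx_packets=value & 0xF,
    )


def encode(fields: IntrMitigateFields) -> int:
    """Pack the four nibbles back into a 16-bit value."""
    return (
        (fields.tx_timer << 12)
        | (fields.tx_packets << 8)
        | (fields.rx_timer << 4)
        | fields.rx_packets
    )


# The 8169 default, the 8168 default, and the locally patched value.
for value in (0x0000, 0x5151, 0x5100):
    print(f"{value:#06x} -> {decode(value)}")
```

Under that assumed layout, 0x5151 sets both the RX and TX timer and packet fields, while 0x5100 keeps the TX half and zeroes the RX half, which is what `rx-usecs 0 rx-frames 0` asks for.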

To do so I've taken Francois's original patch from 2012 and reworked it a bit:

- updated to latest net-next.git;
- adjusted scaling setup based on feedback from Hayes to pick up scaling
  vector depending not only on link speed but also on CPlusCmd[0:1] and to
  adjust CPlusCmd[0:1] correspondingly when setting timings;
- slightly improved error handling (I think).

I've tested the patch on "RTL8168d/8111d" (XID 083000c0). With it and
`ethtool -C rx-usecs 0 rx-frames 0` on both ends, it improves:

- minimum RTT latency:

        ~270μs ->  ~30μs (small packet),
        ~330μs -> ~110μs (full 1.5K ethernet frame)

- average RTT latency:

        ~480μs ->  ~50μs (small packet),
        ~560μs -> ~125μs (full 1.5K ethernet frame)

( before:

        root@neo1:# ping -i 0 -w 3 -s 82 -q neo2
        PING neo2.kirr.nexedi.com (192.168.102.21) 82(110) bytes of data.

        --- neo2.kirr.nexedi.com ping statistics ---
        5906 packets transmitted, 5905 received, 0% packet loss, time 2999ms
        rtt min/avg/max/mdev = 0.274/0.485/0.607/0.026 ms, ipg/ewma 0.508/0.489 ms

        root@neo1:# ping -i 0 -w 3 -s 1472 -q neo2
        PING neo2.kirr.nexedi.com (192.168.102.21) 1472(1500) bytes of data.

        --- neo2.kirr.nexedi.com ping statistics ---
        5073 packets transmitted, 5073 received, 0% packet loss, time 2999ms
        rtt min/avg/max/mdev = 0.330/0.566/0.710/0.028 ms, ipg/ewma 0.591/0.544 ms

  after:

        root@neo1:# ping -i 0 -w 3 -s 82 -q neo2
        PING neo2.kirr.nexedi.com (192.168.102.21) 82(110) bytes of data.

        --- neo2.kirr.nexedi.com ping statistics ---
        45815 packets transmitted, 45815 received, 0% packet loss, time 3000ms
        rtt min/avg/max/mdev = 0.036/0.051/0.368/0.010 ms, ipg/ewma 0.065/0.053 ms

        root@neo1:# ping -i 0 -w 3 -s 1472 -q neo2
        PING neo2.kirr.nexedi.com (192.168.102.21) 1472(1500) bytes of data.

        --- neo2.kirr.nexedi.com ping statistics ---
        21250 packets transmitted, 21250 received, 0% packet loss, time 3000ms
        rtt min/avg/max/mdev = 0.112/0.125/0.390/0.007 ms, ipg/ewma 0.141/0.125 ms

  the small -> 1.5K latency growth is understandable: it takes ~15μs
  to transmit 1.5K on the wire at 1Gbps, and with 2 hosts, 1 switch
  and ICMP ECHO + ECHO reply the packet has to cross 4 Ethernet
  segments, which is already ~60μs;

  probably something else also contributes, as e.g. on Linux, even
  with `cpupower frequency-set -g performance`, on some computers I've
  noticed that the kernel can spend more time in software-only mode
  when incoming packets arrive less frequently. E.g. this program can
  demonstrate the effect for ICMP ECHO processing:

  https://lab.nexedi.com/kirr/bcc/blob/43cfc13b/tools/pinglat.py

  (later this was found to be partly due to C-states exit latencies) )
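
The wire-time estimate in the note above can be sanity-checked with a short calculation. This is a rough sketch, not a measurement: the 38 bytes of per-frame overhead (14 header + 4 FCS + 8 preamble/SFD + 12 inter-frame gap) and the function name `wire_time_us` are assumptions added here, and switch and host processing time are ignored. With these assumptions a full frame takes roughly 12μs per segment, the same order as the ~15μs quoted above.

```python
def wire_time_us(ip_len_bytes: int,
                 link_bps: float = 1e9,
                 overhead_bytes: int = 38) -> float:
    """Time to serialize one Ethernet frame onto the wire, in microseconds.

    overhead_bytes is an assumption: Ethernet header (14) + FCS (4)
    + preamble/SFD (8) + inter-frame gap (12).
    """
    bits = (ip_len_bytes + overhead_bytes) * 8
    return bits / link_bps * 1e6


# host -> switch -> host for ECHO, and the same path back for ECHO reply;
# a store-and-forward switch receives the whole frame before resending it.
SEGMENTS = 4

# IP packet sizes reported by ping for `-s 82` and `-s 1472`.
for ip_len in (110, 1500):
    per_segment = wire_time_us(ip_len)
    print(f"{ip_len:5d} bytes: {per_segment:6.2f} us per segment, "
          f"{SEGMENTS * per_segment:6.2f} us over {SEGMENTS} segments")
```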

We have had this patch running in our testing setup for 1 month already
without any issues observed.

It remains to be clarified whether RX and TX timers use the same base.
For now I've set them equally, but Francois's original patch version
suggests they might differ.

I've received no feedback at all on my original posting of this patch and
questions

	https://www.spinics.net/lists/netdev/msg457173.html

neither from Francois nor from anyone at Realtek, in one month.

So I suggest we simply apply it to net-next.git now.

Cc: Francois Romieu <romieu@fr.zoreil.com>
Cc: Hayes Wang <hayeswang@realtek.com>
Cc: Realtek linux nic maintainers <nic_swsd@realtek.com>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Stéphane ANCELOT <sancelot@free.fr>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-10-29 11:07:58 +09:00
..
3com drivers/net: 3com/3c515: Convert timers to use timer_setup() 2017-10-27 12:09:15 +09:00
8390 drivers/net: 8390: Convert timers to use timer_setup() 2017-10-28 19:09:49 +09:00
adaptec
adi drivers: net: adi: use setup_timer() helper. 2017-09-21 11:44:43 -07:00
aeroflex
agere drivers: net: et131x: use setup_timer() helper. 2017-09-21 11:44:39 -07:00
alacritech
allwinner
alteon
altera
amazon Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-10-22 13:39:14 +01:00
amd drivers/net: amd: Convert timers to use timer_setup() 2017-10-28 19:09:49 +09:00
apm drivers: net: xgene: Remove return statement from void function 2017-09-05 14:58:25 -07:00
apple net: ethernet: apple: Convert timers to use timer_setup() 2017-10-18 12:40:25 +01:00
aquantia net: aquantia: Bad udp rate on default interrupt coalescing 2017-10-21 12:32:24 +01:00
arc
atheros
aurora
broadcom bnxt_en: Fix randconfig build errors. 2017-10-28 18:24:15 +09:00
brocade bna: Convert timers to use timer_setup() 2017-10-18 12:39:38 +01:00
cadence
calxeda
cavium liquidio: fix kernel panic in VF driver 2017-10-28 18:52:46 +09:00
chelsio drivers/net: chelsio/cxgb*: Convert timers to use timer_setup() 2017-10-28 19:09:49 +09:00
cirrus
cisco drivers: net: enic: use setup_timer() helper. 2017-09-21 11:44:44 -07:00
davicom davicom: Display proper debug level up to 6 2017-09-08 20:53:10 -07:00
dec net: tulip: Convert timers to use timer_setup() 2017-10-18 12:39:38 +01:00
dlink drivers/net: dlink: Convert timers to use timer_setup() 2017-10-28 19:09:49 +09:00
emulex be2net: fix TSO6/GSO issue causing TX-stall on Lancer/BEx 2017-09-13 09:28:18 -07:00
ezchip
faraday net: faraday: ftmac100: Use BUG_ON instead of if condition followed by BUG. 2017-10-27 23:53:14 +09:00
freescale dpaa_eth: remove obsolete comment 2017-10-18 13:44:47 +01:00
fujitsu
hisilicon net: hns3: fix the bug when reuse command description in hclge_add_mac_vlan_tbl 2017-10-26 17:25:35 +09:00
hp
huawei net-next/hinic: Fix a case of Tx Queue is Stopped forever 2017-09-28 10:26:50 -07:00
i825xx dma-mapping updates for 4.14: 2017-09-12 13:30:06 -07:00
ibm ibmvnic: Fix failover error path for non-fatal resets 2017-10-28 00:23:58 +09:00
intel Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-10-22 13:39:14 +01:00
marvell Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-10-05 18:19:22 -07:00
mediatek drivers, net, ethernet: convert mtk_eth.dma_refcnt from atomic_t to refcount_t 2017-10-22 02:22:38 +01:00
mellanox drivers/net: mellanox: Convert timers to use timer_setup() 2017-10-28 19:09:49 +09:00
micrel net: ksz884x: Convert timers to use timer_setup() 2017-10-18 12:39:39 +01:00
microchip
moxa
myricom
natsemi drivers/net: natsemi: Convert timers to use timer_setup() 2017-10-28 19:09:49 +09:00
neterion net: neterion: Convert timers to use timer_setup() 2017-10-18 12:40:26 +01:00
netronome nfp: inform the VF driver needs to be restarted after changing the MAC 2017-10-28 18:59:48 +09:00
nuvoton drivers/net: nuvoton: Convert timers to use timer_setup() 2017-10-27 12:09:15 +09:00
nvidia forcedeth: Convert timers to use timer_setup() 2017-10-18 12:39:39 +01:00
nxp
oki-semi pch_gbe: Switch to new PCI IRQ allocation API 2017-10-16 21:12:32 +01:00
packetengines drivers/net: packetengines: Convert timers to use timer_setup() 2017-10-28 19:09:49 +09:00
pasemi
qlogic qed: Fix iWARP out of order flow 2017-10-19 12:46:43 +01:00
qualcomm net: qualcomm: rmnet: Add support for GRO 2017-10-28 00:10:23 +09:00
rdc
realtek r8169: Add support for interrupt coalesce tuning (ethtool -C) 2017-10-29 11:07:58 +09:00
renesas net: sh_eth: implement R-Car Gen[12] fallback compatibility strings 2017-10-20 08:32:24 +01:00
rocker rocker: fix rocker_tlv_put_* functions for KASAN 2017-09-25 20:18:27 -07:00
samsung drivers/net: sxgbe: Convert timers to use timer_setup() 2017-10-27 12:09:16 +09:00
seeq net: seeq: Convert timers to use timer_setup() 2017-10-18 12:40:26 +01:00
sfc net: ethernet/sfc: Convert timers to use timer_setup() 2017-10-25 12:57:33 +09:00
sgi net/ethernet/sgi: Convert timers to use timer_setup() 2017-10-18 12:40:26 +01:00
silan
sis drivers/net: sis: Convert timers to use timer_setup() 2017-10-25 13:09:47 +09:00
smsc drivers/net: smsc: Convert timers to use timer_setup() 2017-10-28 19:09:50 +09:00
stmicro stmmac: copy unicast mac address to MAC registers 2017-10-28 19:04:29 +09:00
sun net: ethernet: sun: Convert timers to use timer_setup() 2017-10-18 12:40:26 +01:00
synopsys
tehuti
ti net/ti/tlan: Convert timers to use timer_setup() 2017-10-18 12:39:36 +01:00
tile
toshiba drivers: net: spider_net: use setup_timer() helper. 2017-09-21 11:44:40 -07:00
tundra
via dmi: Mark all struct dmi_system_id instances const 2017-09-14 11:59:30 +02:00
wiznet
xilinx
xircom
xscale
dnet.c
dnet.h
ec_bhf.c
ethoc.c
fealnx.c drivers/net: fealnx: Convert timers to use timer_setup() 2017-10-28 19:09:49 +09:00
jme.c
jme.h
Kconfig
korina.c drivers/net: korina: Convert timers to use timer_setup() 2017-10-28 19:09:49 +09:00
lantiq_etop.c
Makefile
netx-eth.c