linux/include
Eric Dumazet 39e6c8208d net: solve a NAPI race
While playing with mlx4 hardware timestamping of RX packets, I found
that some packets were received by TCP stack with a ~200 ms delay...

Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in BH handler, I was confident it was not
a sender issue, or a drop in the network.

This would happen with a very low probability, but hurting RPC
workloads.

A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.

Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
while IRQ are not disabled, we might have later an IRQ firing and
finding this bit set, right before napi_complete_done() clears it.

This can happen with busy polling users, or if gro_flush_timeout is
used. But some other uses of napi_schedule() in drivers can cause this
as well.

thread 1                                 thread 2 (could be on same cpu, or not)

// busy polling or napi_watchdog()
napi_schedule();
...
napi->poll()

device polling:
read 2 packets from ring buffer
                                          Additional 3rd packet is
available.
                                          device hard irq

                                          // does nothing because
NAPI_STATE_SCHED bit is owned by thread 1
                                          napi_schedule();

napi_complete_done(napi, 2);
rearm_irq();

Note that rearm_irq() will not force the device to send an additional
IRQ for the packet it already signaled (3rd packet in my example)

This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
can set if it could not grab NAPI_STATE_SCHED

Then napi_complete_done() properly reschedules the napi to make sure
we do not miss something.

Since we manipulate multiple bits at once, use cmpxchg() like in
sk_busy_loop() to provide proper transactions.

In v2, I changed napi_watchdog() to use a relaxed variant of
napi_schedule_prep() : No need to set NAPI_STATE_MISSED from this point.

In v3, I added more details in the changelog and clears
NAPI_STATE_MISSED in busy_poll_stop()

In v4, I added the ideas given by Alexander Duyck in v3 review

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-01 09:50:58 -08:00
..
acpi Merge branches 'acpi-bus', 'acpi-sleep' and 'acpi-processor' 2017-02-20 14:28:03 +01:00
asm-generic kprobes: move kprobe declarations to asm-generic/kprobes.h 2017-02-27 18:43:45 -08:00
clocksource
crypto
drm Less anger inducing pull request for 4.11 2017-02-23 18:58:18 -08:00
dt-bindings The usual collection of new drivers, non-critical fixes, and updates 2017-02-25 14:28:06 -08:00
keys
kvm
linux net: solve a NAPI race 2017-03-01 09:50:58 -08:00
math-emu
media scripts/spelling.txt: add "an union" pattern and fix typo instances 2017-02-27 18:43:46 -08:00
memory
misc
net scripts/spelling.txt: add "disassocation" pattern and fix typo instances 2017-02-27 18:43:47 -08:00
pcmcia
ras
rdma Merge branch 'for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2017-02-27 21:41:08 -08:00
rxrpc
scsi SCSI misc on 20170220 2017-02-21 11:51:42 -08:00
soc ARC updates for 4.11 rc1 2017-02-22 10:33:53 -08:00
sound treewide: Remove remaining executable attributes from source files 2017-02-25 12:12:50 -08:00
target First set of updates for 4.11 kernel merge window 2017-02-23 08:27:57 -08:00
trace rxrpc: Fix deadlock between call creation and sendmsg/recvmsg 2017-03-01 09:50:58 -08:00
uapi The nfsd update this round is mainly a lot of miscellaneous cleanups and 2017-02-28 15:39:09 -08:00
video
xen scripts/spelling.txt: add "an union" pattern and fix typo instances 2017-02-27 18:43:46 -08:00
Kbuild