mainlining shenanigans
Go to file
Vincent Ray a54ce37036 net: sched: fixed barrier to prevent skbuff sticking in qdisc backlog
In qdisc_run_begin(), smp_mb__before_atomic() used before test_bit()
does not provide any ordering guarantee as test_bit() is not an atomic
operation. This, added to the fact that the spin_trylock() call at
the beginning of qdisc_run_begin() does not guarantee acquire
semantics if it does not grab the lock, makes it possible for the
following statement :

if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))

to be executed before an enqueue operation called before
qdisc_run_begin().

As a result the following race can happen :

           CPU 1                             CPU 2

      qdisc_run_begin()               qdisc_run_begin() /* true */
        set(MISSED)                            .
      /* returns false */                      .
          .                            /* sees MISSED = 1 */
          .                            /* so qdisc not empty */
          .                            __qdisc_run()
          .                                    .
          .                              pfifo_fast_dequeue()
 ----> /* may be done here */                  .
|         .                                clear(MISSED)
|         .                                    .
|         .                                smp_mb __after_atomic();
|         .                                    .
|         .                                /* recheck the queue */
|         .                                /* nothing => exit   */
|   enqueue(skb1)
|         .
|   qdisc_run_begin()
|         .
|     spin_trylock() /* fail */
|         .
|     smp_mb__before_atomic() /* not enough */
|         .
 ---- if (test_bit(MISSED))
        return false;   /* exit */

In the above scenario, CPU 1 and CPU 2 both try to grab the
qdisc->seqlock at the same time. Only CPU 2 succeeds and enters the
bypass code path, where it emits its skb then calls __qdisc_run().

CPU1 fails, sets MISSED and goes down the traditionnal enqueue() +
dequeue() code path. But when executing qdisc_run_begin() for the
second time, after enqueuing its skbuff, it sees the MISSED bit still
set (by itself) and consequently chooses to exit early without setting
it again nor trying to grab the spinlock again.

Meanwhile CPU2 has seen MISSED = 1, cleared it, checked the queue
and found it empty, so it returned.

At the end of the sequence, we end up with skb1 enqueued in the
backlog, both CPUs out of __dev_xmit_skb(), the MISSED bit not set,
and no __netif_schedule() called made. skb1 will now linger in the
qdisc until somebody later performs a full __qdisc_run(). Associated
to the bypass capacity of the qdisc, and the ability of the TCP layer
to avoid resending packets which it knows are still in the qdisc, this
can lead to serious traffic "holes" in a TCP connection.

We fix this by replacing the smp_mb__before_atomic() / test_bit() /
set_bit() / smp_mb__after_atomic() sequence inside qdisc_run_begin()
by a single test_and_set_bit() call, which is more concise and
enforces the needed memory barriers.

Fixes: 89837eb4b2 ("net: sched: add barrier to ensure correct ordering for lockless qdisc")
Signed-off-by: Vincent Ray <vray@kalrayinc.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220526001746.2437669-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-26 20:45:46 -07:00
arch Networking changes for 5.19. 2022-05-25 12:22:58 -07:00
block Page cache changes for 5.19 2022-05-24 19:55:07 -07:00
certs certs: Explain the rationale to call panic() 2022-05-23 18:47:49 +03:00
crypto certs: Factor out the blacklist hash creation 2022-05-23 18:47:49 +03:00
Documentation dt-bindings: net: adin: Fix adi,phy-output-clock description syntax 2022-05-25 22:03:45 -07:00
drivers net: lan966x: check devm_of_phy_get() for -EDEFER_PROBE 2022-05-26 20:45:29 -07:00
fs Networking changes for 5.19. 2022-05-25 12:22:58 -07:00
include net: sched: fixed barrier to prevent skbuff sticking in qdisc backlog 2022-05-26 20:45:46 -07:00
init slab changes for 5.19 2022-05-25 10:24:04 -07:00
ipc fs: allocate inode by using alloc_inode_sb() 2022-03-22 15:57:03 -07:00
kernel Networking changes for 5.19. 2022-05-25 12:22:58 -07:00
lib Networking changes for 5.19. 2022-05-25 12:22:58 -07:00
LICENSES LICENSES/LGPL-2.1: Add LGPL-2.1-or-later as valid identifiers 2021-12-16 14:33:10 +01:00
mm linux-kselftest-kunit-5.19-rc1 2022-05-25 11:32:53 -07:00
net netfilter: nft_limit: Clone packet limits' cost value 2022-05-26 22:50:34 +02:00
samples Networking changes for 5.19. 2022-05-25 12:22:58 -07:00
scripts Networking changes for 5.19. 2022-05-25 12:22:58 -07:00
security linux-kselftest-kunit-5.19-rc1 2022-05-25 11:32:53 -07:00
sound xen: branch for v5.19-rc1 2022-05-23 20:49:45 -07:00
tools selftests/net: enable lo.accept_local in psock_snd test 2022-05-25 21:58:35 -07:00
usr Kbuild updates for v5.18 2022-03-31 11:59:03 -07:00
virt KVM: Free new dirty bitmap if creating a new memslot fails 2022-05-20 13:02:05 -04:00
.clang-format genirq/msi: Make interrupt allocation less convoluted 2021-12-16 22:22:20 +01:00
.cocciconfig
.get_maintainer.ignore Opt out of scripts/get_maintainer.pl 2019-05-16 10:53:40 -07:00
.gitattributes .gitattributes: use 'dts' diff driver for dts files 2019-12-04 19:44:11 -08:00
.gitignore .gitignore: ignore only top-level modules.builtin 2021-05-02 00:43:35 +09:00
.mailmap Scheduler changes in this cycle were: 2022-05-24 11:11:13 -07:00
COPYING COPYING: state that all contributions really are covered by this file 2020-02-10 13:32:20 -08:00
CREDITS MAINTAINERS: replace a Microchip AT91 maintainer 2022-02-09 11:30:01 +01:00
Kbuild kbuild: rename hostprogs-y/always to hostprogs/always-y 2020-02-04 01:53:07 +09:00
Kconfig kbuild: ensure full rebuild when the compiler is updated 2020-05-12 13:28:33 +09:00
MAINTAINERS Networking changes for 5.19. 2022-05-25 12:22:58 -07:00
Makefile kernel-hardening updates for v5.19-rc1 2022-05-24 12:27:09 -07:00
README Drop all 00-INDEX files from Documentation/ 2018-09-09 15:08:58 -06:00

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.