linux/net/sched
Yunsheng Lin c4fef01ba4 net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc
Currently pfifo_fast has both TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
flag set, but queue discipline by-pass does not work for lockless
qdisc because skb is always enqueued to qdisc even when the qdisc
is empty, see __dev_xmit_skb().

This patch calls sch_direct_xmit() to transmit the skb directly
to the driver for empty lockless qdisc, which aviod enqueuing
and dequeuing operation.

As qdisc->empty is not reliable to indicate a empty qdisc because
there is a time window between enqueuing and setting qdisc->empty.
So we use the MISSED state added in commit a90c57f2ce ("net:
sched: fix packet stuck problem for lockless qdisc"), which
indicate there is lock contention, suggesting that it is better
not to do the qdisc bypass in order to avoid packet out of order
problem.

In order to make MISSED state reliable to indicate a empty qdisc,
we need to ensure that testing and clearing of MISSED state is
within the protection of qdisc->seqlock, only setting MISSED state
can be done without the protection of qdisc->seqlock. A MISSED
state testing is added without the protection of qdisc->seqlock to
aviod doing unnecessary spin_trylock() for contention case.

As the enqueuing is not within the protection of qdisc->seqlock,
there is still a potential data race as mentioned by Jakub [1]:

      thread1               thread2             thread3
qdisc_run_begin() # true
                        qdisc_run_begin(q)
                             set(MISSED)
pfifo_fast_dequeue
  clear(MISSED)
  # recheck the queue
qdisc_run_end()
                            enqueue skb1
                                             qdisc empty # true
                                          qdisc_run_begin() # true
                                          sch_direct_xmit() # skb2
                         qdisc_run_begin()
                            set(MISSED)

When above happens, skb1 enqueued by thread2 is transmited after
skb2 is transmited by thread3 because MISSED state setting and
enqueuing is not under the qdisc->seqlock. If qdisc bypass is
disabled, skb1 has better chance to be transmited quicker than
skb2.

This patch does not take care of the above data race, because we
view this as similar as below:
Even at the same time CPU1 and CPU2 write the skb to two socket
which both heading to the same qdisc, there is no guarantee that
which skb will hit the qdisc first, because there is a lot of
factor like interrupt/softirq/cache miss/scheduling afffecting
that.

There are below cases that need special handling:
1. When MISSED state is cleared before another round of dequeuing
   in pfifo_fast_dequeue(), and __qdisc_run() might not be able to
   dequeue all skb in one round and call __netif_schedule(), which
   might result in a non-empty qdisc without MISSED set. In order
   to avoid this, the MISSED state is set for lockless qdisc and
   __netif_schedule() will be called at the end of qdisc_run_end.

2. The MISSED state also need to be set for lockless qdisc instead
   of calling __netif_schedule() directly when requeuing a skb for
   a similar reason.

3. For netdev queue stopped case, the MISSED case need clearing
   while the netdev queue is stopped, otherwise there may be
   unnecessary __netif_schedule() calling. So a new DRAINING state
   is added to indicate this case, which also indicate a non-empty
   qdisc.

4. As there is already netif_xmit_frozen_or_stopped() checking in
   dequeue_skb() and sch_direct_xmit(), which are both within the
   protection of qdisc->seqlock, but the same checking in
   __dev_xmit_skb() is without the protection, which might cause
   empty indication of a lockless qdisc to be not reliable. So
   remove the checking in __dev_xmit_skb(), and the checking in
   the protection of qdisc->seqlock seems enough to avoid the cpu
   consumption problem for netdev queue stopped case.

1. https://lkml.org/lkml/2021/5/29/215

Acked-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23 12:17:35 -07:00
..
act_api.c net: sched: fix error return code in tcf_del_walker() 2021-06-17 11:36:18 -07:00
act_bpf.c net: sched: fix misspellings using misspell-fixer tool 2020-11-10 17:00:28 -08:00
act_connmark.c net_sched: defer tcf_idr_insert() in tcf_action_init_1() 2020-09-24 19:46:21 -07:00
act_csum.c net_sched: defer tcf_idr_insert() in tcf_action_init_1() 2020-09-24 19:46:21 -07:00
act_ct.c net/sched: act_ct: handle DNAT tuple collision 2021-06-09 15:34:51 -07:00
act_ctinfo.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-10-05 18:40:01 -07:00
act_gact.c net_sched: defer tcf_idr_insert() in tcf_action_init_1() 2020-09-24 19:46:21 -07:00
act_gate.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-10-05 18:40:01 -07:00
act_ife.c net_sched: defer tcf_idr_insert() in tcf_action_init_1() 2020-09-24 19:46:21 -07:00
act_ipt.c treewide: rename nla_strlcpy to nla_strscpy. 2020-11-16 08:08:54 -08:00
act_meta_mark.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
act_meta_skbprio.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
act_meta_skbtcindex.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
act_mirred.c net/sched: sch_frag: add generic packet fragment support. 2020-11-27 14:36:02 -08:00
act_mpls.c net/sched: act_mpls: ensure LSE is pullable before reading it 2020-12-03 11:13:37 -08:00
act_nat.c net_sched: defer tcf_idr_insert() in tcf_action_init_1() 2020-09-24 19:46:21 -07:00
act_pedit.c net_sched: defer tcf_idr_insert() in tcf_action_init_1() 2020-09-24 19:46:21 -07:00
act_police.c net/sched: act_police: add support for packet-per-second policing 2021-03-13 14:18:09 -08:00
act_sample.c psample: Encapsulate packet metadata in a struct 2021-03-14 15:00:43 -07:00
act_simple.c treewide: rename nla_strlcpy to nla_strscpy. 2020-11-16 08:08:54 -08:00
act_skbedit.c net_sched: defer tcf_idr_insert() in tcf_action_init_1() 2020-09-24 19:46:21 -07:00
act_skbmod.c net_sched: defer tcf_idr_insert() in tcf_action_init_1() 2020-09-24 19:46:21 -07:00
act_tunnel_key.c net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN tunnels 2020-10-20 21:10:41 -07:00
act_vlan.c net/sched: act_vlan: No dump for unset priority 2021-06-01 16:54:42 -07:00
cls_api.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2021-05-27 09:55:10 -07:00
cls_basic.c net_sched: fix ops->bind_class() implementations 2020-01-27 10:51:43 +01:00
cls_bpf.c net_sched: fix ops->bind_class() implementations 2020-01-27 10:51:43 +01:00
cls_cgroup.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
cls_flow.c Remove uninitialized_var() macro for v5.9-rc1 2020-08-04 13:49:43 -07:00
cls_flower.c Revert "net/sched: cls_flower: Remove match on n_proto" 2021-06-21 14:46:36 -07:00
cls_fw.c net_sched: fix ops->bind_class() implementations 2020-01-27 10:51:43 +01:00
cls_matchall.c net: qos offload add flow status with dropped count 2020-06-19 12:53:30 -07:00
cls_route.c net_sched: cls_route: remove the right filter from hashtable 2020-03-16 01:59:32 -07:00
cls_rsvp6.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
cls_rsvp.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
cls_rsvp.h net: sched: Fix spelling mistakes 2021-05-31 22:44:56 -07:00
cls_tcindex.c net_sched: avoid shift-out-of-bounds in tcindex_set_parms() 2021-01-15 18:11:10 -08:00
cls_u32.c net/sched: cls_u32: simplify the return expression of u32_reoffload_knode() 2020-12-08 16:22:53 -08:00
em_canid.c net: sched: kerneldoc fixes 2020-07-13 17:20:40 -07:00
em_cmp.c net: sched: fix misspellings using misspell-fixer tool 2020-11-10 17:00:28 -08:00
em_ipset.c sched: consistently handle layer3 header accesses in the presence of VLANs 2020-07-03 14:34:53 -07:00
em_ipt.c sched: consistently handle layer3 header accesses in the presence of VLANs 2020-07-03 14:34:53 -07:00
em_meta.c sched: consistently handle layer3 header accesses in the presence of VLANs 2020-07-03 14:34:53 -07:00
em_nbyte.c net: sched: Return the correct errno code 2021-02-06 11:15:28 -08:00
em_text.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
em_u32.c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152 2019-05-30 11:26:32 -07:00
ematch.c net: sched: Fix spelling mistakes 2021-05-31 22:44:56 -07:00
Kconfig net: sched: incorrect Kconfig dependencies on Netfilter modules 2020-12-09 15:49:29 -08:00
Makefile net/sched: sch_frag: add generic packet fragment support. 2020-11-27 14:36:02 -08:00
sch_api.c net: sched: avoid duplicates in classes dump 2021-03-04 14:27:47 -08:00
sch_atm.c net: sched: Add extack to Qdisc_class_ops.delete 2021-01-22 20:41:29 -08:00
sch_blackhole.c Revert "net: sched: Pass root lock to Qdisc_ops.enqueue" 2020-07-16 16:48:34 -07:00
sch_cake.c sch_cake: revise docs for RFC 8622 LE PHB support 2021-06-14 12:10:57 -07:00
sch_cbq.c net: sched: Mundane typo fixes 2021-03-24 15:09:11 -07:00
sch_cbs.c net: don't include ethtool.h from netdevice.h 2020-11-23 17:27:04 -08:00
sch_choke.c net: sched: validate stab values 2021-03-10 15:47:52 -08:00
sch_codel.c Revert "net: sched: Pass root lock to Qdisc_ops.enqueue" 2020-07-16 16:48:34 -07:00
sch_drr.c net: sched: Add extack to Qdisc_class_ops.delete 2021-01-22 20:41:29 -08:00
sch_dsmark.c sch_dsmark: fix a NULL deref in qdisc_reset() 2021-05-24 13:11:44 -07:00
sch_etf.c Revert "net: sched: Pass root lock to Qdisc_ops.enqueue" 2020-07-16 16:48:34 -07:00
sch_ets.c Revert "net: sched: Pass root lock to Qdisc_ops.enqueue" 2020-07-16 16:48:34 -07:00
sch_fifo.c Revert "net: sched: Pass root lock to Qdisc_ops.enqueue" 2020-07-16 16:48:34 -07:00
sch_fq_codel.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next 2020-08-05 20:13:21 -07:00
sch_fq_pie.c net/sched: fq_pie: fix OOB access in the traffic path 2021-05-23 17:16:09 -07:00
sch_fq.c Revert "net: sched: Pass root lock to Qdisc_ops.enqueue" 2020-07-16 16:48:34 -07:00
sch_frag.c net/sched: sch_frag: fix stack OOB read while fragmenting IPv4 packets 2021-04-29 15:31:53 -07:00
sch_generic.c net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc 2021-06-23 12:17:35 -07:00
sch_gred.c net: sched: Fix spelling mistakes 2021-05-31 22:44:56 -07:00
sch_hfsc.c net: sched: Add extack to Qdisc_class_ops.delete 2021-01-22 20:41:29 -08:00
sch_hhf.c Revert "net: sched: Pass root lock to Qdisc_ops.enqueue" 2020-07-16 16:48:34 -07:00
sch_htb.c Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net 2021-06-07 13:01:52 -07:00
sch_ingress.c net: sched: Pass ingress block to tcf_classify_ingress 2020-02-19 17:49:48 -08:00
sch_mq.c net: sched: fix dump qlen for sch_mq/sch_mqprio with NOLOCK subqueues 2019-12-03 11:53:55 -08:00
sch_mqprio.c mqprio: Fix out-of-bounds access in mqprio_dump 2019-12-06 11:58:45 -08:00
sch_multiq.c Revert "net: sched: Pass root lock to Qdisc_ops.enqueue" 2020-07-16 16:48:34 -07:00
sch_netem.c netem: fix zero division in tabledist 2020-10-29 11:45:47 -07:00
sch_pie.c net: sched: fix misspellings using misspell-fixer tool 2020-11-10 17:00:28 -08:00
sch_plug.c Revert "net: sched: Pass root lock to Qdisc_ops.enqueue" 2020-07-16 16:48:34 -07:00
sch_prio.c Revert "net: sched: Pass root lock to Qdisc_ops.enqueue" 2020-07-16 16:48:34 -07:00
sch_qfq.c net: sched: Add extack to Qdisc_class_ops.delete 2021-01-22 20:41:29 -08:00
sch_red.c net: sched: validate stab values 2021-03-10 15:47:52 -08:00
sch_sfb.c net: sched: Add extack to Qdisc_class_ops.delete 2021-01-22 20:41:29 -08:00
sch_sfq.c net: sched: validate stab values 2021-03-10 15:47:52 -08:00
sch_skbprio.c Revert "net: sched: Pass root lock to Qdisc_ops.enqueue" 2020-07-16 16:48:34 -07:00
sch_taprio.c net: taprio offload: enforce qdisc to netdev queue mapping 2021-05-13 13:08:00 -07:00
sch_tbf.c Revert "net: sched: Pass root lock to Qdisc_ops.enqueue" 2020-07-16 16:48:34 -07:00
sch_teql.c net: sched: sch_teql: fix null-pointer dereference 2021-04-08 14:14:42 -07:00