Commit Graph

14796 Commits

Author SHA1 Message Date
Archie Pusaka
39bc74ca01 Bluetooth: use inclusive language when tracking connections
This patch replaces some non-inclusive terms based on the appropriate
language mapping table compiled by the Bluetooth SIG:
https://specificationrefs.bluetooth.com/language-mapping/Appropriate_Language_Mapping_Table.pdf

Specifically, these terms are replaced:
master -> central
slave  -> peripheral

Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-06-26 07:12:43 +02:00
Archie Pusaka
fad646e16d Bluetooth: use inclusive language in SMP
This patch replaces some non-inclusive terms based on the appropriate
language mapping table compiled by the Bluetooth SIG:
https://specificationrefs.bluetooth.com/language-mapping/Appropriate_Language_Mapping_Table.pdf

Specifically, these terms are replaced:
master -> initiator
slave  -> responder

Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-06-26 07:12:37 +02:00
Archie Pusaka
ef365da180 Bluetooth: use inclusive language in HCI LE features
This patch replaces some non-inclusive terms based on the appropriate
language mapping table compiled by the Bluetooth SIG:
https://specificationrefs.bluetooth.com/language-mapping/Appropriate_Language_Mapping_Table.pdf

Specifically, these terms are replaced:
master -> central
slave  -> peripheral

Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-06-26 07:12:37 +02:00
Archie Pusaka
6397729bb7 Bluetooth: use inclusive language to describe CPB
This patch replaces some non-inclusive terms based on the appropriate
language mapping table compiled by the Bluetooth SIG:
https://specificationrefs.bluetooth.com/language-mapping/Appropriate_Language_Mapping_Table.pdf

Specifically, these terms are replaced when describing the
connectionless peripheral broadcast feature:
master -> central
slave  -> peripheral

Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-06-26 07:12:36 +02:00
Archie Pusaka
76c185a515 Bluetooth: use inclusive language in hci_core.h
This patch replaces some non-inclusive terms based on the appropriate
language mapping table compiled by the Bluetooth SIG:
https://specificationrefs.bluetooth.com/language-mapping/Appropriate_Language_Mapping_Table.pdf

Specifically, these terms are replaced:
master -> central
slave  -> peripheral

These attributes are not used elsewhere in the code.

Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-06-26 07:12:36 +02:00
Manish Mandlik
de75cd0d9b Bluetooth: Add ncmd=0 recovery handling
During command status or command complete event, the controller may set
ncmd=0 indicating that it is not accepting any more commands. In such a
case, host holds off sending any more commands to the controller. If the
controller doesn't recover from such condition, host will wait forever,
until the user decides that the Bluetooth is broken and may power cycles
the Bluetooth.

This patch triggers the hardware error to reset the controller and
driver when it gets into such state as there is no other wat out.

Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Manish Mandlik <mmandlik@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-06-26 07:12:27 +02:00
Yu Liu
1c6ed31b16 Bluetooth: Return whether a connection is outbound
When an MGMT_EV_DEVICE_CONNECTED event is reported back to the user
space we will set the flags to tell if the established connection is
outbound or not. This is useful for the user space to log better metrics
and error messages.

Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Yu Liu <yudiliu@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-06-26 07:12:23 +02:00
David S. Miller
4e3db44a24 wireless-drivers-next patches for v5.14
Second, and most likely the last, set of patches for v5.14. mt76 and
 iwlwifi have most patches in this round, but rtw88 also has some new
 features. Nothing special really standing out.
 
 mt76
 
 * mt7915 MSI support
 
 * disable ASPM on mt7915
 
 * mt7915 tx status reporting
 
 * mt7921 decap offload
 
 rtw88
 
 * beacon filter support
 
 * path diversity support
 
 * firmware crash information via devcoredump
 
 * quirks for disabling pci capabilities
 
 mt7601u
 
 * add USB ID for a XiaoDu WiFi Dongle
 
 ath11k
 
 * enable support for QCN9074 PCI devices
 
 brcmfmac
 
 * support parse country code map from DeviceTree
 
 iwlwifi
 
 * support for new hardware
 
 * support for BIOS control of 11ax enablement in Russia
 
 * support UNII4 band enablement from BIOS
 -----BEGIN PGP SIGNATURE-----
 
 iQFJBAABCgAzFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmDVyOMVHGt2YWxvQGNv
 ZGVhdXJvcmEub3JnAAoJEG4XJFUm622bXtsH/RB2SKxPY4Rgb7+afjpJIeGDBq8T
 7p5/Xe/PaKMIuusFuJ1gYbTShm5FjdA16F7ZMYSgzpf2b9XBv6hDw7xsPW3qPPtJ
 MJynfnyvpNH9/NWcSrqOMLeFGv1ot5861BLESCXIaWNud+8OMUSw+qf8P6Z80RCD
 tasOayIJVQ/6xLpzbqYoLXfXGeIwlWTiQnQROA8Joq7vQziHSnHTqC7RyTF6SrFn
 AUnGJ10t62hsGt64tt1QeaCV65SCTPjxI2KKpUpvhD/yQtdcEq2f+1vCfnCk+NFz
 VWxCAI6U7s6gf0Lu4/+cJhHt/ZCtLZ2d06g93sfplSWHBImYxG55cO3tHzU=
 =pRFB
 -----END PGP SIGNATURE-----

Merge tag 'wireless-drivers-next-2021-06-25' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next

Kalle Valo says:

====================
wireless-drivers-next patches for v5.14

Second, and most likely the last, set of patches for v5.14. mt76 and
iwlwifi have most patches in this round, but rtw88 also has some new
features. Nothing special really standing out.

mt76

* mt7915 MSI support

* disable ASPM on mt7915

* mt7915 tx status reporting

* mt7921 decap offload

rtw88

* beacon filter support

* path diversity support

* firmware crash information via devcoredump

* quirks for disabling pci capabilities

mt7601u

* add USB ID for a XiaoDu WiFi Dongle

ath11k

* enable support for QCN9074 PCI devices

brcmfmac

* support parse country code map from DeviceTree

iwlwifi

* support for new hardware

* support for BIOS control of 11ax enablement in Russia

* support UNII4 band enablement from BIOS
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-25 11:50:02 -07:00
Xin Long
0dac127c05 sctp: do black hole detection in search complete state
Currently the PLPMUTD probe will stop for a long period (interval * 30)
after it enters search complete state. If there's a pmtu change on the
route path, it takes a long time to be aware if the ICMP TooBig packet
is lost or filtered.

As it says in rfc8899#section-4.3:

  "A DPLPMTUD method MUST NOT rely solely on this method."
  (ICMP PTB message).

This patch is to enable the other method for search complete state:

  "A PL can use the DPLPMTUD probing mechanism to periodically
   generate probe packets of the size of the current PLPMTU."

With this patch, the probe will continue with the current pmtu every
'interval' until the PMTU_RAISE_TIMER 'timeout', which we implement
by adding raise_count to raise the probe size when it counts to 30
and removing the SCTP_PL_COMPLETE check for PMTU_RAISE_TIMER.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-24 12:58:03 -07:00
Kuniyuki Iwashima
10ed7ce42b net/tls: Remove the __TLS_DEC_STATS() macro.
The commit d26b698dd3 ("net/tls: add skeleton of MIB statistics")
introduced __TLS_DEC_STATS(), but it is not used and __SNMP_DEC_STATS() is
not defined also. Let's remove it.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23 13:47:55 -07:00
David S. Miller
fe87797bf2 mlx5-net-next-2021-06-22
1) Various minor cleanups and fixes from net-next branch
 2) Optimize mlx5 feature check on tx and
    a fix to allow Vxlan with Ipsec offloads
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmDSYyYACgkQSD+KveBX
 +j45Bgf+M7Vg3OK4fA7ZURrbxu2EiCQB4XprO/it7YSzpiZx634DNNRzWXQ2mLJD
 jx5TcmAFVUKkGmx/qrPYe/Y9c9l5s6JMjwACL5aEawXtvPzI/q1KBx/n5L+CoFYw
 lO1IpBPpkwqIbeIl9cwata7IJ6aTeOGjqfQ//Fodfwga063Aaggig6sEcPkr0Ewe
 wcuJmblnw/qOkSI2BlSOyixYiVjPRDF7cAVRTBK4/DCDFiGJTiaj8w0JgfdS2zVs
 3xMrYajdz7qArMcqbuQe59KojeYj4hxALUSs+s9ks8qeIrbM+hZ9sH5m0dJgM4P6
 7Pg87LD6PFDV31DWF0XM3KgdqfiZxw==
 =aszX
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-net-next-2021-06-22' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-net-next-2021-06-22

1) Various minor cleanups and fixes from net-next branch
2) Optimize mlx5 feature check on tx and
   a fix to allow Vxlan with Ipsec offloads
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23 12:48:07 -07:00
David S. Miller
a7b62112f0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Skip non-SCTP packets in the new SCTP chunk support for nft_exthdr,
   from Phil Sutter.

2) Simplify TCP option sanity check for TCP packets, also from Phil.

3) Add a new expression to store when the rule has been used last time.

4) Pass the hook state object to log function, from Florian Westphal.

5) Document the new sysctl knobs to tune the flowtable timeouts,
   from Oz Shlomo.

6) Fix snprintf error check in the new nfnetlink_hook infrastructure,
   from Dan Carpenter.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23 12:31:28 -07:00
Yunsheng Lin
d3e0f57501 net: sched: remove qdisc->empty for lockless qdisc
As MISSED and DRAINING state are used to indicate a non-empty
qdisc, qdisc->empty is not longer needed, so remove it.

Acked-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23 12:17:35 -07:00
Yunsheng Lin
c4fef01ba4 net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc
Currently pfifo_fast has both TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
flag set, but queue discipline by-pass does not work for lockless
qdisc because skb is always enqueued to qdisc even when the qdisc
is empty, see __dev_xmit_skb().

This patch calls sch_direct_xmit() to transmit the skb directly
to the driver for empty lockless qdisc, which aviod enqueuing
and dequeuing operation.

As qdisc->empty is not reliable to indicate a empty qdisc because
there is a time window between enqueuing and setting qdisc->empty.
So we use the MISSED state added in commit a90c57f2ce ("net:
sched: fix packet stuck problem for lockless qdisc"), which
indicate there is lock contention, suggesting that it is better
not to do the qdisc bypass in order to avoid packet out of order
problem.

In order to make MISSED state reliable to indicate a empty qdisc,
we need to ensure that testing and clearing of MISSED state is
within the protection of qdisc->seqlock, only setting MISSED state
can be done without the protection of qdisc->seqlock. A MISSED
state testing is added without the protection of qdisc->seqlock to
aviod doing unnecessary spin_trylock() for contention case.

As the enqueuing is not within the protection of qdisc->seqlock,
there is still a potential data race as mentioned by Jakub [1]:

      thread1               thread2             thread3
qdisc_run_begin() # true
                        qdisc_run_begin(q)
                             set(MISSED)
pfifo_fast_dequeue
  clear(MISSED)
  # recheck the queue
qdisc_run_end()
                            enqueue skb1
                                             qdisc empty # true
                                          qdisc_run_begin() # true
                                          sch_direct_xmit() # skb2
                         qdisc_run_begin()
                            set(MISSED)

When above happens, skb1 enqueued by thread2 is transmited after
skb2 is transmited by thread3 because MISSED state setting and
enqueuing is not under the qdisc->seqlock. If qdisc bypass is
disabled, skb1 has better chance to be transmited quicker than
skb2.

This patch does not take care of the above data race, because we
view this as similar as below:
Even at the same time CPU1 and CPU2 write the skb to two socket
which both heading to the same qdisc, there is no guarantee that
which skb will hit the qdisc first, because there is a lot of
factor like interrupt/softirq/cache miss/scheduling afffecting
that.

There are below cases that need special handling:
1. When MISSED state is cleared before another round of dequeuing
   in pfifo_fast_dequeue(), and __qdisc_run() might not be able to
   dequeue all skb in one round and call __netif_schedule(), which
   might result in a non-empty qdisc without MISSED set. In order
   to avoid this, the MISSED state is set for lockless qdisc and
   __netif_schedule() will be called at the end of qdisc_run_end.

2. The MISSED state also need to be set for lockless qdisc instead
   of calling __netif_schedule() directly when requeuing a skb for
   a similar reason.

3. For netdev queue stopped case, the MISSED case need clearing
   while the netdev queue is stopped, otherwise there may be
   unnecessary __netif_schedule() calling. So a new DRAINING state
   is added to indicate this case, which also indicate a non-empty
   qdisc.

4. As there is already netif_xmit_frozen_or_stopped() checking in
   dequeue_skb() and sch_direct_xmit(), which are both within the
   protection of qdisc->seqlock, but the same checking in
   __dev_xmit_skb() is without the protection, which might cause
   empty indication of a lockless qdisc to be not reliable. So
   remove the checking in __dev_xmit_skb(), and the checking in
   the protection of qdisc->seqlock seems enough to avoid the cpu
   consumption problem for netdev queue stopped case.

1. https://lkml.org/lkml/2021/5/29/215

Acked-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23 12:17:35 -07:00
Yunsheng Lin
dd25296afa net: sched: avoid unnecessary seqcount operation for lockless qdisc
qdisc->running seqcount operation is mainly used to do heuristic
locking on q->busylock for locked qdisc, see qdisc_is_running()
and __dev_xmit_skb().

So avoid doing seqcount operation for qdisc with TCQ_F_NOLOCK
flag.

Acked-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23 12:17:35 -07:00
Dmitry Osipenko
761025b51c cfg80211: Add wiphy_info_once()
Add wiphy_info_once() helper that prints info message only once.

Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/20210511211549.30571-1-digetx@gmail.com
2021-06-23 20:44:25 +03:00
Huy Nguyen
fa4535238f net/xfrm: Add inner_ipproto into sec_path
The inner_ipproto saves the inner IP protocol of the plain
text packet. This allows vendor's IPsec feature making offload
decision at skb's features_check and configuring hardware at
ndo_start_xmit.

For example, ConnectX6-DX IPsec device needs the plaintext's
IP protocol to support partial checksum offload on
VXLAN/GENEVE packet over IPsec transport mode tunnel.

Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Huy Nguyen <huyn@nvidia.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-06-22 15:24:32 -07:00
Geliang Tang
bab6b88e05 mptcp: add allow_join_id0 in mptcp_out_options
This patch defined a new flag MPTCP_CAP_DENY_JOIN_ID0 for the third bit,
labeled "C" of the MP_CAPABLE option.

Add a new flag allow_join_id0 in struct mptcp_out_options. If this flag is
set, send out the MP_CAPABLE option with the flag MPTCP_CAP_DENY_JOIN_ID0.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-22 14:36:01 -07:00
Xin Long
9e47df005c sctp: process sctp over udp icmp err on sctp side
Previously, sctp over udp was using udp tunnel's icmp err process, which
only does sk lookup on sctp side. However for sctp's icmp error process,
there are more things to do, like syncing assoc pmtu/retransmit packets
for toobig type err, and starting proto_unreach_timer for unreach type
err etc.

Now after adding PLPMTUD, which also requires to process toobig type err
on sctp side. This patch is to process icmp err on sctp side by parsing
the type/code/info in .encap_err_lookup and call sctp's icmp processing
functions. Note as the 'redirect' err process needs to know the outer
ip(v6) header's, we have to leave it to udp(v6)_err to handle it.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-22 11:28:52 -07:00
Xin Long
b87641aff9 sctp: do state transition when a probe succeeds on HB ACK recv path
As described in rfc8899#section-5.2, when a probe succeeds, there might
be the following state transitions:

  - Base -> Search, occurs when probe succeeds with BASE_PLPMTU,
    pl.pmtu is not changing,
    pl.probe_size increases by SCTP_PL_BIG_STEP,

  - Error -> Search, occurs when probe succeeds with BASE_PLPMTU,
    pl.pmtu is changed from SCTP_MIN_PLPMTU to SCTP_BASE_PLPMTU,
    pl.probe_size increases by SCTP_PL_BIG_STEP.

  - Search -> Search Complete, occurs when probe succeeds with the probe
    size SCTP_MAX_PLPMTU less than pl.probe_high,
    pl.pmtu is not changing, but update *pathmtu* with it,
    pl.probe_size is set back to pl.pmtu to double check it.

  - Search Complete -> Search, occurs when probe succeeds with the probe
    size equal to pl.pmtu,
    pl.pmtu is not changing,
    pl.probe_size increases by SCTP_PL_MIN_STEP.

So search process can be described as:

 1. When it just enters 'Search' state, *pathmtu* is not updated with
    pl.pmtu, and probe_size increases by a big step (SCTP_PL_BIG_STEP)
    each round.

 2. Until pl.probe_high is set when a probe fails, and probe_size
    decreases back to pl.pmtu, as described in the last patch.

 3. When the probe with the new size succeeds, probe_size changes to
    increase by a small step (SCTP_PL_MIN_STEP) due to pl.probe_high
    is set.

 4. Until probe_size is next to pl.probe_high, the searching finishes and
    it goes to 'Complete' state and updates *pathmtu* with pl.pmtu, and
    then probe_size is set to pl.pmtu to confirm by once more probe.

 5. This probe occurs after "30 * probe_inteval", a much longer time than
    that in Search state. Once it is done it goes to 'Search' state again
    with probe_size increased by SCTP_PL_MIN_STEP.

As we can see above, during the searching, pl.pmtu changes while *pathmtu*
doesn't. *pathmtu* is only updated when the search finishes by which it
gets an optimal value for it. A big step is used at the beginning until
it gets close to the optimal value, then it changes to a small step until
it has this optimal value.

The small step is also used in 'Complete' until it goes to 'Search' state
again and the probe with 'pmtu + the small step' succeeds, which means a
higher size could be used. Then probe_size changes to increase by a big
step again until it gets close to the next optimal value.

Note that anytime when black hole is detected, it goes directly to 'Base'
state with pl.pmtu set to SCTP_BASE_PLPMTU, as described in the last patch.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-22 11:28:52 -07:00
Xin Long
1dc68c1945 sctp: do state transition when PROBE_COUNT == MAX_PROBES on HB send path
The state transition is described in rfc8899#section-5.2,
PROBE_COUNT == MAX_PROBES means the probe fails for MAX times, and the
state transition includes:

  - Base -> Error, occurs when BASE_PLPMTU Confirmation Fails,
    pl.pmtu is set to SCTP_MIN_PLPMTU,
    probe_size is still SCTP_BASE_PLPMTU;

  - Search -> Base, occurs when Black Hole Detected,
    pl.pmtu is set to SCTP_BASE_PLPMTU,
    probe_size is set back to SCTP_BASE_PLPMTU;

  - Search Complete -> Base, occurs when Black Hole Detected
    pl.pmtu is set to SCTP_BASE_PLPMTU,
    probe_size is set back to SCTP_BASE_PLPMTU;

Note a black hole is encountered when a sender is unaware that packets
are not being delivered to the destination endpoint. So it includes the
probe failures with equal probe_size to pl.pmtu, and definitely not
include that with greater probe_size than pl.pmtu. The later one is the
normal probe failure where probe_size should decrease back to pl.pmtu
and pl.probe_high is set.  pl.probe_high would be used on HB ACK recv
path in the next patch.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-22 11:28:52 -07:00
Xin Long
fe59379b9a sctp: do the basic send and recv for PLPMTUD probe
This patch does exactly what rfc8899#section-6.2.1.2 says:

   The SCTP sender needs to be able to determine the total size of a
   probe packet.  The HEARTBEAT chunk could carry a Heartbeat
   Information parameter that includes, besides the information
   suggested in [RFC4960], the probe size to help an implementation
   associate a HEARTBEAT ACK with the size of probe that was sent.  The
   sender could also use other methods, such as sending a nonce and
   verifying the information returned also contains the corresponding
   nonce.  The length of the PAD chunk is computed by reducing the
   probing size by the size of the SCTP common header and the HEARTBEAT
   chunk.

Note that HB ACK chunk will carry back whatever HB chunk carried, including
the probe_size we put it in; We also check hbinfo->probe_size in the HB ACK
against link->pl.probe_size to validate this HB ACK chunk.

v1->v2:
  - Remove the unused 'sp' and add static for sctp_packet_bundle_pad().

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-22 11:28:52 -07:00
Xin Long
92548ec2f1 sctp: add the probe timer in transport for PLPMTUD
There are 3 timers described in rfc8899#section-5.1.1:

  PROBE_TIMER, PMTU_RAISE_TIMER, CONFIRMATION_TIMER

This patches adds a 'probe_timer' in transport, and it works as either
PROBE_TIMER or PMTU_RAISE_TIMER. At most time, it works as PROBE_TIMER
and expires every a 'probe_interval' time to send the HB probe packet.
When transport pl enters COMPLETE state, it works as PMTU_RAISE_TIMER
and expires in 'probe_interval * 30' time to go back to SEARCH state
and do searching again.

SCTP HB is an acknowledged packet, CONFIRMATION_TIMER is not needed.

The timer will start when transport pl enters BASE state and stop
when it enters DISABLED state.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-22 11:28:52 -07:00
Xin Long
d9e2e410ae sctp: add the constants/variables and states and some APIs for transport
These are 4 constants described in rfc8899#section-5.1.2:

  MAX_PROBES, MIN_PLPMTU, MAX_PLPMTU, BASE_PLPMTU;

And 2 variables described in rfc8899#section-5.1.3:

  PROBED_SIZE, PROBE_COUNT;

And 5 states described in rfc8899#section-5.2:

  DISABLED, BASE, SEARCH, SEARCH_COMPLETE, ERROR;

And these 4 APIs are used to reset/update PLPMTUD, check if PLPMTUD is
enabled, and calculate the additional headers length for a transport.

Note the member 'probe_high' in transport will be set to the probe
size when a probe fails with this probe size in the next patches.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-22 11:28:52 -07:00
Xin Long
d1e462a7a5 sctp: add probe_interval in sysctl and sock/asoc/transport
PLPMTUD can be enabled by doing 'sysctl -w net.sctp.probe_interval=n'.
'n' is the interval for PLPMTUD probe timer in milliseconds, and it
can't be less than 5000 if it's not 0.

All asoc/transport's PLPMTUD in a new socket will be enabled by default.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-22 11:28:51 -07:00
Xin Long
745a32117b sctp: add pad chunk and its make function and event table
This chunk is defined in rfc4820#section-3, and used to pad an
SCTP packet. The receiver must discard this chunk and continue
processing the rest of the chunks in the packet.

Add it now, as it will be bundled with a heartbeat chunk to probe
pmtu in the following patches.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-22 11:28:51 -07:00
Vladimir Oltean
a8986681cc net: dsa: export the dsa_port_is_{user,cpu,dsa} helpers
The difference between dsa_is_user_port and dsa_port_is_user is that the
former needs to look up the list of ports of the DSA switch tree in
order to find the struct dsa_port, while the latter directly receives it
as an argument.

dsa_is_user_port is already in widespread use and has its place, so
there isn't any chance of converting all callers to a single form.
But being able to do:
	dsa_port_is_user(dp)
instead of
	dsa_is_user_port(dp->ds, dp->index)

is much more efficient too, especially when the "dp" comes from an
iterator over the DSA switch tree - this reduces the complexity from
quadratic to linear.

Move these helpers from dsa2.c to include/net/dsa.h so that others can
use them too.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21 12:50:20 -07:00
Jakub Kicinski
adc2e56ebe Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Trivial conflicts in net/can/isotp.c and
tools/testing/selftests/net/mptcp/mptcp_connect.sh

scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c
to include/linux/ptp_clock_kernel.h in -next so re-apply
the fix there.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-18 19:47:02 -07:00
Geliang Tang
208e8f6692 mptcp: receive checksum for MP_CAPABLE with data
This patch added a new member named csum in struct mptcp_options_received.

When parsing the MP_CAPABLE with data, if the checksum is enabled,
adjust the expected_opsize. If the receiving option length matches the
length with the data checksum, get the checksum value and save it in
mp_opt->csum. And in mptcp_incoming_options, pass it to mpext->csum.

We always parse any csum/nocsum combination and delay the presence check
to later code, to allow reset if missing.

Additionally, in the TX path, use the newly introduce ext field to avoid
MPTCP csum recomputation on TCP retransmission and unneeded csum update
on when setting the data fin_flag.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18 11:40:11 -07:00
Geliang Tang
06fe1719aa mptcp: add csum_reqd in mptcp_out_options
This patch added a new member csum_reqd in struct mptcp_out_options and
struct mptcp_subflow_request_sock. Initialized it with the helper
function mptcp_is_checksum_enabled().

In mptcp_write_options, if this field is enabled, send out the MP_CAPABLE
suboption with the MPTCP_CAP_CHECKSUM_REQD flag.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18 11:40:11 -07:00
Geliang Tang
d0cc298745 mptcp: generate the data checksum
This patch added a new member named csum in struct mptcp_ext, implemented
a new function named mptcp_generate_data_checksum().

Generate the data checksum in mptcp_sendmsg_frag, save it in mpext->csum.

Note that we must generate the csum for zero window probe, too.

Do the csum update incrementally, to avoid multiple csum computation
when the data is appended to existing skb.

Note that in a later patch we will skip unneeded csum related operation.
Changes not included here to keep the delta small.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-18 11:40:11 -07:00
Florian Westphal
62eec0d733 netfilter: conntrack: pass hook state to log functions
The packet logger backend is unable to provide the incoming (or
outgoing) interface name because that information isn't available.

Pass the hook state, it contains the network namespace, the protocol
family, the network interfaces and other things.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-06-18 14:47:43 +02:00
David S. Miller
a52171ae7b Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2021-06-17

The following pull-request contains BPF updates for your *net-next* tree.

We've added 50 non-merge commits during the last 25 day(s) which contain
a total of 148 files changed, 4779 insertions(+), 1248 deletions(-).

The main changes are:

1) BPF infrastructure to migrate TCP child sockets from a listener to another
   in the same reuseport group/map, from Kuniyuki Iwashima.

2) Add a provably sound, faster and more precise algorithm for tnum_mul() as
   noted in https://arxiv.org/abs/2105.05398, from Harishankar Vishwanathan.

3) Streamline error reporting changes in libbpf as planned out in the
   'libbpf: the road to v1.0' effort, from Andrii Nakryiko.

4) Add broadcast support to xdp_redirect_map(), from Hangbin Liu.

5) Extends bpf_map_lookup_and_delete_elem() functionality to 4 more map
   types, that is, {LRU_,PERCPU_,LRU_PERCPU_,}HASH, from Denis Salopek.

6) Support new LLVM relocations in libbpf to make them more linker friendly,
   also add a doc to describe the BPF backend relocations, from Yonghong Song.

7) Silence long standing KUBSAN complaints on register-based shifts in
   interpreter, from Daniel Borkmann and Eric Biggers.

8) Add dummy PT_REGS macros in libbpf to fail BPF program compilation when
   target arch cannot be determined, from Lorenz Bauer.

9) Extend AF_XDP to support large umems with 1M+ pages, from Magnus Karlsson.

10) Fix two minor libbpf tc BPF API issues, from Kumar Kartikeya Dwivedi.

11) Move libbpf BPF_SEQ_PRINTF/BPF_SNPRINTF macros that can be used by BPF
    programs to bpf_helpers.h header, from Florent Revest.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-17 11:54:56 -07:00
Pablo Neira Ayuso
836382dc24 netfilter: nf_tables: add last expression
Add a new optional expression that tells you when last matching on a
given rule / set element element has happened.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-06-17 03:23:00 +02:00
Guvenc Gulce
194730a9be net/smc: Make SMC statistics network namespace aware
Make the gathered SMC statistics network namespace aware, for each
namespace collect an own set of statistic information.

Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-16 12:54:02 -07:00
Jussi Maki
848ca9182a net: bonding: Use per-cpu rr_tx_counter
The round-robin rr_tx_counter was shared across CPUs leading to
significant cache thrashing at high packet rates. This patch switches
the round-robin packet counter to use a per-cpu variable to decide
the destination slave.

On a test with 2x100Gbit ICE nic with pktgen_sample_04_many_flows.sh
(-s 64 -t 32) the tx rate was 19.6Mpps before and 22.3Mpps after
this patch.

"perf top -e cache_misses" before:
    12.31%  [bonding]       [k] bond_xmit_roundrobin_slave_get
    10.59%  [sch_fq_codel]  [k] fq_codel_dequeue
     9.34%  [kernel]        [k] skb_release_data
after:
    15.42%  [sch_fq_codel]  [k] fq_codel_dequeue
    10.06%  [kernel]        [k] __memset
     9.12%  [kernel]        [k] skb_release_data

Signed-off-by: Jussi Maki <joamaki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-15 11:26:15 -07:00
Changbin Du
e34492dea6 net: inline function get_net_ns_by_fd if NET_NS is disabled
The function get_net_ns_by_fd() could be inlined when NET_NS is not
enabled.

Signed-off-by: Changbin Du <changbin.du@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-15 11:00:45 -07:00
Kuniyuki Iwashima
1cd62c2157 tcp: Add reuseport_migrate_sock() to select a new listener.
reuseport_migrate_sock() does the same check done in
reuseport_listen_stop_sock(). If the reuseport group is capable of
migration, reuseport_migrate_sock() selects a new listener by the child
socket hash and increments the listener's sk_refcnt beforehand. Thus, if we
fail in the migration, we have to decrement it later.

We will support migration by eBPF in the later commits.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-5-kuniyu@amazon.co.jp
2021-06-15 18:01:05 +02:00
Kuniyuki Iwashima
333bb73f62 tcp: Keep TCP_CLOSE sockets in the reuseport group.
When we close a listening socket, to migrate its connections to another
listener in the same reuseport group, we have to handle two kinds of child
sockets. One is that a listening socket has a reference to, and the other
is not.

The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the
accept queue of their listening socket. So we can pop them out and push
them into another listener's queue at close() or shutdown() syscalls. On
the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the
three-way handshake and not in the accept queue. Thus, we cannot access
such sockets at close() or shutdown() syscalls. Accordingly, we have to
migrate immature sockets after their listening socket has been closed.

Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV
sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At
that time, if we could select a new listener from the same reuseport group,
no connection would be aborted. However, we cannot do that because
reuseport_detach_sock() sets NULL to sk_reuseport_cb and forbids access to
the reuseport group from closed sockets.

This patch allows TCP_CLOSE sockets to remain in the reuseport group and
access it while any child socket references them. The point is that
reuseport_detach_sock() was called twice from inet_unhash() and
sk_destruct(). This patch replaces the first reuseport_detach_sock() with
reuseport_stop_listen_sock(), which checks if the reuseport group is
capable of migration. If capable, it decrements num_socks, moves the socket
backwards in socks[] and increments num_closed_socks. When all connections
are migrated, sk_destruct() calls reuseport_detach_sock() to remove the
socket from socks[], decrement num_closed_socks, and set NULL to
sk_reuseport_cb.

By this change, closed or shutdowned sockets can keep sk_reuseport_cb.
Consequently, calling listen() after shutdown() can cause EADDRINUSE or
EBUSY in inet_csk_bind_conflict() or reuseport_add_sock() which expects
such sockets not to have the reuseport group. Therefore, this patch also
loosens such validation rules so that a socket can listen again if it has a
reuseport group with num_closed_socks more than 0.

When such sockets listen again, we handle them in reuseport_resurrect(). If
there is an existing reuseport group (reuseport_add_sock() path), we move
the socket from the old group to the new one and free the old one if
necessary. If there is no existing group (reuseport_alloc() path), we
allocate a new reuseport group, detach sk from the old one, and free it if
necessary, not to break the current shutdown behaviour:

  - we cannot carry over the eBPF prog of shutdowned sockets
  - we cannot attach/detach an eBPF prog to/from listening sockets via
    shutdowned sockets

Note that when the number of sockets gets over U16_MAX, we try to detach a
closed socket randomly to make room for the new listening socket in
reuseport_grow().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-4-kuniyu@amazon.co.jp
2021-06-15 18:01:05 +02:00
Kuniyuki Iwashima
5c040eaf5d tcp: Add num_closed_socks to struct sock_reuseport.
As noted in the following commit, a closed listener has to hold the
reference to the reuseport group for socket migration. This patch adds a
field (num_closed_socks) to struct sock_reuseport to manage closed sockets
within the same reuseport group. Moreover, this and the following commits
introduce some helper functions to split socks[] into two sections and keep
TCP_LISTEN and TCP_CLOSE sockets in each section. Like a double-ended
queue, we will place TCP_LISTEN sockets from the front and TCP_CLOSE
sockets from the end.

  TCP_LISTEN---------->       <-------TCP_CLOSE
  +---+---+  ---  +---+  ---  +---+  ---  +---+
  | 0 | 1 |  ...  | i |  ...  | j |  ...  | k |
  +---+---+  ---  +---+  ---  +---+  ---  +---+

  i = num_socks - 1
  j = max_socks - num_closed_socks
  k = max_socks - 1

This patch also extends reuseport_add_sock() and reuseport_grow() to
support num_closed_socks.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-3-kuniyu@amazon.co.jp
2021-06-15 18:01:05 +02:00
Kuniyuki Iwashima
f9ac779f88 net: Introduce net.ipv4.tcp_migrate_req.
This commit adds a new sysctl option: net.ipv4.tcp_migrate_req. If this
option is enabled or eBPF program is attached, we will be able to migrate
child sockets from a listener to another in the same reuseport group after
close() or shutdown() syscalls.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210612123224.12525-2-kuniyu@amazon.co.jp
2021-06-15 18:01:05 +02:00
Oleksandr Mazur
ddee9dbc3d net: core: devlink: add dropped stats traps field
Whenever query statistics is issued for trap, devlink subsystem
would also fill-in statistics 'dropped' field. This field indicates
the number of packets HW dropped and failed to report to the device driver,
and thus - to the devlink subsystem itself.
In case if device driver didn't register callback for hard drop
statistics querying, 'dropped' field will be omitted and not filled.

Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-14 13:04:25 -07:00
Johannes Berg
8c713dc93c rtnetlink: add alloc() method to rtnl_link_ops
In order to make rtnetlink ops that can create different
kinds of devices, like what we want to add to the WWAN
framework, the priv_size and setup parameters aren't quite
sufficient. Make this easier to manage by allowing ops to
allocate their own netdev via an @alloc method that gets
the tb netlink data.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-12 13:16:45 -07:00
Changbin Du
ea6932d70e net: make get_net_ns return error if NET_NS is disabled
There is a panic in socket ioctl cmd SIOCGSKNS when NET_NS is not enabled.
The reason is that nsfs tries to access ns->ops but the proc_ns_operations
is not implemented in this case.

[7.670023] Unable to handle kernel NULL pointer dereference at virtual address 00000010
[7.670268] pgd = 32b54000
[7.670544] [00000010] *pgd=00000000
[7.671861] Internal error: Oops: 5 [#1] SMP ARM
[7.672315] Modules linked in:
[7.672918] CPU: 0 PID: 1 Comm: systemd Not tainted 5.13.0-rc3-00375-g6799d4f2da49 #16
[7.673309] Hardware name: Generic DT based system
[7.673642] PC is at nsfs_evict+0x24/0x30
[7.674486] LR is at clear_inode+0x20/0x9c

The same to tun SIOCGSKNS command.

To fix this problem, we make get_net_ns() return -EINVAL when NET_NS is
disabled. Meanwhile move it to right place net/core/net_namespace.c.

Signed-off-by: Changbin Du <changbin.du@gmail.com>
Fixes: c62cce2cae ("net: add an ioctl to get a socket network namespace")
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-12 13:13:08 -07:00
Arseny Krasnov
0798e78b10 af_vsock: rest of SEQPACKET support
Add socket ops for SEQPACKET type and .seqpacket_allow() callback
to query transports if they support SEQPACKET. Also split path
for data check for STREAM and SEQPACKET branches.

Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11 13:32:46 -07:00
Arseny Krasnov
fbe70c4807 af_vsock: implement send logic for SEQPACKET
Update current stream enqueue function for SEQPACKET
support:
1) Call transport's seqpacket enqueue callback.
2) Return value from enqueue function is whole record length or error
   for SOCK_SEQPACKET.

Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11 13:32:46 -07:00
Arseny Krasnov
9942c192b2 af_vsock: implement SEQPACKET receive loop
Add receive loop for SEQPACKET. It looks like receive loop for
STREAM, but there are differences:
1) It doesn't call notify callbacks.
2) It doesn't care about 'SO_SNDLOWAT' and 'SO_RCVLOWAT' values, because
   there is no sense for these values in SEQPACKET case.
3) It waits until whole record is received.
4) It processes and sets 'MSG_TRUNC' flag.

So to avoid extra conditions for two types of socket inside one loop, two
independent functions were created.

Signed-off-by: Arseny Krasnov <arseny.krasnov@kaspersky.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11 13:32:46 -07:00
Vladimir Oltean
4913b8ebf8 net: dsa: add support for the SJA1110 native tagging protocol
The SJA1110 has improved a few things compared to SJA1105:

- To send a control packet from the host port with SJA1105, one needed
  to program a one-shot "management route" over SPI. This is no longer
  true with SJA1110, you can actually send "in-band control extensions"
  in the packets sent by DSA, these are in fact DSA tags which contain
  the destination port and switch ID.

- When receiving a control packet from the switch with SJA1105, the
  source port and switch ID were written in bytes 3 and 4 of the
  destination MAC address of the frame (which was a very poor shot at a
  DSA header). If the control packet also had an RX timestamp, that
  timestamp was sent in an actual follow-up packet, so there were
  reordering concerns on multi-core/multi-queue DSA masters, where the
  metadata frame with the RX timestamp might get processed before the
  actual packet to which that timestamp belonged (there is no way to
  pair a packet to its timestamp other than the order in which they were
  received). On SJA1110, this is no longer true, control packets have
  the source port, switch ID and timestamp all in the DSA tags.

- Timestamps from the switch were partial: to get a 64-bit timestamp as
  required by PTP stacks, one would need to take the partial 24-bit or
  32-bit timestamp from the packet, then read the current PTP time very
  quickly, and then patch in the high bits of the current PTP time into
  the captured partial timestamp, to reconstruct what the full 64-bit
  timestamp must have been. That is awful because packet processing is
  done in NAPI context, but reading the current PTP time is done over
  SPI and therefore needs sleepable context.

But it also aggravated a few things:

- Not only is there a DSA header in SJA1110, but there is a DSA trailer
  in fact, too. So DSA needs to be extended to support taggers which
  have both a header and a trailer. Very unconventional - my understanding
  is that the trailer exists because the timestamps couldn't be prepared
  in time for putting them in the header area.

- Like SJA1105, not all packets sent to the CPU have the DSA tag added
  to them, only control packets do:

  * the ones which match the destination MAC filters/traps in
    MAC_FLTRES1 and MAC_FLTRES0
  * the ones which match FDB entries which have TRAP or TAKETS bits set

  So we could in theory hack something up to request the switch to take
  timestamps for all packets that reach the CPU, and those would be
  DSA-tagged and contain the source port / switch ID by virtue of the
  fact that there needs to be a timestamp trailer provided. BUT:

- The SJA1110 does not parse its own DSA tags in a way that is useful
  for routing in cross-chip topologies, a la Marvell. And the sja1105
  driver already supports cross-chip bridging from the SJA1105 days.
  It does that by automatically setting up the DSA links as VLAN trunks
  which contain all the necessary tag_8021q RX VLANs that must be
  communicated between the switches that span the same bridge. So when
  using tag_8021q on sja1105, it is possible to have 2 switches with
  ports sw0p0, sw0p1, sw1p0, sw1p1, and 2 VLAN-unaware bridges br0 and
  br1, and br0 can take sw0p0 and sw1p0, and br1 can take sw0p1 and
  sw1p1, and forwarding will happen according to the expected rules of
  the Linux bridge.
  We like that, and we don't want that to go away, so as a matter of
  fact, the SJA1110 tagger still needs to support tag_8021q.

So the sja1110 tagger is a hybrid between tag_8021q for data packets,
and the native hardware support for control packets.

On RX, packets have a 13-byte trailer if they contain an RX timestamp.
That trailer is padded in such a way that its byte 8 (the start of the
"residence time" field - not parsed by Linux because we don't care) is
aligned on a 16 byte boundary. So the padding has a variable length
between 0 and 15 bytes. The DSA header contains the offset of the
beginning of the padding relative to the beginning of the frame (and the
end of the padding is obviously the end of the packet minus 13 bytes,
the length of the trailer). So we discard it.

Packets which don't have a trailer contain the source port and switch ID
information in the header (they are "trap-to-host" packets). Packets
which have a trailer contain the source port and switch ID in the trailer.

On TX, the destination port mask and switch ID is always in the trailer,
so we always need to say in the header that a trailer is present.

The header needs a custom EtherType and this was chosen as 0xdadc, after
0xdada which is for Marvell and 0xdadb which is for VLANs in
VLAN-unaware mode on SJA1105 (and SJA1110 in fact too).

Because we use tag_8021q in concert with the native tagging protocol,
control packets will have 2 DSA tags.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11 12:45:38 -07:00
Vladimir Oltean
4e50025129 net: dsa: generalize overhead for taggers that use both headers and trailers
Some really really weird switches just couldn't decide whether to use a
normal or a tail tagger, so they just did both.

This creates problems for DSA, because we only have the concept of an
'overhead' which can be applied to the headroom or to the tailroom of
the skb (like for example during the central TX reallocation procedure),
depending on the value of bool tail_tag, but not to both.

We need to generalize DSA to cater for these odd switches by
transforming the 'overhead / tail_tag' pair into 'needed_headroom /
needed_tailroom'.

The DSA master's MTU is increased to account for both.

The flow dissector code is modified such that it only calls the DSA
adjustment callback if the tagger has a non-zero header length.

Taggers are trivially modified to declare either needed_headroom or
needed_tailroom, based on the tail_tag value that they currently
declare.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11 12:45:38 -07:00
Eric Dumazet
b71eaed8c0 inet: annotate date races around sk->sk_txhash
UDP sendmsg() path can be lockless, it is possible for another
thread to re-connect an change sk->sk_txhash under us.

There is no serious impact, but we can use READ_ONCE()/WRITE_ONCE()
pair to document the race.

BUG: KCSAN: data-race in __ip4_datagram_connect / skb_set_owner_w

write to 0xffff88813397920c of 4 bytes by task 30997 on cpu 1:
 sk_set_txhash include/net/sock.h:1937 [inline]
 __ip4_datagram_connect+0x69e/0x710 net/ipv4/datagram.c:75
 __ip6_datagram_connect+0x551/0x840 net/ipv6/datagram.c:189
 ip6_datagram_connect+0x2a/0x40 net/ipv6/datagram.c:272
 inet_dgram_connect+0xfd/0x180 net/ipv4/af_inet.c:580
 __sys_connect_file net/socket.c:1837 [inline]
 __sys_connect+0x245/0x280 net/socket.c:1854
 __do_sys_connect net/socket.c:1864 [inline]
 __se_sys_connect net/socket.c:1861 [inline]
 __x64_sys_connect+0x3d/0x50 net/socket.c:1861
 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
 entry_SYSCALL_64_after_hwframe+0x44/0xae

read to 0xffff88813397920c of 4 bytes by task 31039 on cpu 0:
 skb_set_hash_from_sk include/net/sock.h:2211 [inline]
 skb_set_owner_w+0x118/0x220 net/core/sock.c:2101
 sock_alloc_send_pskb+0x452/0x4e0 net/core/sock.c:2359
 sock_alloc_send_skb+0x2d/0x40 net/core/sock.c:2373
 __ip6_append_data+0x1743/0x21a0 net/ipv6/ip6_output.c:1621
 ip6_make_skb+0x258/0x420 net/ipv6/ip6_output.c:1983
 udpv6_sendmsg+0x160a/0x16b0 net/ipv6/udp.c:1527
 inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:642
 sock_sendmsg_nosec net/socket.c:654 [inline]
 sock_sendmsg net/socket.c:674 [inline]
 ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
 ___sys_sendmsg net/socket.c:2404 [inline]
 __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490
 __do_sys_sendmmsg net/socket.c:2519 [inline]
 __se_sys_sendmmsg net/socket.c:2516 [inline]
 __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516
 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
 entry_SYSCALL_64_after_hwframe+0x44/0xae

value changed: 0xbca3c43d -> 0xfdb309e0

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 31039 Comm: syz-executor.2 Not tainted 5.13.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-10 14:12:54 -07:00