Commit Graph

64023 Commits

Author SHA1 Message Date
Julian Wiedmann
2c3b4456c8 net/af_iucv: build SG skbs for TRANS_HIPER sockets
The TX path no longer falls apart when some of its SG skbs are later
linearized by lower layers of the stack. So enable the use of SG skbs
in iucv_sock_sendmsg() again.

This effectively reverts
commit dc5367bcc5 ("net/af_iucv: don't use paged skbs for TX on HiperSockets").

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:36:22 -08:00
Julian Wiedmann
80bc97aa0a net/af_iucv: don't track individual TX skbs for TRANS_HIPER sockets
Stop maintaining the skb_send_q list for TRANS_HIPER sockets.

Not only is it extra overhead, but keeping around a list of skb clones
means that we later also have to match the ->sk_txnotify() calls
against these clones and free them accordingly.
The current matching logic (comparing the skbs' shinfo location) is
frustratingly fragile, and breaks if the skb's head is mangled in any
sort of way while passing from dev_queue_xmit() to the device's
HW queue.

Also adjust the interface for ->sk_txnotify(), to make clear that we
don't actually care about any skb internals.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:36:21 -08:00
Julian Wiedmann
ef6af7bdb9 net/af_iucv: count packets in the xmit path
The TX code keeps track of all skbs that are in-flight but haven't
actually been sent out yet. For native IUCV sockets that's not a huge
deal, but with TRANS_HIPER sockets it would be much better if we
didn't need to maintain a list of skb clones.

Note that we actually only care about the _count_ of skbs in this stage
of the TX pipeline. So as prep work for removing the skb tracking on
TRANS_HIPER sockets, keep track of the skb count in a separate variable
and pair any list {enqueue, unlink} with a count {increment, decrement}.

Then replace all occurences where we currently look at the skb list's
fill level.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:36:21 -08:00
Julian Wiedmann
c464444fa2 net/af_iucv: don't lookup the socket on TX notification
Whoever called iucv_sk(sk)->sk_txnotify() must already know that they're
dealing with an af_iucv socket.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:36:21 -08:00
Alexander Egorenkov
27e9c1de52 net/af_iucv: remove WARN_ONCE on malformed RX packets
syzbot reported the following finding:

AF_IUCV failed to receive skb, len=0
WARNING: CPU: 0 PID: 522 at net/iucv/af_iucv.c:2039 afiucv_hs_rcv+0x174/0x190 net/iucv/af_iucv.c:2039
CPU: 0 PID: 522 Comm: syz-executor091 Not tainted 5.10.0-rc1-syzkaller-07082-g55027a88ec9f #0
Hardware name: IBM 3906 M04 701 (KVM/Linux)
Call Trace:
 [<00000000b87ea538>] afiucv_hs_rcv+0x178/0x190 net/iucv/af_iucv.c:2039
([<00000000b87ea534>] afiucv_hs_rcv+0x174/0x190 net/iucv/af_iucv.c:2039)
 [<00000000b796533e>] __netif_receive_skb_one_core+0x13e/0x188 net/core/dev.c:5315
 [<00000000b79653ce>] __netif_receive_skb+0x46/0x1c0 net/core/dev.c:5429
 [<00000000b79655fe>] netif_receive_skb_internal+0xb6/0x220 net/core/dev.c:5534
 [<00000000b796ac3a>] netif_receive_skb+0x42/0x318 net/core/dev.c:5593
 [<00000000b6fd45f4>] tun_rx_batched.isra.0+0x6fc/0x860 drivers/net/tun.c:1485
 [<00000000b6fddc4e>] tun_get_user+0x1c26/0x27f0 drivers/net/tun.c:1939
 [<00000000b6fe0f00>] tun_chr_write_iter+0x158/0x248 drivers/net/tun.c:1968
 [<00000000b4f22bfa>] call_write_iter include/linux/fs.h:1887 [inline]
 [<00000000b4f22bfa>] new_sync_write+0x442/0x648 fs/read_write.c:518
 [<00000000b4f238fe>] vfs_write.part.0+0x36e/0x5d8 fs/read_write.c:605
 [<00000000b4f2984e>] vfs_write+0x10e/0x148 fs/read_write.c:615
 [<00000000b4f29d0e>] ksys_write+0x166/0x290 fs/read_write.c:658
 [<00000000b8dc4ab4>] system_call+0xe0/0x28c arch/s390/kernel/entry.S:415
Last Breaking-Event-Address:
 [<00000000b8dc64d4>] __s390_indirect_jump_r14+0x0/0xc

Malformed RX packets shouldn't generate any warnings because
debugging info already flows to dropmon via the kfree_skb().

Signed-off-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Reviewed-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 20:36:21 -08:00
Jakub Kicinski
c358f95205 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/can/dev.c
  b552766c87 ("can: dev: prevent potential information leak in can_fill_info()")
  3e77f70e73 ("can: dev: move driver related infrastructure into separate subdir")
  0a042c6ec9 ("can: dev: move netlink related code into seperate file")

  Code move.

drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
  57ac4a31c4 ("net/mlx5e: Correctly handle changing the number of queues when the interface is down")
  214baf2287 ("net/mlx5e: Support HTB offload")

  Adjacent code changes

net/switchdev/switchdev.c
  20776b465c ("net: switchdev: don't set port_obj_info->handled true when -EOPNOTSUPP")
  ffb68fc58e ("net: switchdev: remove the transaction structure from port object notifiers")
  bae33f2b5a ("net: switchdev: remove the transaction structure from port attributes")

  Transaction parameter gets dropped otherwise keep the fix.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 17:09:31 -08:00
Jakub Kicinski
24a790da0a mlx5 subfunction support
Parav Pandit Says:
 =================
 
 This patchset introduces support for mlx5 subfunction (SF).
 
 A subfunction is a lightweight function that has a parent PCI function on
 which it is deployed. mlx5 subfunction has its own function capabilities
 and its own resources. This means a subfunction has its own dedicated
 queues(txq, rxq, cq, eq). These queues are neither shared nor stolen from
 the parent PCI function.
 
 When subfunction is RDMA capable, it has its own QP1, GID table and rdma
 resources neither shared nor stolen from the parent PCI function.
 
 A subfunction has dedicated window in PCI BAR space that is not shared
 with the other subfunctions or parent PCI function. This ensures that all
 class devices of the subfunction accesses only assigned PCI BAR space.
 
 A Subfunction supports eswitch representation through which it supports tc
 offloads. User must configure eswitch to send/receive packets from/to
 subfunction port.
 
 Subfunctions share PCI level resources such as PCI MSI-X IRQs with
 their other subfunctions and/or with its parent PCI function.
 
 Patch summary:
 --------------
 Patch 1 to 4 prepares devlink
 patch 5 to 7 mlx5 adds SF device support
 Patch 8 to 11 mlx5 adds SF devlink port support
 Patch 12 and 14 adds documentation
 
 Patch-1 prepares code to handle multiple port function attributes
 Patch-2 introduces devlink pcisf port flavour similar to pcipf and pcivf
 Patch-3 adds port add and delete driver callbacks
 Patch-4 adds port function state get and set callbacks
 Patch-5 mlx5 vhca event notifier support to distribute subfunction
         state change notification
 Patch-6 adds SF auxiliary device
 Patch-7 adds SF auxiliary driver
 Patch-8 prepares eswitch to handler SF vport
 Patch-9 adds eswitch helpers to add/remove SF vport
 Patch-10 implements devlink port add/del callbacks
 Patch-11 implements devlink port function get/set callbacks
 Patch-12 to 14 adds documentation
 Patch-12 added mlx5 port function documentation
 Patch-13 adds subfunction documentation
 Patch-14 adds mlx5 subfunction documentation
 
 Subfunction support is discussed in detail in RFC [1] and [2].
 RFC [1] and extension [2] describes requirements, design and proposed
 plumbing using devlink, auxiliary bus and sysfs for systemd/udev
 support. Functionality of this patchset is best explained using real
 examples further below.
 
 overview:
 --------
 A subfunction can be created and deleted by a user using devlink port
 add/delete interface.
 
 A subfunction can be configured using devlink port function attribute
 before its activated.
 
 When a subfunction is activated, it results in an auxiliary device on
 the host PCI device where it is deployed. A driver binds to the
 auxiliary device that further creates supported class devices.
 
 example subfunction usage sequence:
 -----------------------------------
 Change device to switchdev mode:
 $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
 
 Add a devlink port of subfunction flavour:
 $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
 
 Configure mac address of the port function:
 $ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88
 
 Now activate the function:
 $ devlink port function set ens2f0npf0sf88 state active
 
 Now use the auxiliary device and class devices:
 $ devlink dev show
 pci/0000:06:00.0
 auxiliary/mlx5_core.sf.4
 
 $ ip link show
 127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
     link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
     altname enp6s0f0np0
 129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
     link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff
 
 $ rdma dev show
 43: rdmap6s0f0: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d112 sys_image_guid 248a:0703:00b3:d112
 44: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112
 
 After use inactivate the function:
 $ devlink port function set ens2f0npf0sf88 state inactive
 
 Now delete the subfunction port:
 $ devlink port del ens2f0npf0sf88
 
 [1] https://lore.kernel.org/netdev/20200519092258.GF4655@nanopsycho/
 [2] https://marc.info/?l=linux-netdev&m=158555928517777&w=2
 
 =================
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmALKDwACgkQSD+KveBX
 +j7qjQf6A1moPhhIlXROCzaJUjlAj2U291LWBveU+I6na6fjYjAAWHYwfv0YKQpo
 Qb0NRt+9abgEpGidc4hOwIJKhK+vlWrQuehRt83aAfAwaN3OEeGuNllniWo821Hj
 sNiJfSC/DslOlQSxKLsAs3Fduy/sV3GN9Zv7hEwOFgEr5QvB2c6H1XiypVP2Ecsd
 ZXC3SuEWxIoRtfXEkTkJne9LNoiDChlvT1FR/z75h8HUBdAOjzBTQzBbM+8M4Msw
 8aKUPya3FMRAPWsOgPhkpU0xTtH2Mi7MC9TlwiWmrK4Q3uvesIav8pVf7r3GNAZA
 sipIZ4gP0M5SiCaZa8rIBpTXBHxmvg==
 =jEG4
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2021-01-13' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5 subfunction support

Parav Pandit says:

This patchset introduces support for mlx5 subfunction (SF).

A subfunction is a lightweight function that has a parent PCI function on
which it is deployed. mlx5 subfunction has its own function capabilities
and its own resources. This means a subfunction has its own dedicated
queues(txq, rxq, cq, eq). These queues are neither shared nor stolen from
the parent PCI function.

When subfunction is RDMA capable, it has its own QP1, GID table and rdma
resources neither shared nor stolen from the parent PCI function.

A subfunction has dedicated window in PCI BAR space that is not shared
with the other subfunctions or parent PCI function. This ensures that all
class devices of the subfunction accesses only assigned PCI BAR space.

A Subfunction supports eswitch representation through which it supports tc
offloads. User must configure eswitch to send/receive packets from/to
subfunction port.

Subfunctions share PCI level resources such as PCI MSI-X IRQs with
their other subfunctions and/or with its parent PCI function.

Subfunction support is discussed in detail in RFC [1] and [2].
RFC [1] and extension [2] describes requirements, design and proposed
plumbing using devlink, auxiliary bus and sysfs for systemd/udev
support. Functionality of this patchset is best explained using real
examples further below.

overview:
--------
A subfunction can be created and deleted by a user using devlink port
add/delete interface.

A subfunction can be configured using devlink port function attribute
before its activated.

When a subfunction is activated, it results in an auxiliary device on
the host PCI device where it is deployed. A driver binds to the
auxiliary device that further creates supported class devices.

example subfunction usage sequence:
-----------------------------------
Change device to switchdev mode:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

Add a devlink port of subfunction flavour:
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

Configure mac address of the port function:
$ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88

Now activate the function:
$ devlink port function set ens2f0npf0sf88 state active

Now use the auxiliary device and class devices:
$ devlink dev show
pci/0000:06:00.0
auxiliary/mlx5_core.sf.4

$ ip link show
127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
    altname enp6s0f0np0
129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff

$ rdma dev show
43: rdmap6s0f0: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d112 sys_image_guid 248a:0703:00b3:d112
44: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112

After use inactivate the function:
$ devlink port function set ens2f0npf0sf88 state inactive

Now delete the subfunction port:
$ devlink port del ens2f0npf0sf88

[1] https://lore.kernel.org/netdev/20200519092258.GF4655@nanopsycho/
[2] https://marc.info/?l=linux-netdev&m=158555928517777&w=2

=================

* tag 'mlx5-updates-2021-01-13' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5: Add devlink subfunction port documentation
  devlink: Extend devlink port documentation for subfunctions
  devlink: Add devlink port documentation
  net/mlx5: SF, Port function state change support
  net/mlx5: SF, Add port add delete functionality
  net/mlx5: E-switch, Add eswitch helpers for SF vport
  net/mlx5: E-switch, Prepare eswitch to handle SF vport
  net/mlx5: SF, Add auxiliary device driver
  net/mlx5: SF, Add auxiliary device support
  net/mlx5: Introduce vhca state event notifier
  devlink: Support get and set state of port function
  devlink: Support add and delete devlink port
  devlink: Introduce PCI SF port flavour and port attribute
  devlink: Prepare code to fill multiple port function attributes
====================

Link: https://lore.kernel.org/r/20210122193658.282884-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 16:57:19 -08:00
Linus Torvalds
909b447dcc Networking fixes for 5.11-rc6, including fixes from can, xfrm, wireless,
wireless-drivers and netfilter trees. Nothing scary, Intel WiFi-related
 fixes seemed most notable to the users.
 
 Current release - regressions:
 
  - dsa: microchip: ksz8795: fix KSZ8794 port map again to program
                             the CPU port correctly
 
 Current release - new code bugs:
 
  - iwlwifi: pcie: reschedule in long-running memory reads
 
 Previous releases - regressions:
 
  - iwlwifi: dbg: don't try to overwrite read-only FW data
 
  - iwlwifi: provide gso_type to GSO packets
 
  - octeontx2: make sure the buffer is 128 byte aligned
 
  - tcp: make TCP_USER_TIMEOUT accurate for zero window probes
 
  - xfrm: fix wraparound in xfrm_policy_addr_delta()
 
  - xfrm: fix oops in xfrm_replay_advance_bmp due to a race between CPUs
          in presence of packet reorder
 
  - tcp: fix TLP timer not set when CA_STATE changes from DISORDER
         to OPEN
 
  - wext: fix NULL-ptr-dereference with cfg80211's lack of commit()
 
 Previous releases - always broken:
 
  - igc: fix link speed advertising
 
  - stmmac: configure EHL PSE0 GbE and PSE1 GbE to 32 bits DMA addressing
 
  - team: protect features update by RCU to avoid deadlock
 
  - xfrm: fix disable_xfrm sysctl when used on xfrm interfaces themselves
 
  - fec: fix temporary RMII clock reset on link up
 
  - can: dev: prevent potential information leak in can_fill_info()
 
 Misc:
 
  - mrp: fix bad packing of MRP test packet structures
 
  - uapi: fix big endian definition of ipv6_rpl_sr_hdr
 
  - add David Ahern to IPv4/IPv6 maintainers
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmATRs4ACgkQMUZtbf5S
 IrtOfQ//Vmn1WprrwLPf6/uOuBN0RAKHC+64IRIw2ahDuiB1QQV0c3ALRd42Xp8n
 qnoDMB/mUWdF/KjjJEKvwYyBuwBeQWLcpgTXi1HvvhxM13PVHjvyIp6hTAYYj+m4
 KyWWzQZwezz0zKQ3wXFdZV4JuefXEgXvMx65o8nk+TsutHn6WK/E6ZnWTexoZ0pa
 5Lab149mtoCdSpT3gr2x1aTqd9KYWaxfarYOUD1GY58BQyDFl4wj10MV3oE7xWPj
 /MKnSBvPx52ajbb+rUVhfFjBN1BmEjdze7cBMncJc5H+0X38R23ZaAlP3gecGaac
 hZ5C2wnSSvRR8KIvSEwbCArlpuyU+exacZXZ0vS6sfgqISKqoPv8erWvpxtLil3v
 YfwZVNPYG9RBwbnDVw1gLQIFn3lUqLhIPnJ8J2Ue6KUm7ur4fO566RjyPU3gkPdp
 5Zj3Eh7hsB2EqOy4RdwnoI0QboWmlq9+wT11HCXPFyJ077JzVU0FzMSvJr4dgVSI
 3D3ckmw+RSej4ib6G4xjpq1tPCFzdf9zlFoUPomRFTKgfJFaky5pEb/22C3bztp1
 43fsv3PiwlQtoYP3pfQsRj+r6DikYwDL7A3lskWohIZXviY2wErKWViUcIXr5ULE
 BxYQq0NYMl4TgDkn525U9EFwVgJAvPAedhYxF7VKn3eHNODqWBo=
 =dwFD
 -----END PGP SIGNATURE-----

Merge tag 'net-5.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Networking fixes including fixes from can, xfrm, wireless,
  wireless-drivers and netfilter trees. Nothing scary, Intel
  WiFi-related fixes seemed most notable to the users.

  Current release - regressions:

   - dsa: microchip: ksz8795: fix KSZ8794 port map again to program the
     CPU port correctly

  Current release - new code bugs:

   - iwlwifi: pcie: reschedule in long-running memory reads

  Previous releases - regressions:

   - iwlwifi: dbg: don't try to overwrite read-only FW data

   - iwlwifi: provide gso_type to GSO packets

   - octeontx2: make sure the buffer is 128 byte aligned

   - tcp: make TCP_USER_TIMEOUT accurate for zero window probes

   - xfrm: fix wraparound in xfrm_policy_addr_delta()

   - xfrm: fix oops in xfrm_replay_advance_bmp due to a race between
     CPUs in presence of packet reorder

   - tcp: fix TLP timer not set when CA_STATE changes from DISORDER to
     OPEN

   - wext: fix NULL-ptr-dereference with cfg80211's lack of commit()

  Previous releases - always broken:

   - igc: fix link speed advertising

   - stmmac: configure EHL PSE0 GbE and PSE1 GbE to 32 bits DMA
     addressing

   - team: protect features update by RCU to avoid deadlock

   - xfrm: fix disable_xfrm sysctl when used on xfrm interfaces
     themselves

   - fec: fix temporary RMII clock reset on link up

   - can: dev: prevent potential information leak in can_fill_info()

  Misc:

   - mrp: fix bad packing of MRP test packet structures

   - uapi: fix big endian definition of ipv6_rpl_sr_hdr

   - add David Ahern to IPv4/IPv6 maintainers"

* tag 'net-5.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (86 commits)
  rxrpc: Fix memory leak in rxrpc_lookup_local
  mlxsw: spectrum_span: Do not overwrite policer configuration
  selftests: forwarding: Specify interface when invoking mausezahn
  stmmac: intel: Configure EHL PSE0 GbE and PSE1 GbE to 32 bits DMA addressing
  net: usb: cdc_ether: added support for Thales Cinterion PLSx3 modem family.
  ibmvnic: Ensure that CRQ entry read are correctly ordered
  MAINTAINERS: add missing header for bonding
  net: decnet: fix netdev refcount leaking on error path
  net: switchdev: don't set port_obj_info->handled true when -EOPNOTSUPP
  can: dev: prevent potential information leak in can_fill_info()
  net: fec: Fix temporary RMII clock reset on link up
  net: lapb: Add locking to the lapb module
  team: protect features update by RCU to avoid deadlock
  MAINTAINERS: add David Ahern to IPv4/IPv6 maintainers
  net/mlx5: CT: Fix incorrect removal of tuple_nat_node from nat rhashtable
  net/mlx5e: Revert parameters on errors when changing MTU and LRO state without reset
  net/mlx5e: Revert parameters on errors when changing trust state without reset
  net/mlx5e: Correctly handle changing the number of queues when the interface is down
  net/mlx5e: Fix CT rule + encap slow path offload and deletion
  net/mlx5e: Disable hw-tc-offload when MLX5_CLS_ACT config is disabled
  ...
2021-01-28 15:24:43 -08:00
Takeshi Misawa
b8323f7288 rxrpc: Fix memory leak in rxrpc_lookup_local
Commit 9ebeddef58 ("rxrpc: rxrpc_peer needs to hold a ref on the rxrpc_local record")
Then release ref in __rxrpc_put_peer and rxrpc_put_peer_locked.

	struct rxrpc_peer *rxrpc_alloc_peer(struct rxrpc_local *local, gfp_t gfp)
	-               peer->local = local;
	+               peer->local = rxrpc_get_local(local);

rxrpc_discard_prealloc also need ref release in discarding.

syzbot report:
BUG: memory leak
unreferenced object 0xffff8881080ddc00 (size 256):
  comm "syz-executor339", pid 8462, jiffies 4294942238 (age 12.350s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 0a 00 00 00 00 c0 00 08 81 88 ff ff  ................
  backtrace:
    [<000000002b6e495f>] kmalloc include/linux/slab.h:552 [inline]
    [<000000002b6e495f>] kzalloc include/linux/slab.h:682 [inline]
    [<000000002b6e495f>] rxrpc_alloc_local net/rxrpc/local_object.c:79 [inline]
    [<000000002b6e495f>] rxrpc_lookup_local+0x1c1/0x760 net/rxrpc/local_object.c:244
    [<000000006b43a77b>] rxrpc_bind+0x174/0x240 net/rxrpc/af_rxrpc.c:149
    [<00000000fd447a55>] afs_open_socket+0xdb/0x200 fs/afs/rxrpc.c:64
    [<000000007fd8867c>] afs_net_init+0x2b4/0x340 fs/afs/main.c:126
    [<0000000063d80ec1>] ops_init+0x4e/0x190 net/core/net_namespace.c:152
    [<00000000073c5efa>] setup_net+0xde/0x2d0 net/core/net_namespace.c:342
    [<00000000a6744d5b>] copy_net_ns+0x19f/0x3e0 net/core/net_namespace.c:483
    [<0000000017d3aec3>] create_new_namespaces+0x199/0x4f0 kernel/nsproxy.c:110
    [<00000000186271ef>] unshare_nsproxy_namespaces+0x9b/0x120 kernel/nsproxy.c:226
    [<000000002de7bac4>] ksys_unshare+0x2fe/0x5c0 kernel/fork.c:2957
    [<00000000349b12ba>] __do_sys_unshare kernel/fork.c:3025 [inline]
    [<00000000349b12ba>] __se_sys_unshare kernel/fork.c:3023 [inline]
    [<00000000349b12ba>] __x64_sys_unshare+0x12/0x20 kernel/fork.c:3023
    [<000000006d178ef7>] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    [<00000000637076d4>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 9ebeddef58 ("rxrpc: rxrpc_peer needs to hold a ref on the rxrpc_local record")
Signed-off-by: Takeshi Misawa <jeliantsurux@gmail.com>
Reported-and-tested-by: syzbot+305326672fed51b205f7@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/161183091692.3506637.3206605651502458810.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 13:12:14 -08:00
Eric Dumazet
bbc20b7042 net: reduce indentation level in sk_clone_lock()
Rework initial test to jump over init code
if memory allocation has failed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210127152731.748663-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 11:02:39 -08:00
Johannes Berg
776a39b819 cfg80211: call cfg80211_destroy_ifaces() with wiphy lock held
This is needed since it calls into the driver, which must have the
same context as if we got to destroy an interface through nl80211.
Fix this, and add a direct lockdep assertion so we don't see it
pop up only when the driver calls back to cfg80211.

Fixes: a05829a722 ("cfg80211: avoid holding the RTNL when calling the driver")
Reported-by: syzbot+4305e814f9b267131776@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20210128183454.d31df9cbd7ce.I1beb07c9492f0ade900e864a098c57041e7a7ebf@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-28 19:11:11 +01:00
Johannes Berg
c88f952026 wext: call cfg80211_set_encryption() with wiphy lock held
Similar to the previous commit, we need to hold the wiphy lock
here. There's a second instance that is correct already, fix
this one as well.

Fixes: a05829a722 ("cfg80211: avoid holding the RTNL when calling the driver")
Link: https://lore.kernel.org/r/20210128183454.ea2f086465ed.I891d3bb44f068e6d97c160005010f052f28ab6e5@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-28 19:10:57 +01:00
Johannes Berg
bae173563c wext: call cfg80211_change_iface() with wiphy lock held
This is needed now that all the driver callbacks are protected by
the wiphy lock rather than (just) the RTNL.

Fixes: a05829a722 ("cfg80211: avoid holding the RTNL when calling the driver")
Reported-by: syzbot+d2d412349f88521938aa@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20210128183454.e81bc6789b4b.I5deb8b6bfdc8b4ea7696cb2447ee6c58c7ce9a4e@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-28 19:10:48 +01:00
Johannes Berg
0391a45c80 nl80211: call cfg80211_dev_rename() under RTNL
This is required, and we have an assertion, move the RTNL
unlock down to cover cfg80211_dev_rename().

Fixes: a05829a722 ("cfg80211: avoid holding the RTNL when calling the driver")
Reported-by: syzbot+ed107c5fa3e21cdcd86e@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20210128183454.16a4ea036740.I2beeb391dc322d6f1df3ee2612e714e0a7817397@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-28 19:10:38 +01:00
Arnd Bergmann
b64acb28da ath9k: fix build error with LEDS_CLASS=m
When CONFIG_ATH9K is built-in but LED support is in a loadable
module, both ath9k drivers fails to link:

x86_64-linux-ld: drivers/net/wireless/ath/ath9k/gpio.o: in function `ath_deinit_leds':
gpio.c:(.text+0x36): undefined reference to `led_classdev_unregister'
x86_64-linux-ld: drivers/net/wireless/ath/ath9k/gpio.o: in function `ath_init_leds':
gpio.c:(.text+0x179): undefined reference to `led_classdev_register_ext'

The problem is that the 'imply' keyword does not enforce any dependency
but is only a weak hint to Kconfig to enable another symbol from a
defconfig file.

Change imply to a 'depends on LEDS_CLASS' that prevents the incorrect
configuration but still allows building the driver without LED support.

The 'select MAC80211_LEDS' is now ensures that the LED support is
actually used if it is present, and the added Kconfig dependency
on MAC80211_LEDS ensures that it cannot be enabled manually when it
has no effect.

Fixes: 197f466e93 ("ath9k_htc: Do not select MAC80211_LEDS by default")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/20210125113654.2408057-1-arnd@kernel.org
2021-01-28 09:29:34 +02:00
Aya Levin
e78ab16459 devlink: Add DMAC filter generic packet trap
Add packet trap that can report packets that were dropped due to
destination MAC filtering.

Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-27 19:53:40 -08:00
Jakub Kicinski
5998dd0217 More updates:
* many minstrel improvements, including removal of the old
    minstrel in favour of minstrel_ht
  * speed improvements on FQ
  * support for RX decapsulation (header conversion) offload
  * RTNL reduction: limit RTNL usage in the wireless stack
    mostly to where really needed (regulatory not yet) to
    reduce contention on it
  * various other small updates
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAmAR1hoACgkQB8qZga/f
 l8QPVRAAjbF502gkrM2+VjC31xNrSVtAXn5LIsTHUUee1Kwhhe7VHNqrSYtR+poQ
 jdtBaaszlnkbBqThvIqD6AvbJgUeFhDB9Jcyynhbaq3wPcDiu7IuJikBzkN2Mhyq
 bzKu46jFPG6l31ItCKF9mToujlmcJrJVoS5mWJEi3QIWc9B3dwPsznO57tCVznmU
 95zkhs6tSwgk9ENpDoulXxvpsnlnIzo8f1z0Lr4h9TbzvwzTEA2+by0hmnBk/vZN
 wjBJXNVMlimp+lmgVms/gQfuGG6NHtmtatqsu+iKCRbO5AjOlvvql2pjkf5yNZXP
 GD2LkOcM3DiNP1Mf3pP31du1cFBo7nq+P1/ChekmY1A6/xV3GXFsKofGDCSONu1n
 E5+X+6MFt1gGHAwh8yxmg2HUjpuq6fApnx656HqlYluXqeCRR0hcv0iJ3j9ApAVJ
 Qwn6cP1rjCXXHsbOdGQi/Vx0xmUjEzNw3+j9x5nC5+zM82joRKxFMqZA8revxRUj
 gjkiu6EC4ZWWBzTXY9+qITSgEO8b1y7emBgdKSAQplWDAQnaa8nPGBdQ7QmwPdpV
 vOSZ6Dd4dMKh8Qk7FWPHxYZJKoDz+i0a8g1m9CogC3TO8hNLRt6D37ffTdoOag2m
 648kFBBS4SghKAjH6P4Nf6cfAPHpX/afivnSTpRK691cZ+2pajc=
 =gIbl
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-net-next-2021-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
More updates:
 * many minstrel improvements, including removal of the old
   minstrel in favour of minstrel_ht
 * speed improvements on FQ
 * support for RX decapsulation (header conversion) offload
 * RTNL reduction: limit RTNL usage in the wireless stack
   mostly to where really needed (regulatory not yet) to
   reduce contention on it

* tag 'mac80211-next-for-net-next-2021-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next: (24 commits)
  mac80211: minstrel_ht: fix regression in the max_prob_rate fix
  virt_wifi: fix deadlock on RTNL
  cfg80211: avoid holding the RTNL when calling the driver
  cfg80211: change netdev registration/unregistration semantics
  mac80211: minstrel_ht: fix rounding error in throughput calculation
  mac80211: minstrel_ht: increase stats update interval
  mac80211: minstrel_ht: fix max probability rate selection
  mac80211: minstrel_ht: improve sample rate selection
  mac80211: minstrel_ht: improve ampdu length estimation
  mac80211: minstrel_ht: remove old ewma based rate average code
  mac80211: remove legacy minstrel rate control
  mac80211: minstrel_ht: add support for OFDM rates on non-HT clients
  mac80211: minstrel_ht: clean up CCK code
  mac80211: introduce aql_enable node in debugfs
  cfg80211: Add phyrate conversion support for extended MCS in 60GHz band
  cfg80211: add VHT rate entries for MCS-10 and MCS-11
  mac80211: reduce peer HE MCS/NSS to own capabilities
  mac80211: remove NSS number of 160MHz if not support 160MHz for HE
  mac80211_hwsim: add 6GHz channels
  mac80211: add LDPC encoding to ieee80211_parse_tx_radiotap
  ...
====================

Link: https://lore.kernel.org/r/20210127210915.135550-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-27 19:01:06 -08:00
Jakub Kicinski
df9d80470a linux-can-next-for-5.12-20210127
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEK3kIWJt9yTYMP3ehqclaivrt76kFAmARLD8THG1rbEBwZW5n
 dXRyb25peC5kZQAKCRCpyVqK+u3vqdlXB/48nQ5I+Z1wnhPvbtvyH4tk9XSbJaTt
 4HH+i3R5RUAzHcOmfm2PQHe9/DxiogOQAFv9Lo0t7HN449bM3LMHrhTCcJIrIRf9
 VxFSk4H97wjHR0Zj6TlEe++CTUPUalCpkCluERwqYP9WXRRklXL1mju+WNKnMMl0
 9fl4CvQDWjB2wNXXoZ1SVuoFxyeqiKQHJy9n3Wez8sQTIlguOZvm8glDQlyb4v+q
 rSxpCUrlpOVv6/11NqxQ7CfGdfTgLUi1a4greriwf1PjEXvDArXMjpDG3bo0kbgy
 7Iv0U9GsvtzOPB+6XKxEFeYTKFaixyLugYBAadfvs0lVEIFP1mtlYvQs
 =pHI/
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-next-for-5.12-20210127' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2021-01-27

The first two patches are by me and fix typos on the CAN gw protocol and the
flexcan driver.

The next patch is by Vincent Mailhol and targets the CAN driver infrastructure,
it exports the function that converts the CAN state into a human readable
string.

A patch by me, which target the CAN driver infrastructure, too, makes the
calculation in can_fd_len2dlc() more readable.

A patch by Tom Rix fixes a checkpatch warning in the mcba_usb driver.

The next seven patches target the mcp251xfd driver. Su Yanjun's patch replaces
several hardcoded assumptions when calling regmap, by using
regmap_get_val_bytes(). The remaining patches are by me. First an open coded
check is replaced by an existing helper function, then in the TX path the
padding for CAN-FD frames is cleaned up. The next two patches clean up the RTR
frame handling in the RX and TX path. Then support for len8_dlc is added. The
last patch adds BQL support.

* tag 'linux-can-next-for-5.12-20210127' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
  can: mcp251xfd: add BQL support
  can: mcp251xfd: add len8_dlc support
  can: mcp251xfd: mcp251xfd_tx_obj_from_skb(): don't copy data for RTR CAN frames in TX-path
  can: mcp251xfd: mcp251xfd_hw_rx_obj_to_skb(): don't copy data for RTR CAN frames in RX-path
  can: mcp251xfd: mcp251xfd_tx_obj_from_skb(): clean up padding of CAN-FD frames
  can: mcp251xfd: mcp251xfd_start_xmit(): use mcp251xfd_get_tx_free() to check TX is is full
  can: mcp251xfd: replace sizeof(u32) with val_bytes in regmap
  can: mcba_usb: remove h from printk format specifier
  can: length: can_fd_len2dlc(): make legnth calculation readable again
  can: dev: export can_get_state_str() function
  can: flexcan: fix typos
  can: gw: fix typo
====================

Link: https://lore.kernel.org/r/20210127092227.2775573-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-27 18:53:10 -08:00
Hoang Huu Le
2a9063b7ff tipc: remove duplicated code in tipc_msg_create
Remove a duplicate code checking for header size in tipc_msg_create() as
it's already being done in tipc_msg_init().

Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au>
Link: https://lore.kernel.org/r/20210127025123.6390-1-hoang.h.le@dektech.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-27 18:50:07 -08:00
Jakub Kicinski
0f764eec3e Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Honor stateful expressions defined in the set from the dynset
   extension. The set definition provides a stateful expression
   that must be used by the dynset expression in case it is specified.

2) Missing timeout extension in the set element in the dynset
   extension leads to inconsistent ruleset listing, not allowing
   the user to restore timeout and expiration on ruleset reload.

3) Do not dump the stateful expression from the dynset extension
   if it coming from the set definition.

* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
  netfilter: nft_dynset: dump expressions when set definition contains no expressions
  netfilter: nft_dynset: add timeout extension to template
  netfilter: nft_dynset: honor stateful expressions in set definition
====================

Link: https://lore.kernel.org/r/20210127132512.5472-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-27 17:53:46 -08:00
Nikolay Aleksandrov
2dba407f99 net: bridge: multicast: make tracked EHT hosts limit configurable
Add two new port attributes which make EHT hosts limit configurable and
export the current number of tracked EHT hosts:
 - IFLA_BRPORT_MCAST_EHT_HOSTS_LIMIT: configure/retrieve current limit
 - IFLA_BRPORT_MCAST_EHT_HOSTS_CNT: current number of tracked hosts
Setting IFLA_BRPORT_MCAST_EHT_HOSTS_LIMIT to 0 is currently not allowed.

Note that we have to increase RTNL_SLAVE_MAX_TYPE to 38 minimum, I've
increased it to 40 to have space for two more future entries.

v2: move br_multicast_eht_set_hosts_limit() to br_multicast_eht.c,
    no functional change

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-27 17:40:35 -08:00
Nikolay Aleksandrov
89268b056e net: bridge: multicast: add per-port EHT hosts limit
Add a default limit of 512 for number of tracked EHT hosts per-port.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-27 17:40:35 -08:00
Vadim Fedorenko
3f96d64497 net: decnet: fix netdev refcount leaking on error path
On building the route there is an assumption that the destination
could be local. In this case loopback_dev is used to get the address.
If the address is still cannot be retrieved dn_route_output_slow
returns EADDRNOTAVAIL with loopback_dev reference taken.

Cannot find hash for the fixes tag because this code was introduced
long time ago. I don't think that this bug has ever fired but the
patch is done just to have a consistent code base.

Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Link: https://lore.kernel.org/r/1611619334-20955-1-git-send-email-vfedorenko@novek.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-27 17:33:46 -08:00
Masahiro Yamada
864e898ba3 net: remove redundant 'depends on NET'
These Kconfig files are included from net/Kconfig, inside the
if NET ... endif.

Remove 'depends on NET', which we know it is already met.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/r/20210125232026.106855-1-masahiroy@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-27 17:04:12 -08:00
Masahiro Yamada
d32f834cd6 net: l3mdev: use obj-$(CONFIG_NET_L3_MASTER_DEV) form in net/Makefile
CONFIG_NET_L3_MASTER_DEV is a bool option. Change the ifeq conditional
to the standard obj-$(CONFIG_NET_L3_MASTER_DEV) form.

Use obj-y in net/l3mdev/Makefile because Kbuild visits this Makefile
only when CONFIG_NET_L3_MASTER_DEV=y.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20210125231659.106201-4-masahiroy@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-27 17:03:52 -08:00
Masahiro Yamada
0cfd99b487 net: switchdev: use obj-$(CONFIG_NET_SWITCHDEV) form in net/Makefile
CONFIG_NET_SWITCHDEV is a bool option. Change the ifeq conditional to
the standard obj-$(CONFIG_NET_SWITCHDEV) form.

Use obj-y in net/switchdev/Makefile because Kbuild visits this Makefile
only when CONFIG_NET_SWITCHDEV=y.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/r/20210125231659.106201-3-masahiroy@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-27 17:03:52 -08:00
Masahiro Yamada
1e328ed559 net: dcb: use obj-$(CONFIG_DCB) form in net/Makefile
CONFIG_DCB is a bool option. Change the ifeq conditional to the
standard obj-$(CONFIG_DCB) form.

Use obj-y in net/dcb/Makefile because Kbuild visits this Makefile
only when CONFIG_DCB=y.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/r/20210125231659.106201-2-masahiroy@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-27 17:03:52 -08:00
Masahiro Yamada
8b5f4eb3ab net: move CONFIG_NET guard to top Makefile
When CONFIG_NET is disabled, nothing under the net/ directory is
compiled. Move the CONFIG_NET guard to the top Makefile so the net/
directory is entirely skipped.

When Kbuild visits net/Makefile, CONFIG_NET is obvioulsy 'y' because
CONFIG_NET is a bool option. Clean up net/Makefile.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/r/20210125231659.106201-1-masahiroy@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-27 17:03:52 -08:00
Masahiro Yamada
69783429cd net: sysctl: remove redundant #ifdef CONFIG_NET
CONFIG_NET is a bool option, and this file is compiled only when
CONFIG_NET=y.

Remove #ifdef CONFIG_NET, which we know it is always met.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/r/20210125231421.105936-1-masahiroy@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-27 17:02:43 -08:00
Matthieu Baerts
1f2f1931b2 mptcp: pm nl: reduce variable scope
To avoid confusions like when working on the previous patch, better to
declare and assign this variable only where it is needed.

Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-27 16:53:55 -08:00
Matthieu Baerts
7b9b0f7e12 mptcp: pm nl: support IPv4 mapped in v6 addresses
On one side, we can allow the creation of subflows between v4 mapped in
v6 and v4 addresses. For that we look for v4mapped addresses between the
local address we want to select and the remote one.

On the other side, we also properly deal with received v4mapped
addresses, either announced ones or set via Netlink.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/122
Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Co-developed-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-27 16:53:52 -08:00
Matthieu Baerts
50a13bc394 mptcp: support MPJoin with IPv4 mapped in v6 sk
With an IPv4 mapped in v6 socket, we were trying to call inet6_bind()
with an IPv4 address resulting in a -EINVAL error because the given
addr_len -- size of the address structure -- was too short.

We now make sure to use address structures for the same family as the
MPTCP socket for both the bind() and the connect(). It means we convert
v4 addresses to v4 mapped in v6 or the opposite if needed.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/122
Co-developed-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-27 16:53:47 -08:00
Di Zhu
275b1e88ca pktgen: fix misuse of BUG_ON() in pktgen_thread_worker()
pktgen create threads for all online cpus and bond these threads to
relevant cpu repecivtily. when this thread firstly be woken up, it
will compare cpu currently running with the cpu specified at the time
of creation and if the two cpus are not equal, BUG_ON() will take effect
causing panic on the system.
Notice that these threads could be migrated to other cpus before start
running because of the cpu hotplug after these threads have created. so the
BUG_ON() used here seems unreasonable and we can replace it with WARN_ON()
to just printf a warning other than panic the system.

Signed-off-by: Di Zhu <zhudi21@huawei.com>
Link: https://lore.kernel.org/r/20210125124229.19334-1-zhudi21@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-27 16:46:37 -08:00
Rasmus Villemoes
20776b465c net: switchdev: don't set port_obj_info->handled true when -EOPNOTSUPP
It's not true that switchdev_port_obj_notify() only inspects the
->handled field of "struct switchdev_notifier_port_obj_info" if
call_switchdev_blocking_notifiers() returns 0 - there's a WARN_ON()
triggering for a non-zero return combined with ->handled not being
true. But the real problem here is that -EOPNOTSUPP is not being
properly handled.

The wrapper functions switchdev_handle_port_obj_add() et al change a
return value of -EOPNOTSUPP to 0, and the treatment of ->handled in
switchdev_port_obj_notify() seems to be designed to change that back
to -EOPNOTSUPP in case nobody actually acted on the notifier (i.e.,
everybody returned -EOPNOTSUPP).

Currently, as soon as some device down the stack passes the check_cb()
check, ->handled gets set to true, which means that
switchdev_port_obj_notify() cannot actually ever return -EOPNOTSUPP.

This, for example, means that the detection of hardware offload
support in the MRP code is broken: switchdev_port_obj_add() used by
br_mrp_switchdev_send_ring_test() always returns 0, so since the MRP
code thinks the generation of MRP test frames has been offloaded, no
such frames are actually put on the wire. Similarly,
br_mrp_switchdev_set_ring_role() also always returns 0, causing
mrp->ring_role_offloaded to be set to 1.

To fix this, continue to set ->handled true if any callback returns
success or any error distinct from -EOPNOTSUPP. But if all the
callbacks return -EOPNOTSUPP, make sure that ->handled stays false, so
the logic in switchdev_port_obj_notify() can propagate that
information.

Fixes: 9a9f26e8f7 ("bridge: mrp: Connect MRP API with the switchdev API")
Fixes: f30f0601eb ("switchdev: Add helpers to aid traversal through lower devices")
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Link: https://lore.kernel.org/r/20210125124116.102928-1-rasmus.villemoes@prevas.dk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-27 16:41:59 -08:00
Pablo Neira Ayuso
08a01c11a5 netfilter: nftables: statify nft_parse_register()
This function is not used anymore by any extension, statify it.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-01-27 23:16:04 +01:00
Pablo Neira Ayuso
345023b0db netfilter: nftables: add nft_parse_register_store() and use it
This new function combines the netlink register attribute parser
and the store validation function.

This update requires to replace:

        enum nft_registers      dreg:8;

in many of the expression private areas otherwise compiler complains
with:

        error: cannot take address of bit-field ‘dreg’

when passing the register field as reference.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-01-27 23:16:02 +01:00
Pablo Neira Ayuso
4f16d25c68 netfilter: nftables: add nft_parse_register_load() and use it
This new function combines the netlink register attribute parser
and the load validation function.

This update requires to replace:

	enum nft_registers      sreg:8;

in many of the expression private areas otherwise compiler complains
with:

	error: cannot take address of bit-field ‘sreg’

when passing the register field as reference.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-01-27 22:53:29 +01:00
Felix Fietkau
d3b9b45f7e mac80211: minstrel_ht: fix regression in the max_prob_rate fix
Since mi->max_prob_rate is overwritten after the loop that calls
minstrel_ht_set_best_prob_rate, the new best rate needs to be written to *dest

Fixes: a7fca4e403 ("mac80211: minstrel_ht: fix max probability rate selection")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20210126154409.6755-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-27 22:06:38 +01:00
Jiapeng Zhong
0c87b1ac60 net: Simplify the calculation of variables
Fix the following coccicheck warnings:

 ./net/ipv4/esp4_offload.c:288:32-34: WARNING !A || A && B is
equivalent to !A || B.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Zhong <abaci-bugfix@linux.alibaba.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-01-27 11:49:55 +01:00
Marc Kleine-Budde
12da7a1f3c can: gw: fix typo
This patch fixes a typo found by codespell.

Fixes: 94c23097f9 ("can: gw: support modification of Classical CAN DLCs")
Link: https://lore.kernel.org/r/20210127085529.2768537-3-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-01-27 10:01:46 +01:00
Praveen Chaudhary
6b2e04bc24 net: allow user to set metric on default route learned via Router Advertisement
For IPv4, default route is learned via DHCPv4 and user is allowed to change
metric using config etc/network/interfaces. But for IPv6, default route can
be learned via RA, for which, currently a fixed metric value 1024 is used.

Ideally, user should be able to configure metric on default route for IPv6
similar to IPv4. This patch adds sysctl for the same.

Logs:

For IPv4:

Config in etc/network/interfaces:
auto eth0
iface eth0 inet dhcp
    metric 4261413864

IPv4 Kernel Route Table:
$ ip route list
default via 172.21.47.1 dev eth0 metric 4261413864

FRR Table, if a static route is configured:
[In real scenario, it is useful to prefer BGP learned default route over DHCPv4 default route.]
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, P - PIM, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       > - selected route, * - FIB route

S>* 0.0.0.0/0 [20/0] is directly connected, eth0, 00:00:03
K   0.0.0.0/0 [254/1000] via 172.21.47.1, eth0, 6d08h51m

i.e. User can prefer Default Router learned via Routing Protocol in IPv4.
Similar behavior is not possible for IPv6, without this fix.

After fix [for IPv6]:
sudo sysctl -w net.ipv6.conf.eth0.net.ipv6.conf.eth0.ra_defrtr_metric=1996489705

IP monitor: [When IPv6 RA is received]
default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489705  pref high

Kernel IPv6 routing table
$ ip -6 route list
default via fe80::be16:65ff:feb3:ce8e dev eth0 proto ra metric 1996489705 expires 21sec hoplimit 64 pref high

FRR Table, if a static route is configured:
[In real scenario, it is useful to prefer BGP learned default route over IPv6 RA default route.]
Codes: K - kernel route, C - connected, S - static, R - RIPng,
       O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table,
       v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       > - selected route, * - FIB route

S>* ::/0 [20/0] is directly connected, eth0, 00:00:06
K   ::/0 [119/1001] via fe80::xx16:xxxx:feb3:ce8e, eth0, 6d07h43m

If the metric is changed later, the effect will be seen only when next IPv6
RA is received, because the default route must be fully controlled by RA msg.
Below metric is changed from 1996489705 to 1996489704.

$ sudo sysctl -w net.ipv6.conf.eth0.ra_defrtr_metric=1996489704
net.ipv6.conf.eth0.ra_defrtr_metric = 1996489704

IP monitor:
[On next IPv6 RA msg, Kernel deletes prev route and installs new route with updated metric]

Deleted default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489705 expires 3sec hoplimit 64 pref high
default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489704 pref high

Signed-off-by: Praveen Chaudhary <pchaudhary@linkedin.com>
Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20210125214430.24079-1-pchaudhary@linkedin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-26 18:39:45 -08:00
Xie He
b491e6a739 net: lapb: Add locking to the lapb module
In the lapb module, the timers may run concurrently with other code in
this module, and there is currently no locking to prevent the code from
racing on "struct lapb_cb". This patch adds locking to prevent racing.

1. Add "spinlock_t lock" to "struct lapb_cb"; Add "spin_lock_bh" and
"spin_unlock_bh" to APIs, timer functions and notifier functions.

2. Add "bool t1timer_stop, t2timer_stop" to "struct lapb_cb" to make us
able to ask running timers to abort; Modify "lapb_stop_t1timer" and
"lapb_stop_t2timer" to make them able to abort running timers;
Modify "lapb_t2timer_expiry" and "lapb_t1timer_expiry" to make them
abort after they are stopped by "lapb_stop_t1timer", "lapb_stop_t2timer",
and "lapb_start_t1timer", "lapb_start_t2timer".

3. Let lapb_unregister wait for other API functions and running timers
to stop.

4. The lapb_device_event function calls lapb_disconnect_request. In
order to avoid trying to hold the lock twice, add a new function named
"__lapb_disconnect_request" which assumes the lock is held, and make
it called by lapb_disconnect_request and lapb_device_event.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Cc: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Link: https://lore.kernel.org/r/20210126040939.69995-1-xie.he.0141@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-26 17:53:45 -08:00
Nikolay Aleksandrov
3e841bacf7 net: bridge: multicast: fix br_multicast_eht_set_entry_lookup indentation
Fix the messed up indentation in br_multicast_eht_set_entry_lookup().

Fixes: baa74d39ca ("net: bridge: multicast: add EHT source set handling functions")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20210125082040.13022-1-razor@blackwall.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-26 16:26:50 -08:00
Jakub Kicinski
c5e9e8d48a A couple of fixes:
* fix 160 MHz channel switch in mac80211
  * fix a staging driver to not deadlock due to some
    recent cfg80211 changes
  * fix NULL-ptr deref if cfg80211 returns -EINPROGRESS
    to wext (syzbot)
  * pause TX in mac80211 in type change to prevent crashes
    (syzbot)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAmAQEXcACgkQB8qZga/f
 l8SYWg//X01WgOAAu8mTbnXGdnHF/ZTssp+/eenmoc/XroJgMmvbt4b/gfSt0xMF
 kEd3kPBPtp6Ttgr4HQAyfFctmtXcgq9qo6L8wBWy8WeDHBjaOMLWBLMlZ3m3/abH
 oU8LvDNO0FYCf5qCIzmpFJYXlEsyv22XOi9GYx62ux5g7DbWO6C1Me3dn7im+IiW
 +/P7Qy5lSDpww9F89wMfkwP4dDNcycgKkeRn6IEH1gJxzF6x6GFjuTvrr0finBnO
 9m/zgZCbBm0VIdFcySiJUiWqyhDgdfa3dfcOtNwegkH8mZUZ+h8sEG67Ku4Hp56L
 Js6AmOw4TA9kSs11nOWOZm9609kk8s8ucy88A8v7ct6O9FkxvzcK6VLo3GCGTBPE
 8BsPJoL9OXvziT/vwuGPJDruO3Vv6yzoDf8cSiiMGCCOe5TPioucDSbU8QV4N02Q
 OJ8NcgRtxIo0GC6DD8xOvafWVvc6/byruBpe1x7SkkgfK0NIaMKgWh2g8Ebi3RpE
 H1X/2FpsZ7AgKDs2yfJs4oFOwL72o8WezH6jTM4AFoGOEtxPnrYaG1+R35y6R9Vp
 DwPRo9bTyVb4rpnoOv/FYMoBaKpP7qQDyGlwvGeqUlAW8Yj+H+MmxVUv5JUIj4Cu
 Zdy0bFq27dntJ+/jnoryFfU/afyOdkANAu9cmlx+vXqr1etLvww=
 =g3wj
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-net-2021-01-26' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
A couple of fixes:
 * fix 160 MHz channel switch in mac80211
 * fix a staging driver to not deadlock due to some
   recent cfg80211 changes
 * fix NULL-ptr deref if cfg80211 returns -EINPROGRESS
   to wext (syzbot)
 * pause TX in mac80211 in type change to prevent crashes
   (syzbot)

* tag 'mac80211-for-net-2021-01-26' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211:
  staging: rtl8723bs: fix wireless regulatory API misuse
  mac80211: pause TX while changing interface type
  wext: fix NULL-ptr-dereference with cfg80211's lack of commit()
  mac80211: 160MHz with extended NSS BW in CSA
====================

Link: https://lore.kernel.org/r/20210126130529.75225-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-26 15:23:18 -08:00
Johannes Berg
054c9939b4 mac80211: pause TX while changing interface type
syzbot reported a crash that happened when changing the interface
type around a lot, and while it might have been easy to fix just
the symptom there, a little deeper investigation found that really
the reason is that we allowed packets to be transmitted while in
the middle of changing the interface type.

Disallow TX by stopping the queues while changing the type.

Fixes: 34d4bc4d41 ("mac80211: support runtime interface type changes")
Reported-by: syzbot+d7a3b15976bf7de2238a@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20210122171115.b321f98f4d4f.I6997841933c17b093535c31d29355be3c0c39628@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-26 11:59:45 +01:00
Johannes Berg
5122565188 wext: fix NULL-ptr-dereference with cfg80211's lack of commit()
Since cfg80211 doesn't implement commit, we never really cared about
that code there (and it's configured out w/o CONFIG_WIRELESS_EXT).
After all, since it has no commit, it shouldn't return -EIWCOMMIT to
indicate commit is needed.

However, EIWCOMMIT is actually an alias for EINPROGRESS, which _can_
happen if e.g. we try to change the frequency but we're already in
the process of connecting to some network, and drivers could return
that value (or even cfg80211 itself might).

This then causes us to crash because dev->wireless_handlers is NULL
but we try to check dev->wireless_handlers->standard[0].

Fix this by also checking dev->wireless_handlers. Also simplify the
code a little bit.

Cc: stable@vger.kernel.org
Reported-by: syzbot+444248c79e117bc99f46@syzkaller.appspotmail.com
Reported-by: syzbot+8b2a88a09653d4084179@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20210121171621.2076e4a37d5a.I5d9c72220fe7bb133fb718751da0180a57ecba4e@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-26 11:59:42 +01:00
Johannes Berg
a05829a722 cfg80211: avoid holding the RTNL when calling the driver
Currently, _everything_ in cfg80211 holds the RTNL, and if you
have a slow USB device (or a few) you can get some bad lock
contention on that.

Fix that by re-adding a mutex to each wiphy/rdev as we had at
some point, so we have locking for the wireless_dev lists and
all the other things in there, and also so that drivers still
don't have to worry too much about it (they still won't get
parallel calls for a single device).

Then, we can restrict the RTNL to a few cases where we add or
remove interfaces and really need the added protection. Some
of the global list management still also uses the RTNL, since
we need to have it anyway for netdev management, but we only
hold the RTNL for very short periods of time here.

Link: https://lore.kernel.org/r/20210122161942.81df9f5e047a.I4a8e1a60b18863ea8c5e6d3a0faeafb2d45b2f40@changeid
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> [marvell driver issues]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-26 11:55:50 +01:00
Jiapeng Zhong
8d21c882ab bridge: Use PTR_ERR_OR_ZERO instead if(IS_ERR(...)) + PTR_ERR
coccicheck suggested using PTR_ERR_OR_ZERO() and looking at the code.

Fix the following coccicheck warnings:

./net/bridge/br_multicast.c:1295:7-13: WARNING: PTR_ERR_OR_ZERO can be
used.

Reported-by: Abaci <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Zhong <abaci-bugfix@linux.alibaba.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/1611542381-91178-1-git-send-email-abaci-bugfix@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-25 18:23:07 -08:00
Pablo Neira Ayuso
dbc859d96f netfilter: flowtable: add hash offset field to tuple
Add a placeholder field to calculate hash tuple offset. Similar to
2c407aca64 ("netfilter: conntrack: avoid gcc-10 zero-length-bounds
warning").

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-01-26 01:10:07 +01:00
Darby Payne
012da53d1a ipvs: add weighted random twos choice algorithm
Adds the random twos choice load-balancing algorithm. The algorithm will
pick two random servers based on weights. Then select the server with
the least amount of connections normalized by weight. The algorithm
avoids the "herd behavior" problem. The algorithm comes from a paper
by Michael Mitzenmacher available here
http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/twosurvey.pdf

Signed-off-by: Darby Payne <darby.payne@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-01-26 01:09:46 +01:00
Florian Westphal
83ace77f51 netfilter: ctnetlink: remove get_ct indirection
Use nf_ct_get() directly, its a small inline helper without dependencies.

Add CONFIG_NF_CONNTRACK guards to elide the relevant part when conntrack
isn't available at all.

v2: add ifdef guard around nf_ct_get call (kernel test robot)
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-01-25 22:06:11 +01:00
Dave Wysochanski
e4a7d1f770 SUNRPC: Handle 0 length opaque XDR object data properly
When handling an auth_gss downcall, it's possible to get 0-length
opaque object for the acceptor.  In the case of a 0-length XDR
object, make sure simple_get_netobj() fills in dest->data = NULL,
and does not continue to kmemdup() which will set
dest->data = ZERO_SIZE_PTR for the acceptor.

The trace event code can handle NULL but not ZERO_SIZE_PTR for a
string, and so without this patch the rpcgss_context trace event
will crash the kernel as follows:

[  162.887992] BUG: kernel NULL pointer dereference, address: 0000000000000010
[  162.898693] #PF: supervisor read access in kernel mode
[  162.900830] #PF: error_code(0x0000) - not-present page
[  162.902940] PGD 0 P4D 0
[  162.904027] Oops: 0000 [#1] SMP PTI
[  162.905493] CPU: 4 PID: 4321 Comm: rpc.gssd Kdump: loaded Not tainted 5.10.0 #133
[  162.908548] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  162.910978] RIP: 0010:strlen+0x0/0x20
[  162.912505] Code: 48 89 f9 74 09 48 83 c1 01 80 39 00 75 f7 31 d2 44 0f b6 04 16 44 88 04 11 48 83 c2 01 45 84 c0 75 ee c3 0f 1f 80 00 00 00 00 <80> 3f 00 74 10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31
[  162.920101] RSP: 0018:ffffaec900c77d90 EFLAGS: 00010202
[  162.922263] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000fffde697
[  162.925158] RDX: 000000000000002f RSI: 0000000000000080 RDI: 0000000000000010
[  162.928073] RBP: 0000000000000010 R08: 0000000000000e10 R09: 0000000000000000
[  162.930976] R10: ffff8e698a590cb8 R11: 0000000000000001 R12: 0000000000000e10
[  162.933883] R13: 00000000fffde697 R14: 000000010034d517 R15: 0000000000070028
[  162.936777] FS:  00007f1e1eb93700(0000) GS:ffff8e6ab7d00000(0000) knlGS:0000000000000000
[  162.940067] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  162.942417] CR2: 0000000000000010 CR3: 0000000104eba000 CR4: 00000000000406e0
[  162.945300] Call Trace:
[  162.946428]  trace_event_raw_event_rpcgss_context+0x84/0x140 [auth_rpcgss]
[  162.949308]  ? __kmalloc_track_caller+0x35/0x5a0
[  162.951224]  ? gss_pipe_downcall+0x3a3/0x6a0 [auth_rpcgss]
[  162.953484]  gss_pipe_downcall+0x585/0x6a0 [auth_rpcgss]
[  162.955953]  rpc_pipe_write+0x58/0x70 [sunrpc]
[  162.957849]  vfs_write+0xcb/0x2c0
[  162.959264]  ksys_write+0x68/0xe0
[  162.960706]  do_syscall_64+0x33/0x40
[  162.962238]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  162.964346] RIP: 0033:0x7f1e1f1e57df

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-01-25 15:59:12 -05:00
Dave Wysochanski
ba6dfce47c SUNRPC: Move simple_get_bytes and simple_get_netobj into private header
Remove duplicated helper functions to parse opaque XDR objects
and place inside new file net/sunrpc/auth_gss/auth_gss_internal.h.
In the new file carry the license and copyright from the source file
net/sunrpc/auth_gss/auth_gss.c.  Finally, update the comment inside
include/linux/sunrpc/xdr.h since lockd is not the only user of
struct xdr_netobj.

Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-01-25 15:59:12 -05:00
Bastien Nocera
98d2c3e173 Bluetooth: L2CAP: Try harder to accept device not knowing options
The current implementation of L2CAP options negotiation will continue
the negotiation when a device responds with L2CAP_CONF_UNACCEPT ("unaccepted
options"), but not when the device replies with L2CAP_CONF_UNKNOWN ("unknown
options").

Trying to continue the negotiation without ERTM support will allow
Bluetooth-capable XBox One controllers (notably models 1708 and 1797)
to connect.

btmon before patch:
> ACL Data RX: Handle 256 flags 0x02 dlen 16                            #64 [hci0] 59.182702
      L2CAP: Connection Response (0x03) ident 2 len 8
        Destination CID: 64
        Source CID: 64
        Result: Connection successful (0x0000)
        Status: No further information available (0x0000)
< ACL Data TX: Handle 256 flags 0x00 dlen 23                            #65 [hci0] 59.182744
      L2CAP: Configure Request (0x04) ident 3 len 15
        Destination CID: 64
        Flags: 0x0000
        Option: Retransmission and Flow Control (0x04) [mandatory]
          Mode: Basic (0x00)
          TX window size: 0
          Max transmit: 0
          Retransmission timeout: 0
          Monitor timeout: 0
          Maximum PDU size: 0
> ACL Data RX: Handle 256 flags 0x02 dlen 16                            #66 [hci0] 59.183948
      L2CAP: Configure Request (0x04) ident 1 len 8
        Destination CID: 64
        Flags: 0x0000
        Option: Maximum Transmission Unit (0x01) [mandatory]
          MTU: 1480
< ACL Data TX: Handle 256 flags 0x00 dlen 18                            #67 [hci0] 59.183994
      L2CAP: Configure Response (0x05) ident 1 len 10
        Source CID: 64
        Flags: 0x0000
        Result: Success (0x0000)
        Option: Maximum Transmission Unit (0x01) [mandatory]
          MTU: 1480
> ACL Data RX: Handle 256 flags 0x02 dlen 15                            #69 [hci0] 59.187676
      L2CAP: Configure Response (0x05) ident 3 len 7
        Source CID: 64
        Flags: 0x0000
        Result: Failure - unknown options (0x0003)
        04                                               .
< ACL Data TX: Handle 256 flags 0x00 dlen 12                            #70 [hci0] 59.187722
      L2CAP: Disconnection Request (0x06) ident 4 len 4
        Destination CID: 64
        Source CID: 64
> ACL Data RX: Handle 256 flags 0x02 dlen 12                            #73 [hci0] 59.192714
      L2CAP: Disconnection Response (0x07) ident 4 len 4
        Destination CID: 64
        Source CID: 64

btmon after patch:
> ACL Data RX: Handle 256 flags 0x02 dlen 16                          #248 [hci0] 103.502970
      L2CAP: Connection Response (0x03) ident 5 len 8
        Destination CID: 65
        Source CID: 65
        Result: Connection pending (0x0001)
        Status: No further information available (0x0000)
> ACL Data RX: Handle 256 flags 0x02 dlen 16                          #249 [hci0] 103.504184
      L2CAP: Connection Response (0x03) ident 5 len 8
        Destination CID: 65
        Source CID: 65
        Result: Connection successful (0x0000)
        Status: No further information available (0x0000)
< ACL Data TX: Handle 256 flags 0x00 dlen 23                          #250 [hci0] 103.504398
      L2CAP: Configure Request (0x04) ident 6 len 15
        Destination CID: 65
        Flags: 0x0000
        Option: Retransmission and Flow Control (0x04) [mandatory]
          Mode: Basic (0x00)
          TX window size: 0
          Max transmit: 0
          Retransmission timeout: 0
          Monitor timeout: 0
          Maximum PDU size: 0
> ACL Data RX: Handle 256 flags 0x02 dlen 16                          #251 [hci0] 103.505472
      L2CAP: Configure Request (0x04) ident 3 len 8
        Destination CID: 65
        Flags: 0x0000
        Option: Maximum Transmission Unit (0x01) [mandatory]
          MTU: 1480
< ACL Data TX: Handle 256 flags 0x00 dlen 18                          #252 [hci0] 103.505689
      L2CAP: Configure Response (0x05) ident 3 len 10
        Source CID: 65
        Flags: 0x0000
        Result: Success (0x0000)
        Option: Maximum Transmission Unit (0x01) [mandatory]
          MTU: 1480
> ACL Data RX: Handle 256 flags 0x02 dlen 15                          #254 [hci0] 103.509165
      L2CAP: Configure Response (0x05) ident 6 len 7
        Source CID: 65
        Flags: 0x0000
        Result: Failure - unknown options (0x0003)
        04                                               .
< ACL Data TX: Handle 256 flags 0x00 dlen 12                          #255 [hci0] 103.509426
      L2CAP: Configure Request (0x04) ident 7 len 4
        Destination CID: 65
        Flags: 0x0000
< ACL Data TX: Handle 256 flags 0x00 dlen 12                          #257 [hci0] 103.511870
      L2CAP: Connection Request (0x02) ident 8 len 4
        PSM: 1 (0x0001)
        Source CID: 66
> ACL Data RX: Handle 256 flags 0x02 dlen 14                          #259 [hci0] 103.514121
      L2CAP: Configure Response (0x05) ident 7 len 6
        Source CID: 65
        Flags: 0x0000
        Result: Success (0x0000)

Signed-off-by: Florian Dollinger <dollinger.florian@gmx.de>
Co-developed-by: Florian Dollinger <dollinger.florian@gmx.de>
Reviewed-by: Luiz Augusto Von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-01-25 19:28:35 +01:00
Luiz Augusto von Dentz
4d7ea8ee90 Bluetooth: L2CAP: Fix handling fragmented length
Bluetooth Core Specification v5.2, Vol. 3, Part A, section 1.4, table
1.1:

 'Start Fragments always either begin with the first octet of the Basic
  L2CAP header of a PDU or they have a length of zero (see [Vol 2] Part
  B, Section 6.6.2).'

Apparently this was changed by the following errata:

https://www.bluetooth.org/tse/errata_view.cfm?errata_id=10216

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-01-25 19:26:33 +01:00
Vamshi K Sthambamkadi
5ff20cbe67 Bluetooth: btusb: fix memory leak on suspend and resume
kmemleak report:
unreferenced object 0xffff9b1127f00500 (size 208):
  comm "kworker/u17:2", pid 500, jiffies 4294937470 (age 580.136s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 60 ed 05 11 9b ff ff 00 00 00 00 00 00 00 00  .`..............
  backtrace:
    [<000000006ab3fd59>] kmem_cache_alloc_node+0x17a/0x480
    [<0000000051a5f6f9>] __alloc_skb+0x5b/0x1d0
    [<0000000037e2d252>] hci_prepare_cmd+0x32/0xc0 [bluetooth]
    [<0000000010b586d5>] hci_req_add_ev+0x84/0xe0 [bluetooth]
    [<00000000d2deb520>] hci_req_clear_event_filter+0x42/0x70 [bluetooth]
    [<00000000f864bd8c>] hci_req_prepare_suspend+0x84/0x470 [bluetooth]
    [<000000001deb2cc4>] hci_prepare_suspend+0x31/0x40 [bluetooth]
    [<000000002677dd79>] process_one_work+0x209/0x3b0
    [<00000000aaa62b07>] worker_thread+0x34/0x400
    [<00000000826d176c>] kthread+0x126/0x140
    [<000000002305e558>] ret_from_fork+0x22/0x30
unreferenced object 0xffff9b1125c6ee00 (size 512):
  comm "kworker/u17:2", pid 500, jiffies 4294937470 (age 580.136s)
  hex dump (first 32 bytes):
    04 00 00 00 0d 00 00 00 05 0c 01 00 11 9b ff ff  ................
    00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
  backtrace:
    [<000000009f07c0cc>] slab_post_alloc_hook+0x59/0x270
    [<0000000049431dc2>] __kmalloc_node_track_caller+0x15f/0x330
    [<00000000027a42f6>] __kmalloc_reserve.isra.70+0x31/0x90
    [<00000000e8e3e76a>] __alloc_skb+0x87/0x1d0
    [<0000000037e2d252>] hci_prepare_cmd+0x32/0xc0 [bluetooth]
    [<0000000010b586d5>] hci_req_add_ev+0x84/0xe0 [bluetooth]
    [<00000000d2deb520>] hci_req_clear_event_filter+0x42/0x70 [bluetooth]
    [<00000000f864bd8c>] hci_req_prepare_suspend+0x84/0x470 [bluetooth]
    [<000000001deb2cc4>] hci_prepare_suspend+0x31/0x40 [bluetooth]
    [<000000002677dd79>] process_one_work+0x209/0x3b0
    [<00000000aaa62b07>] worker_thread+0x34/0x400
    [<00000000826d176c>] kthread+0x126/0x140
    [<000000002305e558>] ret_from_fork+0x22/0x30
unreferenced object 0xffff9b112b395788 (size 8):
  comm "kworker/u17:2", pid 500, jiffies 4294937470 (age 580.136s)
  hex dump (first 8 bytes):
    20 00 00 00 00 00 04 00                           .......
  backtrace:
    [<0000000052dc28d2>] kmem_cache_alloc_trace+0x15e/0x460
    [<0000000046147591>] alloc_ctrl_urb+0x52/0xe0 [btusb]
    [<00000000a2ed3e9e>] btusb_send_frame+0x91/0x100 [btusb]
    [<000000001e66030e>] hci_send_frame+0x7e/0xf0 [bluetooth]
    [<00000000bf6b7269>] hci_cmd_work+0xc5/0x130 [bluetooth]
    [<000000002677dd79>] process_one_work+0x209/0x3b0
    [<00000000aaa62b07>] worker_thread+0x34/0x400
    [<00000000826d176c>] kthread+0x126/0x140
    [<000000002305e558>] ret_from_fork+0x22/0x30

In pm sleep-resume context, while the btusb device rebinds, it enters
hci_unregister_dev(), whilst there is a possibility of hdev receiving
PM_POST_SUSPEND suspend_notifier event, leading to generation of msg
frames. When hci_unregister_dev() completes, i.e. hdev context is
destroyed/freed, those intermittently sent msg frames cause memory
leak.

BUG details:
Below is stack trace of thread that enters hci_unregister_dev(), marks
the hdev flag HCI_UNREGISTER to 1, and then goes onto to wait on notifier
lock - refer unregister_pm_notifier().

  hci_unregister_dev+0xa5/0x320 [bluetoot]
  btusb_disconnect+0x68/0x150 [btusb]
  usb_unbind_interface+0x77/0x250
  ? kernfs_remove_by_name_ns+0x75/0xa0
  device_release_driver_internal+0xfe/0x1
  device_release_driver+0x12/0x20
  bus_remove_device+0xe1/0x150
  device_del+0x192/0x3e0
  ? usb_remove_ep_devs+0x1f/0x30
  usb_disable_device+0x92/0x1b0
  usb_disconnect+0xc2/0x270
  hub_event+0x9f6/0x15d0
  ? rpm_idle+0x23/0x360
  ? rpm_idle+0x26b/0x360
  process_one_work+0x209/0x3b0
  worker_thread+0x34/0x400
  ? process_one_work+0x3b0/0x3b0
  kthread+0x126/0x140
  ? kthread_park+0x90/0x90
  ret_from_fork+0x22/0x30

Below is stack trace of thread executing hci_suspend_notifier() which
processes the PM_POST_SUSPEND event, while the unbinding thread is
waiting on lock.

  hci_suspend_notifier.cold.39+0x5/0x2b [bluetooth]
  blocking_notifier_call_chain+0x69/0x90
  pm_notifier_call_chain+0x1a/0x20
  pm_suspend.cold.9+0x334/0x352
  state_store+0x84/0xf0
  kobj_attr_store+0x12/0x20
  sysfs_kf_write+0x3b/0x40
  kernfs_fop_write+0xda/0x1c0
  vfs_write+0xbb/0x250
  ksys_write+0x61/0xe0
  __x64_sys_write+0x1a/0x20
  do_syscall_64+0x37/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix hci_suspend_notifer(), not to act on events when flag HCI_UNREGISTER
is set.

Signed-off-by: Vamshi K Sthambamkadi <vamshi.k.sthambamkadi@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-01-25 19:23:43 +01:00
Pan Bian
28a758c861 Bluetooth: Put HCI device if inquiry procedure interrupts
Jump to the label done to decrement the reference count of HCI device
hdev on path that the Inquiry procedure is interrupted.

Fixes: 3e13fa1e1f ("Bluetooth: Fix hci_inquiry ioctl usage")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-01-25 16:14:58 +01:00
Pan Bian
5a3ef03afe Bluetooth: drop HCI device reference before return
Call hci_dev_put() to decrement reference count of HCI device hdev if
fails to duplicate memory.

Fixes: 0b26ab9dce ("Bluetooth: AMP: Handle Accept phylink command status evt")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-01-25 16:13:52 +01:00
Howard Chung
bf6a4e30ff Bluetooth: disable advertisement filters during suspend
This adds logic to disable and reenable advertisement filters during
suspend and resume. After this patch, we would only receive packets from
devices in allow list during suspend.

Signed-off-by: Howard Chung <howardchung@google.com>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-01-25 16:07:37 +01:00
Archie Pusaka
58ceb1e6d6 Bluetooth: advmon offload MSFT interleave scanning integration
When MSFT extension is supported, we don't have to interleave the scan
as we could just do allowlist scan.

Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-01-25 16:07:36 +01:00
Archie Pusaka
394566bf1e Bluetooth: advmon offload MSFT handle filter enablement
Implements the feature to disable/enable the filter used for
advertising monitor on MSFT controller, effectively have the same
effect as "remove all monitors" and "add all previously removed
monitors".

This feature would be needed when suspending, where we would not want
to get packets from anything outside the allowlist. Note that the
integration with the suspending part is not included in this patch.

Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Reviewed-by: Yun-Hao Chung <howardchung@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-01-25 16:07:36 +01:00
Archie Pusaka
4a37682c6b Bluetooth: advmon offload MSFT handle controller reset
When the controller is powered off, the registered advertising monitor
is removed from the controller. This patch handles the re-registration
of those monitors when the power is on.

Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Reviewed-by: Yun-Hao Chung <howardchung@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-01-25 16:07:36 +01:00
Archie Pusaka
66bd095ab5 Bluetooth: advmon offload MSFT remove monitor
Implements the monitor removal functionality for advertising monitor
offloading to MSFT controllers. Supply handle = 0 to remove all
monitors.

Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Reviewed-by: Yun-Hao Chung <howardchung@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-01-25 16:07:36 +01:00
Archie Pusaka
a2a4dedf88 Bluetooth: advmon offload MSFT add monitor
Enables advertising monitor offloading to the controller, if MSFT
extension is supported. The kernel won't adjust the monitor parameters
to match what the controller supports - that is the user space's
responsibility.

This patch only manages the addition of monitors. Monitor removal is
going to be handled by another patch.

Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Manish Mandlik <mmandlik@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Reviewed-by: Yun-Hao Chung <howardchung@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-01-25 16:07:36 +01:00
Archie Pusaka
b4a221ea8a Bluetooth: advmon offload MSFT add rssi support
MSFT needs rssi parameter for monitoring advertisement packet,
therefore we should supply them from mgmt. This adds a new opcode
to add advertisement monitor with rssi parameters.

Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Manish Mandlik <mmandlik@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Reviewed-by: Yun-Hao Chung <howardchung@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-01-25 16:07:36 +01:00
Pengcheng Yang
62d9f1a694 tcp: fix TLP timer not set when CA_STATE changes from DISORDER to OPEN
Upon receiving a cumulative ACK that changes the congestion state from
Disorder to Open, the TLP timer is not set. If the sender is app-limited,
it can only wait for the RTO timer to expire and retransmit.

The reason for this is that the TLP timer is set before the congestion
state changes in tcp_ack(), so we delay the time point of calling
tcp_set_xmit_timer() until after tcp_fastretrans_alert() returns and
remove the FLAG_SET_XMIT_TIMER from ack_flag when the RACK reorder timer
is set.

This commit has two additional benefits:
1) Make sure to reset RTO according to RFC6298 when receiving ACK, to
avoid spurious RTO caused by RTO timer early expires.
2) Reduce the xmit timer reschedule once per ACK when the RACK reorder
timer is set.

Fixes: df92c8394e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
Link: https://lore.kernel.org/netdev/1611311242-6675-1-git-send-email-yangpc@wangsu.com
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/1611464834-23030-1-git-send-email-yangpc@wangsu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-23 21:33:01 -08:00
Alexander Lobakin
36707061d6 udp: allow forwarding of plain (non-fraglisted) UDP GRO packets
Commit 9fd1ff5d2a ("udp: Support UDP fraglist GRO/GSO.") actually
not only added a support for fraglisted UDP GRO, but also tweaked
some logics the way that non-fraglisted UDP GRO started to work for
forwarding too.
Commit 2e4ef10f58 ("net: add GSO UDP L4 and GSO fraglists to the
list of software-backed types") added GSO UDP L4 to the list of
software GSO to allow virtual netdevs to forward them as is up to
the real drivers.

Tests showed that currently forwarding and NATing of plain UDP GRO
packets are performed fully correctly, regardless if the target
netdevice has a support for hardware/driver GSO UDP L4 or not.
Add the last element and allow to form plain UDP GRO packets if
we are on forwarding path, and the new NETIF_F_GRO_UDP_FWD is
enabled on a receiving netdevice.

If both NETIF_F_GRO_FRAGLIST and NETIF_F_GRO_UDP_FWD are set,
fraglisted GRO takes precedence. This keeps the current behaviour
and is generally more optimal for now, as the number of NICs with
hardware USO offload is relatively small.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-23 20:18:16 -08:00
Alexander Lobakin
6f1c0ea133 net: introduce a netdev feature for UDP GRO forwarding
Introduce a new netdev feature, NETIF_F_GRO_UDP_FWD, to allow user
to turn UDP GRO on and off for forwarding.
Defaults to off to not change current datapath.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-23 20:16:24 -08:00
Enke Chen
344db93ae3 tcp: make TCP_USER_TIMEOUT accurate for zero window probes
The TCP_USER_TIMEOUT is checked by the 0-window probe timer. As the
timer has backoff with a max interval of about two minutes, the
actual timeout for TCP_USER_TIMEOUT can be off by up to two minutes.

In this patch the TCP_USER_TIMEOUT is made more accurate by taking it
into account when computing the timer value for the 0-window probes.

This patch is similar to and builds on top of the one that made
TCP_USER_TIMEOUT accurate for RTOs in commit b701a99e43 ("tcp: Add
tcp_clamp_rto_to_user_timeout() helper to improve accuracy").

Fixes: 9721e709fa ("tcp: simplify window probe aborting on USER_TIMEOUT")
Signed-off-by: Enke Chen <enchen@paloaltonetworks.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210122191306.GA99540@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-23 19:32:51 -08:00
Pan Bian
3a30537cee NFC: fix resource leak when target index is invalid
Goto to the label put_dev instead of the label error to fix potential
resource leak on path that the target index is invalid.

Fixes: c4fbb6515a ("NFC: The core part should generate the target index")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Link: https://lore.kernel.org/r/20210121152748.98409-1-bianpan2016@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-23 13:34:35 -08:00
Pan Bian
d8f923c3ab NFC: fix possible resource leak
Put the device to avoid resource leak on path that the polling flag is
invalid.

Fixes: a831b91320 ("NFC: Do not return EBUSY when stopping a poll that's already stopped")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Link: https://lore.kernel.org/r/20210121153745.122184-1-bianpan2016@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-23 13:34:31 -08:00
Rasmus Villemoes
6781939054 net: mrp: move struct definitions out of uapi
None of these are actually used in the kernel/userspace interface -
there's a userspace component of implementing MRP, and userspace will
need to construct certain frames to put on the wire, but there's no
reason the kernel should provide the relevant definitions in a UAPI
header.

In fact, some of those definitions were broken until previous commit,
so only keep the few that are actually referenced in the kernel code,
and move them to the br_private_mrp.h header.

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-23 12:38:42 -08:00
Danielle Ratson
321f7ab0d4 mlxsw: Register physical ports as a devlink resource
The switch ASIC has a limited capacity of physical ('flavour physical'
in devlink terminology) ports that it can support. While each system is
brought up with a different number of ports, this number can be
increased via splitting up to the ASIC's limit.

Expose physical ports as a devlink resource so that user space will have
visibility to the maximum number of ports that can be supported and the
current occupancy.

In addition, add a "Generic Resources" section in devlink-resource
documentation so the different drivers will be aligned by the same resource
name when exposing to user space.

Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 20:42:13 -08:00
Maxim Mikityanskiy
8327158624 sch_htb: Stats for offloaded HTB
This commit adds support for statistics of offloaded HTB. Bytes and
packets counters for leaf and inner nodes are supported, the values are
taken from per-queue qdiscs, and the numbers that the user sees should
have the same behavior as the software (non-offloaded) HTB.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 20:41:29 -08:00
Maxim Mikityanskiy
d03b195b5a sch_htb: Hierarchical QoS hardware offload
HTB doesn't scale well because of contention on a single lock, and it
also consumes CPU. This patch adds support for offloading HTB to
hardware that supports hierarchical rate limiting.

In the offload mode, HTB passes control commands to the driver using
ndo_setup_tc. The driver has to replicate the whole hierarchy of classes
and their settings (rate, ceil) in the NIC. Every modification of the
HTB tree caused by the admin results in ndo_setup_tc being called.

After this setup, the HTB algorithm is done completely in the NIC. An SQ
(send queue) is created for every leaf class and attached to the
hierarchy, so that the NIC can calculate and obey aggregated rate
limits, too. In the future, it can be changed, so that multiple SQs will
back a single leaf class.

ndo_select_queue is responsible for selecting the right queue that
serves the traffic class of each packet.

The data path works as follows: a packet is classified by clsact, the
driver selects a hardware queue according to its class, and the packet
is enqueued into this queue's qdisc.

This solution addresses two main problems of scaling HTB:

1. Contention by flow classification. Currently the filters are attached
to the HTB instance as follows:

    # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80
    classid 1:10

It's possible to move classification to clsact egress hook, which is
thread-safe and lock-free:

    # tc filter add dev eth0 egress protocol ip flower dst_port 80
    action skbedit priority 1:10

This way classification still happens in software, but the lock
contention is eliminated, and it happens before selecting the TX queue,
allowing the driver to translate the class to the corresponding hardware
queue in ndo_select_queue.

Note that this is already compatible with non-offloaded HTB and doesn't
require changes to the kernel nor iproute2.

2. Contention by handling packets. HTB is not multi-queue, it attaches
to a whole net device, and handling of all packets takes the same lock.
When HTB is offloaded, it registers itself as a multi-queue qdisc,
similarly to mq: HTB is attached to the netdev, and each queue has its
own qdisc.

Some features of HTB may be not supported by some particular hardware,
for example, the maximum number of classes may be limited, the
granularity of rate and ceil parameters may be different, etc. - so, the
offload is not enabled by default, a new parameter is used to enable it:

    # tc qdisc replace dev eth0 root handle 1: htb offload

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 20:41:29 -08:00
Maxim Mikityanskiy
4dd78a7373 net: sched: Add extack to Qdisc_class_ops.delete
In a following commit, sch_htb will start using extack in the delete
class operation to pass hardware errors in offload mode. This commit
prepares for that by adding the extack parameter to this callback and
converting usage of the existing qdiscs.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 20:41:29 -08:00
Arjun Roy
7eeba1706e tcp: Add receive timestamp support for receive zerocopy.
tcp_recvmsg() uses the CMSG mechanism to receive control information
like packet receive timestamps. This patch adds CMSG fields to
struct tcp_zerocopy_receive, and provides receive timestamps
if available to the user.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 20:05:56 -08:00
Arjun Roy
925bba24e6 tcp: Remove CMSG magic numbers for tcp_recvmsg().
At present, tcp_recvmsg() uses flags to track if any CMSGs are pending
and what those CMSGs are. These flags are currently magic numbers,
used only within tcp_recvmsg().

To prepare for receive timestamp support in tcp receive zerocopy,
gently refactor these magic numbers into enums.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 20:05:56 -08:00
Nikolay Aleksandrov
d5a1022283 net: bridge: multicast: mark IGMPv3/MLDv2 fast-leave deletes
Mark groups which were deleted due to fast leave/EHT.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 19:39:57 -08:00
Nikolay Aleksandrov
e87e4b5caa net: bridge: multicast: handle block pg delete for all cases
A block report can result in empty source and host sets for both include
and exclude groups so if there are no hosts left we can safely remove
the group. Pull the block group handling so it can cover both cases and
add a check if EHT requires the delete.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 19:39:57 -08:00
Nikolay Aleksandrov
c9739016a0 net: bridge: multicast: add EHT host filter_mode handling
We should be able to handle host filter mode changing. For exclude mode
we must create a zero-src entry so the group will be kept even without
any S,G entries (non-zero source sets). That entry doesn't count to the
entry limit and can always be created, its timer is refreshed on new
exclude reports and if we change the host filter mode to include then it
gets removed and we rely only on the non-zero source sets.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 19:39:57 -08:00
Nikolay Aleksandrov
b66bf55bbc net: bridge: multicast: optimize TO_INCLUDE EHT timeouts
This is an optimization specifically for TO_INCLUDE which sends queries
for the older entries and thus lowers the S,G timers to LMQT. If we have
the following situation for a group in either include or exclude mode:
 - host A was interested in srcs X and Y, but is timing out
 - host B sends TO_INCLUDE src Z, the bridge lowers X and Y's timeouts
   to LMQT
 - host B sends BLOCK src Z after LMQT time has passed
 => since host B is the last host we can delete the group, but if we
    still have host A's EHT entries for X and Y (i.e. if they weren't
    lowered to LMQT previously) then we'll have to wait another LMQT
    time before deleting the group, with this optimization we can
    directly remove it regardless of the group mode as there are no more
    interested hosts

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 19:39:57 -08:00
Nikolay Aleksandrov
ddc255d993 net: bridge: multicast: add EHT include and exclude handling
Add support for IGMPv3/MLDv2 include and exclude EHT handling. Similar to
how the reports are processed we have 2 cases when the group is in include
or exclude mode, these are processed as follows:
 - group include
  - is_include: create missing entries
  - to_include: flush existing entries and create a new set from the
    report, obviously if the src set is empty then we delete the group

 - group exclude
  - is_exclude: create missing entries
  - to_exclude: flush existing entries and create a new set from the
    report, any empty source set entries are removed

If the group is in a different mode then we just flush all entries reported
by the host and we create a new set with the new mode entries created from
the report. If the report is include type, the source list is empty and
the group has empty sources' set then we remove it. Any source set entries
which are empty are removed as well. If the group is in exclude mode it
can exist without any S,G entries (allowing for all traffic to pass).

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 19:39:57 -08:00
Nikolay Aleksandrov
474ddb37fa net: bridge: multicast: add EHT allow/block handling
Add support for IGMPv3/MLDv2 allow/block EHT handling. Similar to how
the reports are processed we have 2 cases when the group is in include
or exclude mode, these are processed as follows:
 - group include
  - allow: create missing entries
  - block: remove existing matching entries and remove the corresponding
    S,G entries if there are no more set host entries, then possibly
    delete the whole group if there are no more S,G entries

 - group exclude
  - allow
    - host include: create missing entries
    - host exclude: remove existing matching entries and remove the
      corresponding S,G entries if there are no more set host entries
  - block
    - host include: remove existing matching entries and remove the
      corresponding S,G entries if there are no more set host entries,
      then possibly delete the whole group if there are no more S,G entries
    - host exclude: create missing entries

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 19:39:56 -08:00
Nikolay Aleksandrov
dba6b0a5ca net: bridge: multicast: add EHT host delete function
Now that we can delete set entries, we can use that to remove EHT hosts.
Since the group's host set entries exist only when there are related
source set entries we just have to flush all source set entries
joined by the host set entry and it will be automatically removed.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 19:39:56 -08:00
Nikolay Aleksandrov
baa74d39ca net: bridge: multicast: add EHT source set handling functions
Add EHT source set and set-entry create, delete and lookup functions.
These allow to manipulate source sets which contain their own host sets
with entries which joined that S,G. We're limiting the maximum number of
tracked S,G entries per host to PG_SRC_ENT_LIMIT (currently 32) which is
the current maximum of S,G entries for a group. There's a per-set timer
which will be used to destroy the whole set later.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 19:39:56 -08:00
Nikolay Aleksandrov
5b16328879 net: bridge: multicast: add EHT host handling functions
Add functions to create, destroy and lookup an EHT host. These are
per-host entries contained in the eht_host_tree in net_bridge_port_group
which are used to store a list of all sources (S,G) entries joined for that
group by each host, the host's current filter mode and total number of
joined entries.
No functional changes yet, these would be used in later patches.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 19:39:56 -08:00
Nikolay Aleksandrov
8f07b83119 net: bridge: multicast: add EHT structures and definitions
Add EHT structures for tracking hosts and sources per group. We keep one
set for each host which has all of the host's S,G entries, and one set for
each multicast source which has all hosts that have joined that S,G. For
each host, source entry we record the filter_mode and we keep an expiry
timer. There is also one global expiry timer per source set, it is
updated with each set entry update, it will be later used to lower the
set's timer instead of lowering each entry's timer separately.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 19:39:56 -08:00
Nikolay Aleksandrov
e7cfcf2c18 net: bridge: multicast: calculate idx position without changing ptr
We need to preserve the srcs pointer since we'll be passing it for EHT
handling later.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 19:39:56 -08:00
Nikolay Aleksandrov
0ad57c99e8 net: bridge: multicast: __grp_src_block_incl can modify pg
Prepare __grp_src_block_incl() for being able to cause a notification
due to changes. Currently it cannot happen, but EHT would change that
since we'll be deleting sources immediately. Make sure that if the pg is
deleted we don't return true as that would cause the caller to access
freed pg. This patch shouldn't cause any functional change.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 19:39:56 -08:00
Nikolay Aleksandrov
54bea72196 net: bridge: multicast: pass host src address to IGMPv3/MLDv2 functions
We need to pass the host address so later it can be used for explicit
host tracking. No functional change.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 19:39:56 -08:00
Nikolay Aleksandrov
9e10b9e656 net: bridge: multicast: rename src_size to addr_size
Rename src_size argument to addr_size in preparation for passing host
address as an argument to IGMPv3/MLDv2 functions.
No functional change.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 19:39:56 -08:00
Paolo Abeni
b19bc2945b mptcp: implement delegated actions
On MPTCP-level ack reception, the packet scheduler
may select a subflow other then the current one.

Prior to this commit we rely on the workqueue to trigger
action on such subflow.

This changeset introduces an infrastructure that allows
any MPTCP subflow to schedule actions (MPTCP xmit) on
others subflows without resorting to (multiple) process
reschedule.

A dummy NAPI instance is used instead. When MPTCP needs to
trigger action an a different subflow, it enqueues the target
subflow on the NAPI backlog and schedule such instance as needed.

The dummy NAPI poll method walks the sockets backlog and tries
to acquire the (BH) socket lock on each of them. If the socket
is owned by the user space, the action will be completed by
the sock release cb, otherwise push is started.

This change leverages the delegated action infrastructure
to avoid invoking the MPTCP worker to spool the pending data,
when the packet scheduler picks a subflow other then the one
currently processing the incoming MPTCP-level ack.

Additionally we further refine the subflow selection
invoking the packet scheduler for each chunk of data
even inside __mptcp_subflow_push_pending().

v1 -> v2:
 - fix possible UaF at shutdown time, resetting sock ops
   after removing the ulp context

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 19:21:02 -08:00
Paolo Abeni
40dc9416cc mptcp: schedule work for better snd subflow selection
Otherwise the packet scheduler policy will not be
enforced when pushing pending data at MPTCP-level
ack reception time.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 19:21:02 -08:00
Paolo Abeni
ec369c3a33 mptcp: do not queue excessive data on subflows
The current packet scheduler can enqueue up to sndbuf
data on each subflow. If the send buffer is large and
the subflows are not symmetric, this could lead to
suboptimal aggregate bandwidth utilization.

Limit the amount of queued data to the maximum send
window.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 19:21:02 -08:00
Paolo Abeni
5cf92bbadc mptcp: re-enable sndbuf autotune
After commit 6e628cd3a8 ("mptcp: use mptcp release_cb for
delayed tasks"), MPTCP never sets the flag bit SOCK_NOSPACE
on its subflow. As a side effect, autotune never takes place,
as it happens inside tcp_new_space(), which in turn is called
only when the mentioned bit is set.

Let's sendmsg() set the subflows NOSPACE bit when looking for
more memory and use the subflow write_space callback to propagate
the snd buf update and wake-up the user-space.

Additionally, this allows dropping a bunch of duplicate code and
makes the SNDBUF_LIMITED chrono relevant again for MPTCP subflows.

Fixes: 6e628cd3a8 ("mptcp: use mptcp release_cb for delayed tasks")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 19:21:02 -08:00
Paolo Abeni
866f26f2a9 mptcp: always graft subflow socket to parent
Currently, incoming subflows link to the parent socket,
while outgoing ones link to a per subflow socket. The latter
is not really needed, except at the initial connect() time and
for the first subflow.

Always graft the outgoing subflow to the parent socket and
free the unneeded ones early.

This allows some code cleanup, reduces the amount of memory
used and will simplify the next patch

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 19:21:02 -08:00
Yousuk Seung
e7ed11ee94 tcp: add TTL to SCM_TIMESTAMPING_OPT_STATS
This patch adds TCP_NLA_TTL to SCM_TIMESTAMPING_OPT_STATS that exports
the time-to-live or hop limit of the latest incoming packet with
SCM_TSTAMP_ACK. The value exported may not be from the packet that acks
the sequence when incoming packets are aggregated. Exporting the
time-to-live or hop limit value of incoming packets helps to estimate
the hop count of the path of the flow that may change over time.

Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20210120204155.552275-1-ysseung@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 18:20:52 -08:00
Linus Torvalds
a9034304ff A patch to zero out sensitive cryptographic data and two minor cleanups
prompted by the fact that a bunch of code was moved in this cycle.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmAK9skTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHziyGwB/9rZnaaYR6Frqc2tzE5vbVtjAxvhftn
 pGDr8laOHiK5jnKR+ljNlPAe07TSEK+qVidulX05moujKrZeIrDUJZnEpScrssZO
 o7Tm99dHziqJc10liembtSZzB3LzGJyW1hgavC5Vjo7JW+EZ+YR9x2pFKCO7Hz/M
 QlT6kQmXZLnsLB2OieAyC9Yb7IMD1wfiOHHvOZDeFpIn49Z8reFahI+dSwwK/uOv
 UouxZKKuaikSTvzp8WmTiuCpsUfBMOaDy5/pWLfBS+/116K2aieJmzSjUb2MZwDT
 cLGhzrkyeCkBeFO5vhnob3n9KqWXN03I9rPB25StcrHCRYcHa3z/D/4k
 =9COg
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.11-rc5' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A patch to zero out sensitive cryptographic data and two minor
  cleanups prompted by the fact that a bunch of code was moved in this
  cycle"

* tag 'ceph-for-5.11-rc5' of git://github.com/ceph/ceph-client:
  libceph: fix "Boolean result is used in bitwise operation" warning
  libceph, ceph: disambiguate ceph_connection_operations handlers
  libceph: zero out session key and connection secret
2021-01-22 13:47:25 -08:00
Parav Pandit
a556dded9c devlink: Support get and set state of port function
devlink port function can be in active or inactive state.
Allow users to get and set port function's state.

When the port function it activated, its operational state may change
after a while when the device is created and driver binds to it.
Similarly on deactivation flow.

To clearly describe the state of the port function and its device's
operational state in the host system, define state and opstate
attributes.

Example of a PCI SF port which supports a port function:

$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false

$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
pci/0000:08:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached

$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state inactive opstate detached

$ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88 state active

$ devlink port show pci/0000:06:00.0/32768 -jp
{
    "port": {
        "pci/0000:06:00.0/32768": {
            "type": "eth",
            "netdev": "ens2f0npf0sf88",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 0,
            "sfnum": 88,
            "external": false,
            "splittable": false,
            "function": {
                "hw_addr": "00:00:00:00:88:88",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-01-22 11:32:08 -08:00
Parav Pandit
cd76dcd68d devlink: Support add and delete devlink port
Extended devlink interface for the user to add and delete a port.
Extend devlink to connect user requests to driver to add/delete
a port in the device.

Driver routines are invoked without holding devlink instance lock.
This enables driver to perform several devlink objects registration,
unregistration such as (port, health reporter, resource etc) by using
existing devlink APIs.
This also helps to uniformly use the code for port unregistration
during driver unload and during port deletion initiated by user.

Examples of add, show and delete commands:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

$ devlink port show
pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false

$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached

$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached

$ udevadm test-builtin net_id /sys/class/net/eth6
Load module index
Parsed configuration file /usr/lib/systemd/network/99-default.link
Created link configuration context.
Using default interface naming scheme 'v245'.
ID_NET_NAMING_SCHEME=v245
ID_NET_NAME_PATH=enp6s0f0npf0sf88
ID_NET_NAME_SLOT=ens2f0npf0sf88
Unload module index
Unloaded link configuration context.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-01-22 11:32:07 -08:00
Parav Pandit
b8288837ef devlink: Introduce PCI SF port flavour and port attribute
A PCI sub-function (SF) represents a portion of the device similar
to PCI VF.

In an eswitch, PCI SF may have port which is normally represented
using a representor netdevice.
To have better visibility of eswitch port, its association with SF,
and its representor netdevice, introduce a PCI SF port flavour.

When devlink port flavour is PCI SF, fill up PCI SF attributes of the
port.

Extend port name creation using PCI PF and SF number scheme on best
effort basis, so that vendor drivers can skip defining their own
scheme.
This is done as cApfNSfM, where A, N and M are controller, PCI PF and
PCI SF number respectively.
This is similar to existing naming for PCI PF and PCI VF ports.

An example view of a PCI SF port:

$ devlink port show pci/0000:06:00.0/32768
pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
  function:
    hw_addr 00:00:00:00:88:88 state active opstate attached

$ devlink port show pci/0000:06:00.0/32768 -jp
{
    "port": {
        "pci/0000:06:00.0/32768": {
            "type": "eth",
            "netdev": "ens2f0npf0sf88",
            "flavour": "pcisf",
            "controller": 0,
            "pfnum": 0,
            "sfnum": 88,
            "splittable": false,
            "function": {
                "hw_addr": "00:00:00:00:88:88",
                "state": "active",
                "opstate": "attached"
            }
        }
    }
}

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-01-22 11:32:07 -08:00
Parav Pandit
1230d94820 devlink: Prepare code to fill multiple port function attributes
Prepare code to fill zero or more port function optional attributes.
Subsequent patch makes use of this to fill more port function
attributes.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-01-22 11:32:06 -08:00
Johannes Berg
2fe8ef1062 cfg80211: change netdev registration/unregistration semantics
We used to not require anything in terms of registering netdevs
with cfg80211, using a netdev notifier instead. However, in the
next patch reducing RTNL locking, this causes big problems, and
the simplest way is to just require drivers to do things better.

Change the registration/unregistration semantics to require the
drivers to call cfg80211_(un)register_netdevice() when this is
happening due to a cfg80211 request, i.e. add_virtual_intf() or
del_virtual_intf() (or if it somehow has to happen in any other
cfg80211 callback).

Otherwise, in other contexts, drivers may continue to use the
normal netdev (un)registration functions as usual.

Internally, we still use the netdev notifier and track (by the
new wdev->registered bool) if the wdev had already been added
to cfg80211 or not.

Link: https://lore.kernel.org/r/20210122161942.cf2f4b65e4e9.Ida8234e50da13eb675b557bac52a713ad4eddf71@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22 16:28:39 +01:00
Felix Fietkau
347c2989a8 mac80211: minstrel_ht: fix rounding error in throughput calculation
On lower data rates, the throughput calculation has a significant rounding
error, causing rates like 48M and 54M OFDM to share the same throughput
value with >= 90% success probablity.

This is because the result of the division (prob_avg * 1000) / nsecs
is really small (8 in this example).

Improve accuracy by moving over some zeroes, making better use of the full
range of u32 before the division.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20210115120242.89616-10-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22 09:11:37 +01:00
Felix Fietkau
7e2123abc5 mac80211: minstrel_ht: increase stats update interval
The shorter interval was leading to too many frames being used for probing

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20210115120242.89616-9-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22 09:11:37 +01:00
Felix Fietkau
a7fca4e403 mac80211: minstrel_ht: fix max probability rate selection
- do not select rates faster than the max throughput rate if probability is lower
- reset previous rate before sorting again

This ensures that the max prob rate gets set to a more reliable rate

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20210115120242.89616-8-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22 09:11:37 +01:00
Felix Fietkau
019c6fc278 mac80211: minstrel_ht: improve sample rate selection
Always allow sampling of rates faster than the primary max throughput rate.
When the second max_tp_rate is higher than the first one, sample attempts were
previously skipped, potentially causing rate control to get stuck at a slightly
lower rate

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20210115120242.89616-7-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22 09:11:37 +01:00
Felix Fietkau
1ae8bba9a2 mac80211: minstrel_ht: improve ampdu length estimation
If the driver does not report A-MPDU length, estimate it based on the rate.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20210115120242.89616-6-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22 09:11:37 +01:00
Felix Fietkau
eeafcb0c80 mac80211: minstrel_ht: remove old ewma based rate average code
The new noise filter has been the default for a while now with no reported
downside and significant improvement compared to the old code.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20210115120242.89616-5-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22 09:11:37 +01:00
Felix Fietkau
cbda98c710 mac80211: remove legacy minstrel rate control
Now that minstrel_ht supports legacy rates, it is no longer needed

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20210115120242.89616-4-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22 09:11:37 +01:00
Felix Fietkau
a7844a5384 mac80211: minstrel_ht: add support for OFDM rates on non-HT clients
The legacy minstrel code is essentially unmaintained and receives only very
little testing. In order to bring the significant algorithm improvements from
minstrel_ht to legacy clients, this patch adds support for OFDM rates to
minstrel_ht and removes the fallback to the legacy codepath.
This also makes it work much better on hardware with rate selection constraints,
e.g. mt76.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20210115120242.89616-3-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22 09:11:37 +01:00
Felix Fietkau
f84de06398 mac80211: minstrel_ht: clean up CCK code
- move ack overhead out of rate duration table
- remove cck_supported, cck_supported_short

Preparation for adding OFDM legacy rates support

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20210115120242.89616-2-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22 09:11:37 +01:00
Lorenzo Bianconi
e908435e40 mac80211: introduce aql_enable node in debugfs
Introduce aql_enable node in debugfs in order to enable/disable aql.
This is useful for debugging purpose.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/e7a934d5d84e4796c4f97ea5de4e66c824296b07.1610214851.git.lorenzo@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22 09:11:37 +01:00
Max Chen
d9c85e2472 cfg80211: Add phyrate conversion support for extended MCS in 60GHz band
The current phyrate conversion does not include extended MCS and provides
incorrect rates. Add a flag for extended MCS in DMG and add corresponding
phyrate table for the correct conversions using base MCS in DMG specs.

Signed-off-by: Max Chen <mxchen@codeaurora.org>
Link: https://lore.kernel.org/r/1609977050-7089-2-git-send-email-mxchen@codeaurora.org
[reduce data size, make a single WARN]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22 09:11:36 +01:00
Arend van Spriel
c27aa56a72 cfg80211: add VHT rate entries for MCS-10 and MCS-11
Observed the warning in cfg80211_calculate_bitrate_vht() using an
11ac chip reporting MCS-11. Since devices reporting non-standard
MCS-9 is already supported add similar entries for MCS-10 and MCS-11.
Actually, the value of MCS-9@20MHz is slightly off so corrected that.

Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Link: https://lore.kernel.org/r/20210105105839.3795-1-arend.vanspriel@broadcom.com
[fix array size]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22 09:11:36 +01:00
Wen Gong
7f7aa94bca mac80211: reduce peer HE MCS/NSS to own capabilities
For VHT capbility, we do intersection of MCS and NSS for peers in
mac80211, to simplify drivers.

Add this for HE as well.

Signed-off-by: Wen Gong <wgong@codeaurora.org>
Link: https://lore.kernel.org/r/1609816120-9411-3-git-send-email-wgong@codeaurora.org
[reword commit message, style cleanups, fix endian annotations]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-22 09:11:28 +01:00
Jakub Kicinski
35c715c30b Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2021-01-21

1) Fix a rare panic on SMP systems when packet reordering
   happens between anti replay check and update.
   From Shmulik Ladkani.

2) Fix disable_xfrm sysctl when used on xfrm interfaces.
   From Eyal Birger.

3) Fix a race in PF_KEY when the availability of crypto
   algorithms is set. From Cong Wang.

4) Fix a return value override in the xfrm policy selftests.
   From Po-Hsu Lin.

5) Fix an integer wraparound in xfrm_policy_addr_delta.
   From Visa Hankala.

* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  xfrm: Fix wraparound in xfrm_policy_addr_delta()
  selftests: xfrm: fix test return value override issue in xfrm_policy.sh
  af_key: relax availability checks for skb size calculation
  xfrm: fix disable_xfrm sysctl when used on xfrm interfaces
  xfrm: Fix oops in xfrm_replay_advance_bmp
====================

Link: https://lore.kernel.org/r/20210121121558.621339-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-21 11:05:10 -08:00
Ilya Dryomov
9d5ae6f3c5 libceph: fix "Boolean result is used in bitwise operation" warning
This line dates back to 2013, but cppcheck complained because commit
2f713615dd ("libceph: move msgr1 protocol implementation to its own
file") moved it.  Add parenthesis to silence the warning.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2021-01-21 16:49:59 +01:00
Wen Gong
2d5e09d058 mac80211: remove NSS number of 160MHz if not support 160MHz for HE
When it does not support 160MHz in HE phy capabilities information,
it should not treat the NSS number of 160MHz as a valid number,
otherwise the final NSS will be set to 0.

Signed-off-by: Wen Gong <wgong@codeaurora.org>
Link: https://lore.kernel.org/r/1609816120-9411-2-git-send-email-wgong@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-21 13:45:13 +01:00
Shay Bar
dcf3c8fb32 mac80211: 160MHz with extended NSS BW in CSA
Upon receiving CSA with 160MHz extended NSS BW from associated AP,
STA should set the HT operation_mode based on new_center_freq_seg1
because it is later used as ccfs2 in ieee80211_chandef_vht_oper().

Signed-off-by: Aviad Brikman <aviad.brikman@celeno.com>
Signed-off-by: Shay Bar <shay.bar@celeno.com>
Link: https://lore.kernel.org/r/20201222064714.24888-1-shay.bar@celeno.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-21 13:39:11 +01:00
Philipp Borgers
f1864e193d mac80211: add LDPC encoding to ieee80211_parse_tx_radiotap
This patch adds support for LDPC encoding to the radiotap tx parse
function. Piror to this change adding the LDPC flag to the radiotap
header did not encode frames with LDPC.

Signed-off-by: Philipp Borgers <borgers@mi.fu-berlin.de>
Link: https://lore.kernel.org/r/20201219170710.11706-1-borgers@mi.fu-berlin.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-21 13:37:39 +01:00
Felix Fietkau
80a915ec44 mac80211: add rx decapsulation offload support
This allows drivers to pass 802.3 frames to mac80211, with some restrictions:

- the skb must be passed with a valid sta
- fast-rx needs to be active for the sta
- monitor mode needs to be disabled

mac80211 will tell the driver when it is safe to enable rx decap offload for
a particular station.

In order to implement support, a driver must:

- call ieee80211_hw_set(hw, SUPPORTS_RX_DECAP_OFFLOAD)
- implement ops->sta_set_decap_offload
- mark 802.3 frames with RX_FLAG_8023

If it doesn't want to enable offload for some vif types, it can mask out
IEEE80211_OFFLOAD_DECAP_ENABLED in vif->offload_flags from within the
.add_interface or .update_vif_offload driver ops

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20201218184718.93650-6-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-21 13:34:49 +01:00
Felix Fietkau
d7b6492917 net/fq_impl: do not maintain a backlog-sorted list of flows
A sorted flow list is only needed to drop packets in the biggest flow when
hitting the overmemory condition.
By scanning flows only when needed, we can avoid paying the cost of
maintaining the list under normal conditions
In order to avoid scanning lots of empty flows and touching too many cold
cache lines, a bitmap of flows with backlog is maintained

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20201218184718.93650-3-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-21 13:33:45 +01:00
Felix Fietkau
bf9009bf21 net/fq_impl: drop get_default_func, move default flow to fq_tin
Simplifies the code and prepares for a rework of scanning for flows on
overmemory drop.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20201218184718.93650-2-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-21 13:33:31 +01:00
Xin Long
1a2367665a ip_gre: remove CRC flag from dev features in gre_gso_segment
This patch is to let it always do CRC checksum in sctp_gso_segment()
by removing CRC flag from the dev features in gre_gso_segment() for
SCTP over GRE, just as it does in Commit 527beb8ef9 ("udp: support
sctp over udp in skb_udp_tunnel_segment") for SCTP over UDP.

It could set csum/csum_start in GSO CB properly in sctp_gso_segment()
after that commit, so it would do checksum with gso_make_checksum()
in gre_gso_segment(), and Commit 622e32b7d4 ("net: gre: recompute
gre csum for sctp over gre tunnels") can be reverted now.

Note that when need_csum is false, we can still leave CRC checksum
of SCTP to HW by not clearing this CRC flag if it's supported, as
Jakub and Alex noticed.

v1->v2:
  - improve the changelog.
  - fix "rev xmas tree" in varibles declaration.
v2->v3:
  - remove CRC flag from dev features only when need_csum is true.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/00439f24d5f69e2c6fa2beadc681d056c15c258f.1610772251.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20 22:15:04 -08:00
Xin Long
4eb5d4a5b4 udp: not remove the CRC flag from dev features when need_csum is false
In __skb_udp_tunnel_segment(), when it's a SCTP over VxLAN/GENEVE
packet and need_csum is false, which means the outer udp checksum
doesn't need to be computed, csum_start and csum_offset could be
used by the inner SCTP CRC CSUM for SCTP HW CRC offload.

So this patch is to not remove the CRC flag from dev features when
need_csum is false.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/1e81b700642498546eaa3f298e023fd7ad394f85.1610776757.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20 22:14:59 -08:00
wenxu
7baf2429a1 net/sched: cls_flower add CT_FLAGS_INVALID flag support
This patch add the TCA_FLOWER_KEY_CT_FLAGS_INVALID flag to
match the ct_state with invalid for conntrack.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://lore.kernel.org/r/1611045110-682-1-git-send-email-wenxu@ucloud.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20 21:09:44 -08:00
Jakub Kicinski
0cbe1e57a7 net: inline rollback_registered_many()
Similar to the change for rollback_registered() -
rollback_registered_many() was a part of unregister_netdevice_many()
minus the net_set_todo(), which is no longer needed.

Functionally this patch moves the list_empty() check back after:

	BUG_ON(dev_boot_phase);
	ASSERT_RTNL();

but I can't find any reason why that would be an issue.

Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20 21:04:19 -08:00
Jakub Kicinski
bcfe2f1a38 net: move rollback_registered_many()
Move rollback_registered_many() and add a temporary
forward declaration to make merging the code into
unregister_netdevice_many() easier to review.

No functional changes.

Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20 21:04:19 -08:00
Jakub Kicinski
037e56bd96 net: inline rollback_registered()
rollback_registered() is a local helper, it's common for driver
code to call unregister_netdevice_queue(dev, NULL) when they
want to unregister netdevices under rtnl_lock. Inline
rollback_registered() and adjust the only remaining caller.

Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20 21:04:18 -08:00
Jakub Kicinski
2014beea7e net: move net_set_todo inside rollback_registered()
Commit 93ee31f14f ("[NET]: Fix free_netdev on register_netdev
failure.") moved net_set_todo() outside of rollback_registered()
so that rollback_registered() can be used in the failure path of
register_netdevice() but without risking a double free.

Since commit cf124db566 ("net: Fix inconsistent teardown and
release of private netdev state."), however, we have a better
way of handling that condition, since destructors don't call
free_netdev() directly.

After the change in commit c269a24ce0 ("net: make free_netdev()
more lenient with unregistering devices") we can now move
net_set_todo() back.

Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20 21:04:18 -08:00
Petr Machata
643d0878e6 nexthop: Specialize rtm_nh_policy
This policy is currently only used for creation of new next hops and new
next hop groups. Rename it accordingly and remove the two attributes that
are not valid in that context: NHA_GROUPS and NHA_MASTER.

For consistency with other policies, do not mention policy array size in
the declarator, and replace NHA_MAX for ARRAY_SIZE as appropriate.

Note that with this commit, NHA_MAX and __NHA_MAX are not used anymore.
Leave them in purely as a user API.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20 21:00:24 -08:00
Petr Machata
44551bff29 nexthop: Use a dedicated policy for nh_valid_dump_req()
This function uses the global nexthop policy, but only accepts four
particular attributes. Create a new policy that only includes the four
supported attributes, and use it. Convert the loop to a series of ifs.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20 21:00:24 -08:00
Petr Machata
60f5ad5e19 nexthop: Use a dedicated policy for nh_valid_get_del_req()
This function uses the global nexthop policy only to then bounce all
arguments except for NHA_ID. Instead, just create a new policy that
only includes the one allowed attribute.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20 21:00:24 -08:00
Jakub Kicinski
0fe2f273ab Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

drivers/net/can/dev.c
  commit 03f16c5075 ("can: dev: can_restart: fix use after free bug")
  commit 3e77f70e73 ("can: dev: move driver related infrastructure into separate subdir")

  Code move.

drivers/net/dsa/b53/b53_common.c
 commit 8e4052c32d ("net: dsa: b53: fix an off by one in checking "vlan->vid"")
 commit b7a9e0da2d ("net: switchdev: remove vid_begin -> vid_end range from VLAN objects")

 Field rename.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20 12:16:11 -08:00
Linus Torvalds
75439bc439 Networking fixes for 5.11-rc5, including fixes from bpf, wireless,
and can trees.
 
 Current release - regressions:
 
  - nfc: nci: fix the wrong NCI_CORE_INIT parameters
 
 Current release - new code bugs:
 
  - bpf: allow empty module BTFs
 
 Previous releases - regressions:
 
  - bpf: fix signed_{sub,add32}_overflows type handling
 
  - tcp: do not mess with cloned skbs in tcp_add_backlog()
 
  - bpf: prevent double bpf_prog_put call from bpf_tracing_prog_attach
 
  - bpf: don't leak memory in bpf getsockopt when optlen == 0
 
  - tcp: fix potential use-after-free due to double kfree()
 
  - mac80211: fix encryption issues with WEP
 
  - devlink: use right genl user_ptr when handling port param get/set
 
  - ipv6: set multicast flag on the multicast route
 
  - tcp: fix TCP_USER_TIMEOUT with zero window
 
 Previous releases - always broken:
 
  - bpf: local storage helpers should check nullness of owner ptr passed
 
  - mac80211: fix incorrect strlen of .write in debugfs
 
  - cls_flower: call nla_ok() before nla_next()
 
  - skbuff: back tiny skbs with kmalloc() in __netdev_alloc_skb() too
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmAIa+UACgkQMUZtbf5S
 IruZTQ/+O263ZyI0C5S1uCbHPCsAyjZyxECWDNfQ3tRzTfvldoRRP4YbC1ekSoXu
 8Y9GKDDLMI2pYkNlCqfMhrFaop8sudosntOZDSeRm/2TkkQFnkM/bxAlz++7Rnwx
 vHu1Xo2t2bKJxooSw8gLJ5iZNTbkw/M5iA3qR9kP+BG1yDP7By4P/Y4ziFphffad
 gPlfLQaU8nRVuDBYYrGIX0GoMg05IH1zt2/MxvN4ReXuex/9tq2TrU8jxHiwT2ja
 K1DHR+g2VVZf55TWrL9Yw8V5Rr+F7bxf6i+yer9hWWhENXgoTv6QkndAnTFOcoat
 VQh44GzoNoL1dAHD8kyUOOxJCyjItJJe58Evcwjnls4o+5BC2aDNQADwrSyz3sHe
 l9iNMSMEylymu7Xu+cJw2kjOq/BK6TdjaGSxwm1M2ErPehf36eJuc4FkaJz3RO55
 nkYMfm0+5rYWSsR5CTTJp8r2urCAT4SSx1iLoZknUXE6qa5AcMSNhIjGbw6pUp4q
 RDBtAKqiV0l37vdUag4Z+QgjPA0cH9E4aMQKYmD9dop20Zuzp4ug38qR32aEFC6q
 Qfb0VBMKgwu6OWjuWARbwYktVQNcoelKiGnsGnORJ5S9cyc1N4HeKEnb5Hw8ky5q
 4FBpNMfx3Ief14iNkh65KrzA+uyZBjqEG+joTSzn+9R7Lof60QA=
 =KyY7
 -----END PGP SIGNATURE-----

Merge tag 'net-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Networking fixes for 5.11-rc5, including fixes from bpf, wireless, and
  can trees.

  Current release - regressions:

   - nfc: nci: fix the wrong NCI_CORE_INIT parameters

  Current release - new code bugs:

   - bpf: allow empty module BTFs

  Previous releases - regressions:

   - bpf: fix signed_{sub,add32}_overflows type handling

   - tcp: do not mess with cloned skbs in tcp_add_backlog()

   - bpf: prevent double bpf_prog_put call from bpf_tracing_prog_attach

   - bpf: don't leak memory in bpf getsockopt when optlen == 0

   - tcp: fix potential use-after-free due to double kfree()

   - mac80211: fix encryption issues with WEP

   - devlink: use right genl user_ptr when handling port param get/set

   - ipv6: set multicast flag on the multicast route

   - tcp: fix TCP_USER_TIMEOUT with zero window

  Previous releases - always broken:

   - bpf: local storage helpers should check nullness of owner ptr passed

   - mac80211: fix incorrect strlen of .write in debugfs

   - cls_flower: call nla_ok() before nla_next()

   - skbuff: back tiny skbs with kmalloc() in __netdev_alloc_skb() too"

* tag 'net-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (52 commits)
  net: systemport: free dev before on error path
  net: usb: cdc_ncm: don't spew notifications
  net: mscc: ocelot: Fix multicast to the CPU port
  tcp: Fix potential use-after-free due to double kfree()
  bpf: Fix signed_{sub,add32}_overflows type handling
  can: peak_usb: fix use after free bugs
  can: vxcan: vxcan_xmit: fix use after free bug
  can: dev: can_restart: fix use after free bug
  tcp: fix TCP socket rehash stats mis-accounting
  net: dsa: b53: fix an off by one in checking "vlan->vid"
  tcp: do not mess with cloned skbs in tcp_add_backlog()
  selftests: net: fib_tests: remove duplicate log test
  net: nfc: nci: fix the wrong NCI_CORE_INIT parameters
  sh_eth: Fix power down vs. is_opened flag ordering
  net: Disable NETIF_F_HW_TLS_RX when RXCSUM is disabled
  netfilter: rpfilter: mask ecn bits before fib lookup
  udp: mask TOS bits in udp_v4_early_demux()
  xsk: Clear pool even for inactive queues
  bpf: Fix helper bpf_map_peek_elem_proto pointing to wrong callback
  sh_eth: Make PHY access aware of Runtime PM to fix reboot crash
  ...
2021-01-20 11:52:21 -08:00
Kuniyuki Iwashima
c89dffc70b tcp: Fix potential use-after-free due to double kfree()
Receiving ACK with a valid SYN cookie, cookie_v4_check() allocates struct
request_sock and then can allocate inet_rsk(req)->ireq_opt. After that,
tcp_v4_syn_recv_sock() allocates struct sock and copies ireq_opt to
inet_sk(sk)->inet_opt. Normally, tcp_v4_syn_recv_sock() inserts the full
socket into ehash and sets NULL to ireq_opt. Otherwise,
tcp_v4_syn_recv_sock() has to reset inet_opt by NULL and free the full
socket.

The commit 01770a1661 ("tcp: fix race condition when creating child
sockets from syncookies") added a new path, in which more than one cores
create full sockets for the same SYN cookie. Currently, the core which
loses the race frees the full socket without resetting inet_opt, resulting
in that both sock_put() and reqsk_put() call kfree() for the same memory:

  sock_put
    sk_free
      __sk_free
        sk_destruct
          __sk_destruct
            sk->sk_destruct/inet_sock_destruct
              kfree(rcu_dereference_protected(inet->inet_opt, 1));

  reqsk_put
    reqsk_free
      __reqsk_free
        req->rsk_ops->destructor/tcp_v4_reqsk_destructor
          kfree(rcu_dereference_protected(inet_rsk(req)->ireq_opt, 1));

Calling kmalloc() between the double kfree() can lead to use-after-free, so
this patch fixes it by setting NULL to inet_opt before sock_put().

As a side note, this kind of issue does not happen for IPv6. This is
because tcp_v6_syn_recv_sock() clones both ipv6_opt and pktopts which
correspond to ireq_opt in IPv4.

Fixes: 01770a1661 ("tcp: fix race condition when creating child sockets from syncookies")
CC: Ricardo Dias <rdias@singlestore.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210118055920.82516-1-kuniyu@amazon.co.jp
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20 08:56:16 -08:00
Jakub Kicinski
b3741b4388 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2021-01-20

1) Fix wrong bpf_map_peek_elem_proto helper callback, from Mircea Cirjaliu.

2) Fix signed_{sub,add32}_overflows type truncation, from Daniel Borkmann.

3) Fix AF_XDP to also clear pools for inactive queues, from Maxim Mikityanskiy.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Fix signed_{sub,add32}_overflows type handling
  xsk: Clear pool even for inactive queues
  bpf: Fix helper bpf_map_peek_elem_proto pointing to wrong callback
====================

Link: https://lore.kernel.org/r/20210120163439.8160-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20 08:47:35 -08:00
Yuchung Cheng
9c30ae8398 tcp: fix TCP socket rehash stats mis-accounting
The previous commit 32efcc06d2 ("tcp: export count for rehash attempts")
would mis-account rehashing SNMP and socket stats:

  a. During handshake of an active open, only counts the first
     SYN timeout

  b. After handshake of passive and active open, stop updating
     after (roughly) TCP_RETRIES1 recurring RTOs

  c. After the socket aborts, over count timeout_rehash by 1

This patch fixes this by checking the rehash result from sk_rethink_txhash.

Fixes: 32efcc06d2 ("tcp: export count for rehash attempts")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20210119192619.1848270-1-ycheng@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19 19:47:20 -08:00
Paolo Abeni
00b229f762 net: fix GSO for SG-enabled devices
The commit dbd50f238d ("net: move the hsize check to the else
block in skb_segment") introduced a data corruption for devices
supporting scatter-gather.

The problem boils down to signed/unsigned comparison given
unexpected results: if signed 'hsize' is negative, it will be
considered greater than a positive 'len', which is unsigned.

This commit addresses resorting to the old checks order, so that
'hsize' never has a negative value when compared with 'len'.

v1 -> v2:
 - reorder hsize checks instead of explicit cast (Alex)

Bisected-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Fixes: dbd50f238d ("net: move the hsize check to the else block in skb_segment")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/861947c2d2d087db82af93c21920ce8147d15490.1611074818.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19 18:40:40 -08:00
Eric Dumazet
b160c28548 tcp: do not mess with cloned skbs in tcp_add_backlog()
Heiner Kallweit reported that some skbs were sent with
the following invalid GSO properties :
- gso_size > 0
- gso_type == 0

This was triggerring a WARN_ON_ONCE() in rtl8169_tso_csum_v2.

Juerg Haefliger was able to reproduce a similar issue using
a lan78xx NIC and a workload mixing TCP incoming traffic
and forwarded packets.

The problem is that tcp_add_backlog() is writing
over gso_segs and gso_size even if the incoming packet will not
be coalesced to the backlog tail packet.

While skb_try_coalesce() would bail out if tail packet is cloned,
this overwriting would lead to corruptions of other packets
cooked by lan78xx, sharing a common super-packet.

The strategy used by lan78xx is to use a big skb, and split
it into all received packets using skb_clone() to avoid copies.
The drawback of this strategy is that all the small skb share a common
struct skb_shared_info.

This patch rewrites TCP gso_size/gso_segs handling to only
happen on the tail skb, since skb_try_coalesce() made sure
it was not cloned.

Fixes: 4f693b55c3 ("tcp: implement coalescing on backlog queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Bisected-by: Juerg Haefliger <juergh@canonical.com>
Tested-by: Juerg Haefliger <juergh@canonical.com>
Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=209423
Link: https://lore.kernel.org/r/20210119164900.766957-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19 17:57:59 -08:00
Jiapeng Zhong
0deee7aa23 taprio: boolean values to a bool variable
Fix the following coccicheck warnings:

./net/sched/sch_taprio.c:393:3-16: WARNING: Assignment of 0/1 to bool
variable.

./net/sched/sch_taprio.c:375:2-15: WARNING: Assignment of 0/1 to bool
variable.

./net/sched/sch_taprio.c:244:4-19: WARNING: Assignment of 0/1 to bool
variable.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Zhong <abaci-bugfix@linux.alibaba.com>
Link: https://lore.kernel.org/r/1610958662-71166-1-git-send-email-abaci-bugfix@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19 17:43:48 -08:00
Bongsu Jeon
4964e5a1e0 net: nfc: nci: fix the wrong NCI_CORE_INIT parameters
Fix the code because NCI_CORE_INIT_CMD includes two parameters in NCI2.0
but there is no parameters in NCI1.x.

Fixes: bcd684aace ("net/nfc/nci: Support NCI 2.x initial sequence")
Signed-off-by: Bongsu Jeon <bongsu.jeon@samsung.com>
Link: https://lore.kernel.org/r/20210118205522.317087-1-bongsu.jeon@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19 16:49:28 -08:00
Tariq Toukan
a3eb4e9d4c net: Disable NETIF_F_HW_TLS_RX when RXCSUM is disabled
With NETIF_F_HW_TLS_RX packets are decrypted in HW. This cannot be
logically done when RXCSUM offload is off.

Fixes: 14136564c8 ("net: Add TLS RX offload feature")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Link: https://lore.kernel.org/r/20210117151538.9411-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19 15:58:05 -08:00
Xin Long
fa82117010 net: add inline function skb_csum_is_sctp
This patch is to define a inline function skb_csum_is_sctp(), and
also replace all places where it checks if it's a SCTP CSUM skb.
This function would be used later in many networking drivers in
the following patches.

Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19 14:31:25 -08:00
Guillaume Nault
2e5a6266fb netfilter: rpfilter: mask ecn bits before fib lookup
RT_TOS() only masks one of the two ECN bits. Therefore rpfilter_mt()
treats Not-ECT or ECT(1) packets in a different way than those with
ECT(0) or CE.

Reproducer:

  Create two netns, connected with a veth:
  $ ip netns add ns0
  $ ip netns add ns1
  $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
  $ ip -netns ns0 link set dev veth01 up
  $ ip -netns ns1 link set dev veth10 up
  $ ip -netns ns0 address add 192.0.2.10/32 dev veth01
  $ ip -netns ns1 address add 192.0.2.11/32 dev veth10

  Add a route to ns1 in ns0:
  $ ip -netns ns0 route add 192.0.2.11/32 dev veth01

  In ns1, only packets with TOS 4 can be routed to ns0:
  $ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10

  Ping from ns0 to ns1 works regardless of the ECN bits, as long as TOS
  is 4:
  $ ip netns exec ns0 ping -Q 4 192.0.2.11   # TOS 4, Not-ECT
    ... 0% packet loss ...
  $ ip netns exec ns0 ping -Q 5 192.0.2.11   # TOS 4, ECT(1)
    ... 0% packet loss ...
  $ ip netns exec ns0 ping -Q 6 192.0.2.11   # TOS 4, ECT(0)
    ... 0% packet loss ...
  $ ip netns exec ns0 ping -Q 7 192.0.2.11   # TOS 4, CE
    ... 0% packet loss ...

  Now use iptable's rpfilter module in ns1:
  $ ip netns exec ns1 iptables-legacy -t raw -A PREROUTING -m rpfilter --invert -j DROP

  Not-ECT and ECT(1) packets still pass:
  $ ip netns exec ns0 ping -Q 4 192.0.2.11   # TOS 4, Not-ECT
    ... 0% packet loss ...
  $ ip netns exec ns0 ping -Q 5 192.0.2.11   # TOS 4, ECT(1)
    ... 0% packet loss ...

  But ECT(0) and ECN packets are dropped:
  $ ip netns exec ns0 ping -Q 6 192.0.2.11   # TOS 4, ECT(0)
    ... 100% packet loss ...
  $ ip netns exec ns0 ping -Q 7 192.0.2.11   # TOS 4, CE
    ... 100% packet loss ...

After this patch, rpfilter doesn't drop ECT(0) and CE packets anymore.

Fixes: 8f97339d3f ("netfilter: add ipv4 reverse path filter match")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19 13:54:30 -08:00
Guillaume Nault
8d2b51b008 udp: mask TOS bits in udp_v4_early_demux()
udp_v4_early_demux() is the only function that calls
ip_mc_validate_source() with a TOS that hasn't been masked with
IPTOS_RT_MASK.

This results in different behaviours for incoming multicast UDPv4
packets, depending on if ip_mc_validate_source() is called from the
early-demux path (udp_v4_early_demux) or from the regular input path
(ip_route_input_noref).

ECN would normally not be used with UDP multicast packets, so the
practical consequences should be limited on that side. However,
IPTOS_RT_MASK is used to also masks the TOS' high order bits, to align
with the non-early-demux path behaviour.

Reproducer:

  Setup two netns, connected with veth:
  $ ip netns add ns0
  $ ip netns add ns1
  $ ip -netns ns0 link set dev lo up
  $ ip -netns ns1 link set dev lo up
  $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
  $ ip -netns ns0 link set dev veth01 up
  $ ip -netns ns1 link set dev veth10 up
  $ ip -netns ns0 address add 192.0.2.10 peer 192.0.2.11/32 dev veth01
  $ ip -netns ns1 address add 192.0.2.11 peer 192.0.2.10/32 dev veth10

  In ns0, add route to multicast address 224.0.2.0/24 using source
  address 198.51.100.10:
  $ ip -netns ns0 address add 198.51.100.10/32 dev lo
  $ ip -netns ns0 route add 224.0.2.0/24 dev veth01 src 198.51.100.10

  In ns1, define route to 198.51.100.10, only for packets with TOS 4:
  $ ip -netns ns1 route add 198.51.100.10/32 tos 4 dev veth10

  Also activate rp_filter in ns1, so that incoming packets not matching
  the above route get dropped:
  $ ip netns exec ns1 sysctl -wq net.ipv4.conf.veth10.rp_filter=1

  Now try to receive packets on 224.0.2.11:
  $ ip netns exec ns1 socat UDP-RECVFROM:1111,ip-add-membership=224.0.2.11:veth10,ignoreeof -

  In ns0, send packet to 224.0.2.11 with TOS 4 and ECT(0) (that is,
  tos 6 for socat):
  $ echo test0 | ip netns exec ns0 socat - UDP-DATAGRAM:224.0.2.11:1111,bind=:1111,tos=6

  The "test0" message is properly received by socat in ns1, because
  early-demux has no cached dst to use, so source address validation
  is done by ip_route_input_mc(), which receives a TOS that has the
  ECN bits masked.

  Now send another packet to 224.0.2.11, still with TOS 4 and ECT(0):
  $ echo test1 | ip netns exec ns0 socat - UDP-DATAGRAM:224.0.2.11:1111,bind=:1111,tos=6

  The "test1" message isn't received by socat in ns1, because, now,
  early-demux has a cached dst to use and calls ip_mc_validate_source()
  immediately, without masking the ECN bits.

Fixes: bc044e8db7 ("udp: perform source validation for mcast early demux")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19 13:54:30 -08:00
Maxim Mikityanskiy
b425e24a93 xsk: Clear pool even for inactive queues
The number of queues can change by other means, rather than ethtool. For
example, attaching an mqprio qdisc with num_tc > 1 leads to creating
multiple sets of TX queues, which may be then destroyed when mqprio is
deleted. If an AF_XDP socket is created while mqprio is active,
dev->_tx[queue_id].pool will be filled, but then real_num_tx_queues may
decrease with deletion of mqprio, which will mean that the pool won't be
NULLed, and a further increase of the number of TX queues may expose a
dangling pointer.

To avoid any potential misbehavior, this commit clears pool for RX and
TX queues, regardless of real_num_*_queues, still taking into
consideration num_*_queues to avoid overflows.

Fixes: 1c1efc2af1 ("xsk: Create and free buffer pool independently from umem")
Fixes: a41b4f3c58 ("xsk: simplify xdp_clear_umem_at_qid implementation")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20210118160333.333439-1-maximmi@mellanox.com
2021-01-19 22:47:04 +01:00
Linus Torvalds
f419f031de Fixes:
- Avoid exposing parent of root directory in NFSv3 READDIRPLUS results
 - Fix a tracepoint change that went in the initial 5.11 merge
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAl//AVMACgkQM2qzM29m
 f5dTWg//c2prRAhE1V9fwIDczJn8MM0tXljoWWgSpuslbd8Bgv0Ss8mitvr4B3pO
 JhzBdWcTb2/3j2D52LbLOGjr0z6BCvXX1Gp0QUnC96lhaNBC5aby309xQpSkhPbQ
 j2jw3CImGbiH7YY2BGsjcUx5mfpIkJMbg7rPSSOVHufIiUZLCg98Y3JJJKrJk+78
 qGFgaAHqBLzLK96F7Sz9q8du5lsiCbpLgx+qWjpaJEfJ0XWbEe2jA/uakrb1OzoD
 OkpG8RjZiJFAhWGdnR8y7eJQ7FyIi8h7BYAr4AlE97YZRZdjDqyummshJkKKVG2f
 5u4B225cKkcmVfLQem7Ym+nVFneR7/WLy00O12v08d0s54RLDp4xjdKplgLnHdwB
 AJg+l6K/AN24UtyE1OUIuOKJsLZd+DSANYNzZrCjeF8o6LKsKSUGrRtRbNVmtyJH
 qBYXR3gXrNt9lWYU+i/4OfJIVfksWjjyRk2/ww83INi5KxixuL0w8BcMpaTC1qQg
 ds+rmvosLvtfnY2k0wdScYbQZHoFvf+qJHRDhOVq4lWgpooExOMXKUry6k5AVOd4
 EchDX870Qe6wc4uT8xafmizD6hdJXCDN0rTGTuGnMoksoBZ7uCCsyyztbfNGiFMC
 i+0wCIWkHU3LgfHQMmTJ3J6e8mgTWPD3pTOJU5xoizQnTHGoTho=
 =Qf5O
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Avoid exposing parent of root directory in NFSv3 READDIRPLUS results

 - Fix a tracepoint change that went in the initial 5.11 merge

* tag 'nfsd-5.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  SUNRPC: Move the svc_xdr_recvfrom tracepoint again
  nfsd4: readdirplus shouldn't return parent of export
2021-01-19 13:01:50 -08:00
Oleksandr Mazur
7e238de828 net: core: devlink: use right genl user_ptr when handling port param get/set
Fix incorrect user_ptr dereferencing when handling port param get/set:

    idx [0] stores the 'struct devlink' pointer;
    idx [1] stores the 'struct devlink_port' pointer;

Fixes: 637989b5d7 ("devlink: Always use user_ptr[0] for devlink and simplify post_doit")
CC: Parav Pandit <parav@mellanox.com>
Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu>
Signed-off-by: Vadym Kochan <vadym.kochan@plvision.eu>
Link: https://lore.kernel.org/r/20210119085333.16833-1-vadym.kochan@plvision.eu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19 11:45:41 -08:00
Tariq Toukan
4e5a733290 net/tls: Except bond interface from some TLS checks
In the tls_dev_event handler, ignore tlsdev_ops requirement for bond
interfaces, they do not exist as the interaction is done directly with
the lower device.

Also, make the validate function pass when it's called with the upper
bond interface.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-18 20:48:40 -08:00
Tariq Toukan
153cbd137f net/tls: Device offload to use lowest netdevice in chain
Do not call the tls_dev_ops of upper devices. Instead, ask them
for the proper lowest device and communicate with it directly.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-18 20:48:40 -08:00
Tariq Toukan
719a402cf6 net: netdevice: Add operation ndo_sk_get_lower_dev
ndo_sk_get_lower_dev returns the lower netdev that corresponds to
a given socket.
Additionally, we implement a helper netdev_sk_get_lowest_dev() to get
the lowest one in chain.

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-18 20:48:39 -08:00
Cong Wang
d349f99768 net_sched: fix RTNL deadlock again caused by request_module()
tcf_action_init_1() loads tc action modules automatically with
request_module() after parsing the tc action names, and it drops RTNL
lock and re-holds it before and after request_module(). This causes a
lot of troubles, as discovered by syzbot, because we can be in the
middle of batch initializations when we create an array of tc actions.

One of the problem is deadlock:

CPU 0					CPU 1
rtnl_lock();
for (...) {
  tcf_action_init_1();
    -> rtnl_unlock();
    -> request_module();
				rtnl_lock();
				for (...) {
				  tcf_action_init_1();
				    -> tcf_idr_check_alloc();
				   // Insert one action into idr,
				   // but it is not committed until
				   // tcf_idr_insert_many(), then drop
				   // the RTNL lock in the _next_
				   // iteration
				   -> rtnl_unlock();
    -> rtnl_lock();
    -> a_o->init();
      -> tcf_idr_check_alloc();
      // Now waiting for the same index
      // to be committed
				    -> request_module();
				    -> rtnl_lock()
				    // Now waiting for RTNL lock
				}
				rtnl_unlock();
}
rtnl_unlock();

This is not easy to solve, we can move the request_module() before
this loop and pre-load all the modules we need for this netlink
message and then do the rest initializations. So the loop breaks down
to two now:

        for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
                struct tc_action_ops *a_o;

                a_o = tc_action_load_ops(name, tb[i]...);
                ops[i - 1] = a_o;
        }

        for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
                act = tcf_action_init_1(ops[i - 1]...);
        }

Although this looks serious, it only has been reported by syzbot, so it
seems hard to trigger this by humans. And given the size of this patch,
I'd suggest to make it to net-next and not to backport to stable.

This patch has been tested by syzbot and tested with tdc.py by me.

Fixes: 0fedc63fad ("net_sched: commit action insertions together")
Reported-and-tested-by: syzbot+82752bc5331601cf4899@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+b3b63b6bff456bd95294@syzkaller.appspotmail.com
Reported-by: syzbot+ba67b12b1ca729912834@syzkaller.appspotmail.com
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://lore.kernel.org/r/20210117005657.14810-1-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-18 20:13:55 -08:00
Enke Chen
9d9b1ee0b2 tcp: fix TCP_USER_TIMEOUT with zero window
The TCP session does not terminate with TCP_USER_TIMEOUT when data
remain untransmitted due to zero window.

The number of unanswered zero-window probes (tcp_probes_out) is
reset to zero with incoming acks irrespective of the window size,
as described in tcp_probe_timer():

    RFC 1122 4.2.2.17 requires the sender to stay open indefinitely
    as long as the receiver continues to respond probes. We support
    this by default and reset icsk_probes_out with incoming ACKs.

This counter, however, is the wrong one to be used in calculating the
duration that the window remains closed and data remain untransmitted.
Thanks to Jonathan Maxwell <jmaxwell37@gmail.com> for diagnosing the
actual issue.

In this patch a new timestamp is introduced for the socket in order to
track the elapsed time for the zero-window probes that have not been
answered with any non-zero window ack.

Fixes: 9721e709fa ("tcp: simplify window probe aborting on USER_TIMEOUT")
Reported-by: William McCall <william.mccall@gmail.com>
Co-developed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Enke Chen <enchen@paloaltonetworks.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210115223058.GA39267@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-18 19:59:17 -08:00
Matteo Croce
ceed9038b2 ipv6: set multicast flag on the multicast route
The multicast route ff00::/8 is created with type RTN_UNICAST:

  $ ip -6 -d route
  unicast ::1 dev lo proto kernel scope global metric 256 pref medium
  unicast fe80::/64 dev eth0 proto kernel scope global metric 256 pref medium
  unicast ff00::/8 dev eth0 proto kernel scope global metric 256 pref medium

Set the type to RTN_MULTICAST which is more appropriate.

Fixes: e8478e80e5 ("net/ipv6: Save route type in rt6_info")
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-18 19:52:37 -08:00
Matteo Croce
a826b04303 ipv6: create multicast route with RTPROT_KERNEL
The ff00::/8 multicast route is created without specifying the fc_protocol
field, so the default RTPROT_BOOT value is used:

  $ ip -6 -d route
  unicast ::1 dev lo proto kernel scope global metric 256 pref medium
  unicast fe80::/64 dev eth0 proto kernel scope global metric 256 pref medium
  unicast ff00::/8 dev eth0 proto boot scope global metric 256 pref medium

As the documentation says, this value identifies routes installed during
boot, but the route is created when interface is set up.
Change the value to RTPROT_KERNEL which is a better value.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-18 19:52:02 -08:00
Menglong Dong
a98c0c4742 net: bridge: check vlan with eth_type_vlan() method
Replace some checks for ETH_P_8021Q and ETH_P_8021AD with
eth_type_vlan().

Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20210117080950.122761-1-dong.menglong@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-18 14:27:33 -08:00
Jakub Kicinski
bde2c0af61 Various fixes:
* kernel-doc parsing fixes
  * incorrect debugfs string checks
  * locking fix in regulatory
  * some encryption-related fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAmAF80wACgkQB8qZga/f
 l8RClw//TFzAO2PoGZk7+xWkzRFM7YIZuinHRVeAxaehwVM+9cLnL9YrC90qNX+J
 GwxTsZa5JjewQMrKPoBu+5TNRAqMu0Nf4t1hT1TfPLQKLrOtYfKui2PVUkG3Iqii
 6EtizjtmHS2UelLS+zMjpqG8NKD4hE6G0oxp/K8IEh5WygEvQhggi/6f5Ld9O0kx
 A1PAWrzDOAOMGZtY7IyhqDvwaTHJ2nMFkhsiZPXGbCUKT+xKFefmKRLsiqFXo3of
 ld3nQ3L1BgeLbqAxR7a3zDbRIfNVeZJvvwCtA7T3Gcuy0syqGfguKoGMSlkO6IAu
 aUlpSZaYSGcxGCWiWjTC1MIO2Sx6+Ug4dw+mDv2fEubA1d651yFqqFC9M95FOo5b
 4jCyaw9bG/0ceHChw71tpAdDgqCGeu3jw92SuVpjIzRqdRrLvQ1kxw+FKaWF++wH
 QgAKF7l+WsYJvFkJQhf/eGlhFk6K4Ez4T3/053Exq3OfHcYgPWdTxQcAZJ36GGl1
 kM01dki5j6YS0GMYo9RQIyag/yH72qv4fKK4hYtp7Mu/5W5J3lDV0fZzM5UOhc4+
 x8UPVbqLEUaRXHIJRks0KoRBWwr/NLb/w60xqPQIM1hbmImBfqLYLkmvOhTMJ8Dn
 1WsC2pevDSIjjto9lUCmB4/jHjhpkQ4c/m9bg4wE9Qd8Ha8aT/Q=
 =ea27
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-net-2021-01-18.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Various fixes:
 * kernel-doc parsing fixes
 * incorrect debugfs string checks
 * locking fix in regulatory
 * some encryption-related fixes

* tag 'mac80211-for-net-2021-01-18.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211:
  mac80211: check if atf has been disabled in __ieee80211_schedule_txq
  mac80211: do not drop tx nulldata packets on encrypted links
  mac80211: fix encryption key selection for 802.3 xmit
  mac80211: fix fast-rx encryption check
  mac80211: fix incorrect strlen of .write in debugfs
  cfg80211: fix a kerneldoc markup
  cfg80211: Save the regulatory domain with a lock
  cfg80211/mac80211: fix kernel-doc for SAR APIs
====================

Link: https://lore.kernel.org/r/20210118204750.7243-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-18 14:23:58 -08:00
Xin Long
1fef8544bf sctp: remove the NETIF_F_SG flag before calling skb_segment
It makes more sense to clear NETIF_F_SG instead of set it when
calling skb_segment() in sctp_gso_segment(), since SCTP GSO is
using head_skb's fraglist, of which all frags are linear skbs.

This will make SCTP GSO code more understandable.

Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-16 19:05:59 -08:00
Xin Long
dbd50f238d net: move the hsize check to the else block in skb_segment
After commit 89319d3801 ("net: Add frag_list support to skb_segment"),
it goes to process frag_list when !hsize in skb_segment(). However, when
using skb frag_list, sg normally should not be set. In this case, hsize
will be set with len right before !hsize check, then it won't go to
frag_list processing code.

So the right thing to do is move the hsize check to the else block, so
that it won't affect the !hsize check for frag_list processing.

v1->v2:
  - change to do "hsize <= 0" check instead of "!hsize", and also move
    "hsize < 0" into else block, to save some cycles, as Alex suggested.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-16 19:05:59 -08:00
Alexander Lobakin
66c556025d skbuff: back tiny skbs with kmalloc() in __netdev_alloc_skb() too
Commit 3226b158e6 ("net: avoid 32 x truesize under-estimation for
tiny skbs") ensured that skbs with data size lower than 1025 bytes
will be kmalloc'ed to avoid excessive page cache fragmentation and
memory consumption.
However, the fix adressed only __napi_alloc_skb() (primarily for
virtio_net and napi_get_frags()), but the issue can still be achieved
through __netdev_alloc_skb(), which is still used by several drivers.
Drivers often allocate a tiny skb for headers and place the rest of
the frame to frags (so-called copybreak).
Mirror the condition to __netdev_alloc_skb() to handle this case too.

Since v1 [0]:
 - fix "Fixes:" tag;
 - refine commit message (mention copybreak usecase).

[0] https://lore.kernel.org/netdev/20210114235423.232737-1-alobakin@pm.me

Fixes: a1c7fff7e1 ("net: netdev_alloc_skb() use build_skb()")
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Link: https://lore.kernel.org/r/20210115150354.85967-1-alobakin@pm.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-16 19:02:04 -08:00
Pablo Neira Ayuso
ce5379963b netfilter: nft_dynset: dump expressions when set definition contains no expressions
If the set definition provides no stateful expressions, then include the
stateful expression in the ruleset listing. Without this fix, the dynset
rule listing shows the stateful expressions provided by the set
definition.

Fixes: 65038428b2 ("netfilter: nf_tables: allow to specify stateful expression in set definition")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-01-16 19:54:42 +01:00
Pablo Neira Ayuso
0c5b7a501e netfilter: nft_dynset: add timeout extension to template
Otherwise, the newly create element shows no timeout when listing the
ruleset. If the set definition does not specify a default timeout, then
the set element only shows the expiration time, but not the timeout.
This is a problem when restoring a stateful ruleset listing since it
skips the timeout policy entirely.

Fixes: 22fe54d5fe ("netfilter: nf_tables: add support for dynamic set updates")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-01-16 19:20:15 +01:00
Pablo Neira Ayuso
fca05d4d61 netfilter: nft_dynset: honor stateful expressions in set definition
If the set definition contains stateful expressions, allocate them for
the newly added entries from the packet path.

Fixes: 65038428b2 ("netfilter: nf_tables: allow to specify stateful expression in set definition")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-01-16 18:57:26 +01:00
Yejune Deng
f4d133d86a tcp_cubic: use memset and offsetof init
In bictcp_reset(), use memset and offsetof instead of = 0.

Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
Link: https://lore.kernel.org/r/1610597696-128610-1-git-send-email-yejune.deng@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15 20:22:16 -08:00
Geliang Tang
32d91b4af3 nfc: netlink: use &w->w in nfc_genl_rcv_nl_event
Use the struct member w of the struct urelease_work directly instead of
casting it.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Link: https://lore.kernel.org/r/f0ed86d6d54ac0834bd2e161d172bf7bb5647cf7.1610683862.git.geliangtang@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15 20:08:28 -08:00
Vladimir Oltean
2a6ef76303 net: dsa: add ops for devlink-sb
Switches that care about QoS might have hardware support for reserving
buffer pools for individual ports or traffic classes, and configuring
their sizes and thresholds. Through devlink-sb (shared buffers), this is
all configurable, as well as their occupancy being viewable.

Add the plumbing in DSA for these operations.

Individual drivers still need to call devlink_sb_register() with the
shared buffers they want to expose. A helper was not created in DSA for
this purpose (unlike, say, dsa_devlink_params_register), since in my
opinion it does not bring any benefit over plainly calling
devlink_sb_register() directly.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15 20:02:34 -08:00
Eric Dumazet
bcd0cf19ef net_sched: avoid shift-out-of-bounds in tcindex_set_parms()
tc_index being 16bit wide, we need to check that TCA_TCINDEX_SHIFT
attribute is not silly.

UBSAN: shift-out-of-bounds in net/sched/cls_tcindex.c:260:29
shift exponent 255 is too large for 32-bit type 'int'
CPU: 0 PID: 8516 Comm: syz-executor228 Not tainted 5.10.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x107/0x163 lib/dump_stack.c:120
 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395
 valid_perfect_hash net/sched/cls_tcindex.c:260 [inline]
 tcindex_set_parms.cold+0x1b/0x215 net/sched/cls_tcindex.c:425
 tcindex_change+0x232/0x340 net/sched/cls_tcindex.c:546
 tc_new_tfilter+0x13fb/0x21b0 net/sched/cls_api.c:2127
 rtnetlink_rcv_msg+0x8b6/0xb80 net/core/rtnetlink.c:5555
 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0x907/0xe40 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:672
 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2336
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2390
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2423
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Link: https://lore.kernel.org/r/20210114185229.1742255-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15 18:11:10 -08:00
Eric Dumazet
dd5e073381 net_sched: gen_estimator: support large ewma log
syzbot report reminded us that very big ewma_log were supported in the past,
even if they made litle sense.

tc qdisc replace dev xxx root est 1sec 131072sec ...

While fixing the bug, also add boundary checks for ewma_log, in line
with range supported by iproute2.

UBSAN: shift-out-of-bounds in net/core/gen_estimator.c:83:38
shift exponent -1 is negative
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x107/0x163 lib/dump_stack.c:120
 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395
 est_timer.cold+0xbb/0x12d net/core/gen_estimator.c:83
 call_timer_fn+0x1a5/0x710 kernel/time/timer.c:1417
 expire_timers kernel/time/timer.c:1462 [inline]
 __run_timers.part.0+0x692/0xa80 kernel/time/timer.c:1731
 __run_timers kernel/time/timer.c:1712 [inline]
 run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1744
 __do_softirq+0x2bc/0xa77 kernel/softirq.c:343
 asm_call_irq_on_stack+0xf/0x20
 </IRQ>
 __run_on_irqstack arch/x86/include/asm/irq_stack.h:26 [inline]
 run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:77 [inline]
 do_softirq_own_stack+0xaa/0xd0 arch/x86/kernel/irq_64.c:77
 invoke_softirq kernel/softirq.c:226 [inline]
 __irq_exit_rcu+0x17f/0x200 kernel/softirq.c:420
 irq_exit_rcu+0x5/0x20 kernel/softirq.c:432
 sysvec_apic_timer_interrupt+0x4d/0x100 arch/x86/kernel/apic/apic.c:1096
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:628
RIP: 0010:native_save_fl arch/x86/include/asm/irqflags.h:29 [inline]
RIP: 0010:arch_local_save_flags arch/x86/include/asm/irqflags.h:79 [inline]
RIP: 0010:arch_irqs_disabled arch/x86/include/asm/irqflags.h:169 [inline]
RIP: 0010:acpi_safe_halt drivers/acpi/processor_idle.c:111 [inline]
RIP: 0010:acpi_idle_do_entry+0x1c9/0x250 drivers/acpi/processor_idle.c:516

Fixes: 1c0d32fde5 ("net_sched: gen_estimator: complete rewrite of rate estimators")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Link: https://lore.kernel.org/r/20210114181929.1717985-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15 18:11:06 -08:00
Eric Dumazet
e4bedf48aa net_sched: reject silly cell_log in qdisc_get_rtab()
iproute2 probably never goes beyond 8 for the cell exponent,
but stick to the max shift exponent for signed 32bit.

UBSAN reported:
UBSAN: shift-out-of-bounds in net/sched/sch_api.c:389:22
shift exponent 130 is too large for 32-bit type 'int'
CPU: 1 PID: 8450 Comm: syz-executor586 Not tainted 5.11.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x183/0x22e lib/dump_stack.c:120
 ubsan_epilogue lib/ubsan.c:148 [inline]
 __ubsan_handle_shift_out_of_bounds+0x432/0x4d0 lib/ubsan.c:395
 __detect_linklayer+0x2a9/0x330 net/sched/sch_api.c:389
 qdisc_get_rtab+0x2b5/0x410 net/sched/sch_api.c:435
 cbq_init+0x28f/0x12c0 net/sched/sch_cbq.c:1180
 qdisc_create+0x801/0x1470 net/sched/sch_api.c:1246
 tc_modify_qdisc+0x9e3/0x1fc0 net/sched/sch_api.c:1662
 rtnetlink_rcv_msg+0xb1d/0xe60 net/core/rtnetlink.c:5564
 netlink_rcv_skb+0x1f0/0x460 net/netlink/af_netlink.c:2494
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x7de/0x9b0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0xaa6/0xe90 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg net/socket.c:672 [inline]
 ____sys_sendmsg+0x5a2/0x900 net/socket.c:2345
 ___sys_sendmsg net/socket.c:2399 [inline]
 __sys_sendmsg+0x319/0x400 net/socket.c:2432
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/r/20210114160637.1660597-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15 18:06:45 -08:00
Jakub Kicinski
2d9116be76 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2021-01-16

1) Extend atomic operations to the BPF instruction set along with x86-64 JIT support,
   that is, atomic{,64}_{xchg,cmpxchg,fetch_{add,and,or,xor}}, from Brendan Jackman.

2) Add support for using kernel module global variables (__ksym externs in BPF
   programs) retrieved via module's BTF, from Andrii Nakryiko.

3) Generalize BPF stackmap's buildid retrieval and add support to have buildid
   stored in mmap2 event for perf, from Jiri Olsa.

4) Various fixes for cross-building BPF sefltests out-of-tree which then will
   unblock wider automated testing on ARM hardware, from Jean-Philippe Brucker.

5) Allow to retrieve SOL_SOCKET opts from sock_addr progs, from Daniel Borkmann.

6) Clean up driver's XDP buffer init and split into two helpers to init per-
   descriptor and non-changing fields during processing, from Lorenzo Bianconi.

7) Minor misc improvements to libbpf & bpftool, from Ian Rogers.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (41 commits)
  perf: Add build id data in mmap2 event
  bpf: Add size arg to build_id_parse function
  bpf: Move stack_map_get_build_id into lib
  bpf: Document new atomic instructions
  bpf: Add tests for new BPF atomic operations
  bpf: Add bitwise atomic instructions
  bpf: Pull out a macro for interpreting atomic ALU operations
  bpf: Add instructions for atomic_[cmp]xchg
  bpf: Add BPF_FETCH field / create atomic_fetch_add instruction
  bpf: Move BPF_STX reserved field check into BPF_STX verifier code
  bpf: Rename BPF_XADD and prepare to encode other atomics in .imm
  bpf: x86: Factor out a lookup table for some ALU opcodes
  bpf: x86: Factor out emission of REX byte
  bpf: x86: Factor out emission of ModR/M for *(reg + off)
  tools/bpftool: Add -Wall when building BPF programs
  bpf, libbpf: Avoid unused function warning on bpf_tail_call_static
  selftests/bpf: Install btf_dump test cases
  selftests/bpf: Fix installation of urandom_read
  selftests/bpf: Move generated test files to $(TEST_GEN_FILES)
  selftests/bpf: Fix out-of-tree build
  ...
====================

Link: https://lore.kernel.org/r/20210116012922.17823-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15 17:57:26 -08:00
Tom Rix
e794e7fa19 neighbor: remove definition of DEBUG
Defining DEBUG should only be done in development.
So remove DEBUG.

Signed-off-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/20210114212917.48174-1-trix@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15 17:51:18 -08:00
Vladimir Oltean
0ee2af4ebb net: dsa: set configure_vlan_while_not_filtering to true by default
As explained in commit 54a0ed0df4 ("net: dsa: provide an option for
drivers to always receive bridge VLANs"), DSA has historically been
skipping VLAN switchdev operations when the bridge wasn't in
vlan_filtering mode, but the reason why it was doing that has never been
clear. So the configure_vlan_while_not_filtering option is there merely
to preserve functionality for existing drivers. It isn't some behavior
that drivers should opt into. Ideally, when all drivers leave this flag
set, we can delete the dsa_port_skip_vlan_configuration() function.

New drivers always seem to omit setting this flag, for some reason. So
let's reverse the logic: the DSA core sets it by default to true before
the .setup() callback, and legacy drivers can turn it off. This way, new
drivers get the new behavior by default, unless they explicitly set the
flag to false, which is more obvious during review.

Remove the assignment from drivers which were setting it to true, and
add the assignment to false for the drivers that didn't previously have
it. This way, it should be easier to see how many we have left.

The following drivers: lan9303, mv88e6060 were skipped from setting this
flag to false, because they didn't have any VLAN offload ops in the
first place.

The Broadcom Starfighter 2 driver calls the common b53_switch_alloc and
therefore also inherits the configure_vlan_while_not_filtering=true
behavior.

Also, print a message through netlink extack every time a VLAN has been
skipped. This is mildly annoying on purpose, so that (a) it is at least
clear that VLANs are being skipped - the legacy behavior in itself is
confusing, and the extack should be much more difficult to miss, unlike
kernel logs - and (b) people have one more incentive to convert to the
new behavior.

No behavior change except for the added prints is intended at this time.

$ ip link add br0 type bridge vlan_filtering 0
$ ip link set sw0p2 master br0
[   60.315148] br0: port 1(sw0p2) entered blocking state
[   60.320350] br0: port 1(sw0p2) entered disabled state
[   60.327839] device sw0p2 entered promiscuous mode
[   60.334905] br0: port 1(sw0p2) entered blocking state
[   60.340142] br0: port 1(sw0p2) entered forwarding state
Warning: dsa_core: skipping configuration of VLAN. # This was the pvid
$ bridge vlan add dev sw0p2 vid 100
Warning: dsa_core: skipping configuration of VLAN.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210115231919.43834-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15 17:29:40 -08:00
Jakub Kicinski
e23a8d0021 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2021-01-16

1) Fix a double bpf_prog_put() for BPF_PROG_{TYPE_EXT,TYPE_TRACING} types in
   link creation's error path causing a refcount underflow, from Jiri Olsa.

2) Fix BTF validation errors for the case where kernel modules don't declare
   any new types and end up with an empty BTF, from Andrii Nakryiko.

3) Fix BPF local storage helpers to first check their {task,inode} owners for
   being NULL before access, from KP Singh.

4) Fix a memory leak in BPF setsockopt handling for the case where optlen is
   zero and thus temporary optval buffer should be freed, from Stanislav Fomichev.

5) Fix a syzbot memory allocation splat in BPF_PROG_TEST_RUN infra for
   raw_tracepoint caused by too big ctx_size_in, from Song Liu.

6) Fix LLVM code generation issues with verifier where PTR_TO_MEM{,_OR_NULL}
   registers were spilled to stack but not recognized, from Gilad Reti.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  MAINTAINERS: Update my email address
  selftests/bpf: Add verifier test for PTR_TO_MEM spill
  bpf: Support PTR_TO_MEM{,_OR_NULL} register spilling
  bpf: Reject too big ctx_size_in for raw_tp test run
  libbpf: Allow loading empty BTFs
  bpf: Allow empty module BTFs
  bpf: Don't leak memory in bpf getsockopt when optlen == 0
  bpf: Update local storage test to check handling of null ptrs
  bpf: Fix typo in bpf_inode_storage.c
  bpf: Local storage helpers should check nullness of owner ptr passed
  bpf: Prevent double bpf_prog_put call from bpf_tracing_prog_attach
====================

Link: https://lore.kernel.org/r/20210116002025.15706-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15 16:34:59 -08:00
Cong Wang
c96adff956 cls_flower: call nla_ok() before nla_next()
fl_set_enc_opt() simply checks if there are still bytes left to parse,
but this is not sufficent as syzbot seems to be able to generate
malformatted netlink messages. nla_ok() is more strict so should be
used to validate the next nlattr here.

And nla_validate_nested_deprecated() has less strict check too, it is
probably too late to switch to the strict version, but we can just
call nla_ok() too after it.

Reported-and-tested-by: syzbot+2624e3778b18fc497c92@syzkaller.appspotmail.com
Fixes: 0a6e77784f ("net/sched: allow flower to match tunnel options")
Fixes: 79b1011cb3 ("net: sched: allow flower to match erspan options")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/r/20210115185024.72298-1-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15 15:48:53 -08:00
George McCollister
54a52823a2 dsa: add support for Arrow XRS700x tag trailer
Add support for Arrow SpeedChips XRS700x single byte tag trailer. This
is modeled on tag_trailer.c which works in a similar way.

Signed-off-by: George McCollister <george.mccollister@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15 15:36:31 -08:00
Jakub Kicinski
1d9f03c0a1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 18:34:50 -08:00
Eran Ben Elisha
4f1cc51f34 net: flow_dissector: Parse PTP L2 packet header
Add support for parsing PTP L2 packet header. Such packet consists
of an L2 header (with ethertype of ETH_P_1588), PTP header, body
and an optional suffix.

Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 18:24:54 -08:00
Eran Ben Elisha
7185425582 net: vlan: Add parse protocol header ops
Add parse protocol header ops for vlan device. Before this patch, vlan
tagged packet transmitted by af_packet had skb->protocol unset. Some
kernel methods (like __skb_flow_dissect()) rely on this missing information
for its packet processing.

Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 18:24:53 -08:00
Tobias Waldekranz
5b60dadb71 net: dsa: tag_dsa: Support reception of packets from LAG devices
Packets ingressing on a LAG that egress on the CPU port, which are not
classified as management, will have a FORWARD tag that does not
contain the normal source device/port tuple. Instead the trunk bit
will be set, and the port field holds the LAG id.

Since the exact source port information is not available in the tag,
frames are injected directly on the LAG interface and thus do never
pass through any DSA port interface on ingress.

Management frames (TO_CPU) are not affected and will pass through the
DSA port interface as usual.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 17:11:56 -08:00
Tobias Waldekranz
058102a6e9 net: dsa: Link aggregation support
Monitor the following events and notify the driver when:

- A DSA port joins/leaves a LAG.
- A LAG, made up of DSA ports, joins/leaves a bridge.
- A DSA port in a LAG is enabled/disabled (enabled meaning
  "distributing" in 802.3ad LACP terms).

When a LAG joins a bridge, the DSA subsystem will treat that as each
individual port joining the bridge. The driver may look at the port's
LAG device pointer to see if it is associated with any LAG, if that is
required. This is analogue to how switchdev events are replicated out
to all lower devices when reaching e.g. a LAG.

Drivers can optionally request that DSA maintain a linear mapping from
a LAG ID to the corresponding netdev by setting ds->num_lag_ids to the
desired size.

In the event that the hardware is not capable of offloading a
particular LAG for any reason (the typical case being use of exotic
modes like broadcast), DSA will take a hands-off approach, allowing
the LAG to be formed as a pure software construct. This is reported
back through the extended ACK, but is otherwise transparent to the
user.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Tested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 17:11:56 -08:00
Tobias Waldekranz
5696c8aedf net: dsa: Don't offload port attributes on standalone ports
In a situation where a standalone port is indirectly attached to a
bridge (e.g. via a LAG) which is not offloaded, do not offload any
port attributes either. The port should behave as a standard NIC.

Previously, on mv88e6xxx, this meant that in the following setup:

     br0
     /
  team0
   / \
swp0 swp1

If vlan filtering was enabled on br0, swp0's and swp1's QMode was set
to "secure". This caused all untagged packets to be dropped, as their
default VID (0) was not loaded into the VTU.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 17:11:55 -08:00
Eelco Chaudron
a5317f3b06 net: openvswitch: add log message for error case
As requested by upstream OVS, added some error messages in the
validate_and_copy_dec_ttl function.

Includes a small cleanup, which removes an unnecessary parameter
from the dec_ttl_exception_handler() function.

Reported-by: Flavio Leitner <fbl@sysclose.org>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Link: https://lore.kernel.org/r/161054576573.26637.18396634650212670580.stgit@ebuild
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 16:32:14 -08:00
Linus Torvalds
e8c13a6bc8 Networking fixes for 5.11-rc4, including fixes from can and netfilter.
Current release - regressions:
 
  - fix feature enforcement to allow NETIF_F_HW_TLS_TX
    if IP_CSUM && IPV6_CSUM
 
  - dcb: accept RTM_GETDCB messages carrying set-like DCB commands
         if user is admin for backward-compatibility
 
  - selftests/tls: fix selftests build after adding ChaCha20-Poly1305
 
 Current release - always broken:
 
  - ppp: fix refcount underflow on channel unbridge
 
  - bnxt_en: clear DEFRAG flag in firmware message when retry flashing
 
  - smc: fix out of bound access in the new netlink interface
 
 Previous releases - regressions:
 
  - fix use-after-free with UDP GRO by frags
 
  - mptcp: better msk-level shutdown
 
  - rndis_host: set proper input size for OID_GEN_PHYSICAL_MEDIUM request
 
  - i40e: xsk: fix potential NULL pointer dereferencing
 
 Previous releases - always broken:
 
  - skb frag: kmap_atomic fixes
 
  - avoid 32 x truesize under-estimation for tiny skbs
 
  - fix issues around register_netdevice() failures
 
  - udp: prevent reuseport_select_sock from reading uninitialized socks
 
  - dsa: unbind all switches from tree when DSA master unbinds
 
  - dsa: clear devlink port type before unregistering slave netdevs
 
  - can: isotp: isotp_getname(): fix kernel information leak
 
  - mlxsw: core: Thermal control fixes
 
  - ipv6: validate GSO SKB against MTU before finish IPv6 processing
 
  - stmmac: use __napi_schedule() for PREEMPT_RT
 
  - net: mvpp2: remove Pause and Asym_Pause support
 
 Misc:
 
  - remove from MAINTAINERS folks who had been inactive for >5yrs
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmAAnyYACgkQMUZtbf5S
 IrsdmhAAotkTNVS1zEsvwIirI9KUKKMXvNvscpO0+HJgsQHVnCGkfrj0BQmqQR21
 D9njJIkGRiIANRO/Y/3wVCew55a0bxLmyE3JaU6krGLpvcNUFX6+fvuuzFSiWtKu
 1c/AaXFIDTa8uVtXP/Ve8DfxKZmh3YPX5pNtk3fS6OlymbUfu8pOEPY5k69/Nlmr
 QwbGZO0Q5Ab18rmPztgWpcZi8wLbpZYbrIR2E45u3k+LnXG3UUVYeYTC9Hi89wkz
 8YiS0PIs6GmWeSWnWK9TWXFSaxV8ttABsFxpbmzWW6oqkaviGjLfPg7kYYRgPu08
 nCyYx7LN58shQ8FTfZm1yBpJ1fbPV/5RIMZKQ6Fg4cICgCab63E4N6xxoA9mLNu9
 hP/qgeynQ2w1FbPw5yQVbDCVmcyfPb5V4WC1OccHQdgaAzz2SFPxvsUTOoBRxY8m
 DmZDHjBi2ZXB3/PSkwWmIsW9PuPq6de8xgHIQtjrCeduvVvmOYkrcdfkMxTx9HC0
 LH2a5x9VCL/cf/Y/tQ2TZSntweSq8MhlRV9vOIO1FOqiviYHlnD8+EuIBMe8To14
 XRIDMl92lpY5xjJpKdRhZ7Yh4CNMk199yFf5bt3xSlM4A3ALUlwqRKES6I2MZiiF
 0Yvxsr2qVShaHx6XpmBAimaUXxTmmUV7X1hf19EEzzmTdiMjad4=
 =e8t6
 -----END PGP SIGNATURE-----

Merge tag 'net-5.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "We have a few fixes for long standing issues, in particular Eric's fix
  to not underestimate the skb sizes, and my fix for brokenness of
  register_netdevice() error path. They may uncover other bugs so we
  will keep an eye on them. Also included are Willem's fixes for
  kmap(_atomic).

  Looking at the "current release" fixes, it seems we are about one rc
  behind a normal cycle. We've previously seen an uptick of "people had
  run their test suites" / "humans actually tried to use new features"
  fixes between rc2 and rc3.

  Summary:

  Current release - regressions:

   - fix feature enforcement to allow NETIF_F_HW_TLS_TX if IP_CSUM &&
     IPV6_CSUM

   - dcb: accept RTM_GETDCB messages carrying set-like DCB commands if
     user is admin for backward-compatibility

   - selftests/tls: fix selftests build after adding ChaCha20-Poly1305

  Current release - always broken:

   - ppp: fix refcount underflow on channel unbridge

   - bnxt_en: clear DEFRAG flag in firmware message when retry flashing

   - smc: fix out of bound access in the new netlink interface

  Previous releases - regressions:

   - fix use-after-free with UDP GRO by frags

   - mptcp: better msk-level shutdown

   - rndis_host: set proper input size for OID_GEN_PHYSICAL_MEDIUM
     request

   - i40e: xsk: fix potential NULL pointer dereferencing

  Previous releases - always broken:

   - skb frag: kmap_atomic fixes

   - avoid 32 x truesize under-estimation for tiny skbs

   - fix issues around register_netdevice() failures

   - udp: prevent reuseport_select_sock from reading uninitialized socks

   - dsa: unbind all switches from tree when DSA master unbinds

   - dsa: clear devlink port type before unregistering slave netdevs

   - can: isotp: isotp_getname(): fix kernel information leak

   - mlxsw: core: Thermal control fixes

   - ipv6: validate GSO SKB against MTU before finish IPv6 processing

   - stmmac: use __napi_schedule() for PREEMPT_RT

   - net: mvpp2: remove Pause and Asym_Pause support

  Misc:

   - remove from MAINTAINERS folks who had been inactive for >5yrs"

* tag 'net-5.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits)
  mptcp: fix locking in mptcp_disconnect()
  net: Allow NETIF_F_HW_TLS_TX if IP_CSUM && IPV6_CSUM
  MAINTAINERS: dccp: move Gerrit Renker to CREDITS
  MAINTAINERS: ipvs: move Wensong Zhang to CREDITS
  MAINTAINERS: tls: move Aviad to CREDITS
  MAINTAINERS: ena: remove Zorik Machulsky from reviewers
  MAINTAINERS: vrf: move Shrijeet to CREDITS
  MAINTAINERS: net: move Alexey Kuznetsov to CREDITS
  MAINTAINERS: altx: move Jay Cliburn to CREDITS
  net: avoid 32 x truesize under-estimation for tiny skbs
  nt: usb: USB_RTL8153_ECM should not default to y
  net: stmmac: fix taprio configuration when base_time is in the past
  net: stmmac: fix taprio schedule configuration
  net: tip: fix a couple kernel-doc markups
  net: sit: unregister_netdevice on newlink's error path
  net: stmmac: Fixed mtu channged by cache aligned
  cxgb4/chtls: Fix tid stuck due to wrong update of qid
  i40e: fix potential NULL pointer dereferencing
  net: stmmac: use __napi_schedule() for PREEMPT_RT
  can: mcp251xfd: mcp251xfd_handle_rxif_one(): fix wrong NULL pointer check
  ...
2021-01-14 13:31:07 -08:00
Lorenzo Bianconi
c13cf5c159 mac80211: check if atf has been disabled in __ieee80211_schedule_txq
Check if atf has been disabled in __ieee80211_schedule_txq() in order to
avoid a given sta is always put to the beginning of the active_txqs list
and never moved to the end since deficit is not decremented in
ieee80211_sta_register_airtime()

Fixes: b4809e9484 ("mac80211: Add airtime accounting and scheduling to TXQs")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://lore.kernel.org/r/93889406c50f1416214c079ca0b8c9faecc5143e.1608975195.git.lorenzo@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-14 22:27:38 +01:00
Felix Fietkau
2463ec86cd mac80211: do not drop tx nulldata packets on encrypted links
ieee80211_tx_h_select_key drops any non-mgmt packets without a key when
encryption is used. This is wrong for nulldata packets that can't be
encrypted and are sent out for probing clients and indicating 4-address
mode.

Reported-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Fixes: a0761a3017 ("mac80211: drop data frames without key on encrypted links")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20201218191525.1168-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-14 22:26:10 +01:00
Felix Fietkau
b101dd2d22 mac80211: fix encryption key selection for 802.3 xmit
When using WEP, the default unicast key needs to be selected, instead of
the STA PTK.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20201218184718.93650-4-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-14 22:25:43 +01:00
Felix Fietkau
622d3b4e39 mac80211: fix fast-rx encryption check
When using WEP, the default unicast key needs to be selected, instead of
the STA PTK.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20201218184718.93650-5-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-14 22:25:28 +01:00
Shayne Chen
6020d534fa mac80211: fix incorrect strlen of .write in debugfs
This fixes strlen mismatch problems happening in some .write callbacks
of debugfs.

When trying to configure airtime_flags in debugfs, an error appeared:
ash: write error: Invalid argument

The error is returned from kstrtou16() since a wrong length makes it
miss the real end of input string.  To fix this, use count as the string
length, and set proper end of string for a char buffer.

The debug print is shown - airtime_flags_write: count = 2, len = 8,
where the actual length is 2, but "len = strlen(buf)" gets 8.

Also cleanup the other similar cases for the sake of consistency.

Signed-off-by: Sujuan Chen <sujuan.chen@mediatek.com>
Signed-off-by: Ryder Lee <ryder.lee@mediatek.com>
Signed-off-by: Shayne Chen <shayne.chen@mediatek.com>
Link: https://lore.kernel.org/r/20210112032028.7482-1-shayne.chen@mediatek.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-14 22:23:56 +01:00
Paolo Abeni
13a9499e83 mptcp: fix locking in mptcp_disconnect()
tcp_disconnect() expects the caller acquires the sock lock,
but mptcp_disconnect() is not doing that. Add the missing
required lock.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: 76e2a55d16 ("mptcp: better msk-level shutdown.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/f818e82b58a556feeb71dcccc8bf1c87aafc6175.1610638176.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 11:25:21 -08:00
Tariq Toukan
25537d71e2 net: Allow NETIF_F_HW_TLS_TX if IP_CSUM && IPV6_CSUM
Cited patch below blocked the TLS TX device offload unless HW_CSUM
is set. This broke devices that use IP_CSUM && IP6_CSUM.
Here we fix it.

Note that the single HW_TLS_TX feature flag indicates support for
both IPv4/6, hence it should still be disabled in case only one of
(IP_CSUM | IPV6_CSUM) is set.

Fixes: ae0b04b238 ("net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reported-by: Rohit Maheshwari <rohitm@chelsio.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Link: https://lore.kernel.org/r/20210114151215.7061-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 10:56:13 -08:00
Eric Dumazet
3226b158e6 net: avoid 32 x truesize under-estimation for tiny skbs
Both virtio net and napi_get_frags() allocate skbs
with a very small skb->head

While using page fragments instead of a kmalloc backed skb->head might give
a small performance improvement in some cases, there is a huge risk of
under estimating memory usage.

For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
per page (order-3 page in x86), or even 64 on PowerPC

We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
but consuming far more memory for TCP buffers than instructed in tcp_mem[2]

Even if we force napi_alloc_skb() to only use order-0 pages, the issue
would still be there on arches with PAGE_SIZE >= 32768

This patch makes sure that small skb head are kmalloc backed, so that
other objects in the slab page can be reused instead of being held as long
as skbs are sitting in socket queues.

Note that we might in the future use the sk_buff napi cache,
instead of going through a more expensive __alloc_skb()

Another idea would be to use separate page sizes depending
on the allocated length (to never have more than 4 frags per page)

I would like to thank Greg Thelen for his precious help on this matter,
analysing crash dumps is always a time consuming task.

Fixes: fd11a83dd3 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20210113161819.1155526-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 10:51:52 -08:00
Mauro Carvalho Chehab
2576477929 net: tip: fix a couple kernel-doc markups
A function has a different name between their prototype
and its kernel-doc markup:

	../net/tipc/link.c:2551: warning: expecting prototype for link_reset_stats(). Prototype was for tipc_link_reset_stats() instead
	../net/tipc/node.c:1678: warning: expecting prototype for is the general link level function for message sending(). Prototype was for tipc_node_xmit() instead

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 10:30:24 -08:00
Jakub Kicinski
47e4bb147a net: sit: unregister_netdevice on newlink's error path
We need to unregister the netdevice if config failed.
.ndo_uninit takes care of most of the heavy lifting.

This was uncovered by recent commit c269a24ce0 ("net: make
free_netdev() more lenient with unregistering devices").
Previously the partially-initialized device would be left
in the system.

Reported-and-tested-by: syzbot+2393580080a2da190f04@syzkaller.appspotmail.com
Fixes: e2f1f072db ("sit: allow to configure 6rd tunnels via netlink")
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20210114012947.2515313-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 10:26:46 -08:00
Yuchung Cheng
0ae5b43d6d tcp: assign skb hash after tcp_event_data_sent
Move skb_set_hash_from_sk s.t. it's called after instead of before
tcp_event_data_sent is called. This enables congestion control
modules to change the socket hash right before restarting from
idle (via the TX_START congestion event).

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20210111230552.2704579-1-ycheng@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-13 19:36:51 -08:00
Song Liu
7ac6ad0511 bpf: Reject too big ctx_size_in for raw_tp test run
syzbot reported a WARNING for allocating too big memory:

WARNING: CPU: 1 PID: 8484 at mm/page_alloc.c:4976 __alloc_pages_nodemask+0x5f8/0x730 mm/page_alloc.c:5011
Modules linked in:
CPU: 1 PID: 8484 Comm: syz-executor862 Not tainted 5.11.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__alloc_pages_nodemask+0x5f8/0x730 mm/page_alloc.c:4976
Code: 00 00 0c 00 0f 85 a7 00 00 00 8b 3c 24 4c 89 f2 44 89 e6 c6 44 24 70 00 48 89 6c 24 58 e8 d0 d7 ff ff 49 89 c5 e9 ea fc ff ff <0f> 0b e9 b5 fd ff ff 89 74 24 14 4c 89 4c 24 08 4c 89 74 24 18 e8
RSP: 0018:ffffc900012efb10 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 1ffff9200025df66 RCX: 0000000000000000
RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000140dc0
RBP: 0000000000140dc0 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff81b1f7e1 R11: 0000000000000000 R12: 0000000000000014
R13: 0000000000000014 R14: 0000000000000000 R15: 0000000000000000
FS:  000000000190c880(0000) GS:ffff8880b9e00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f08b7f316c0 CR3: 0000000012073000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
alloc_pages_current+0x18c/0x2a0 mm/mempolicy.c:2267
alloc_pages include/linux/gfp.h:547 [inline]
kmalloc_order+0x2e/0xb0 mm/slab_common.c:837
kmalloc_order_trace+0x14/0x120 mm/slab_common.c:853
kmalloc include/linux/slab.h:557 [inline]
kzalloc include/linux/slab.h:682 [inline]
bpf_prog_test_run_raw_tp+0x4b5/0x670 net/bpf/test_run.c:282
bpf_prog_test_run kernel/bpf/syscall.c:3120 [inline]
__do_sys_bpf+0x1ea9/0x4f10 kernel/bpf/syscall.c:4398
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x440499
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffe1f3bfb18 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440499
RDX: 0000000000000048 RSI: 0000000020000600 RDI: 000000000000000a
RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401ca0
R13: 0000000000401d30 R14: 0000000000000000 R15: 0000000000000000

This is because we didn't filter out too big ctx_size_in. Fix it by
rejecting ctx_size_in that are bigger than MAX_BPF_FUNC_ARGS (12) u64
numbers.

Fixes: 1b4d60ec16 ("bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint")
Reported-by: syzbot+4f98876664c7337a4ae6@syzkaller.appspotmail.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210112234254.1906829-1-songliubraving@fb.com
2021-01-13 19:31:43 -08:00
Menglong Dong
324cefaf1c net: core: use eth_type_vlan in __netif_receive_skb_core
Replace the check for ETH_P_8021Q and ETH_P_8021AD in
__netif_receive_skb_core with eth_type_vlan.

Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Link: https://lore.kernel.org/r/20210111104221.3451-1-dong.menglong@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-13 19:08:54 -08:00
Oliver Hartkopp
b42b3a2744 can: isotp: isotp_getname(): fix kernel information leak
Initialize the sockaddr_can structure to prevent a data leak to user space.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Reported-by: syzbot+057884e2f453e8afebc8@syzkaller.appspotmail.com
Fixes: e057dd3fc2 ("can: add ISO 15765-2:2016 transport protocol")
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20210112091643.11789-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-01-13 22:15:13 +01:00
Baptiste Lepers
a95d25dd7b rxrpc: Call state should be read with READ_ONCE() under some circumstances
The call state may be changed at any time by the data-ready routine in
response to received packets, so if the call state is to be read and acted
upon several times in a function, READ_ONCE() must be used unless the call
state lock is held.

As it happens, we used READ_ONCE() to read the state a few lines above the
unmarked read in rxrpc_input_data(), so use that value rather than
re-reading it.

Fixes: a158bdd324 ("rxrpc: Fix call timeouts")
Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/161046715522.2450566.488819910256264150.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-13 10:38:20 -08:00
David Howells
d52e419ac8 rxrpc: Fix handling of an unsupported token type in rxrpc_read()
Clang static analysis reports the following:

net/rxrpc/key.c:657:11: warning: Assigned value is garbage or undefined
                toksize = toksizes[tok++];
                        ^ ~~~~~~~~~~~~~~~

rxrpc_read() contains two consecutive loops.  The first loop calculates the
token sizes and stores the results in toksizes[] and the second one uses
the array.  When there is an error in identifying the token in the first
loop, the token is skipped, no change is made to the toksizes[] array.
When the same error happens in the second loop, the token is not skipped.
This will cause the toksizes[] array to be out of step and will overrun
past the calculated sizes.

Fix this by making both loops log a message and return an error in this
case.  This should only happen if a new token type is incompletely
implemented, so it should normally be impossible to trigger this.

Fixes: 9a059cd5ca ("rxrpc: Downgrade the BUG() for unsupported token type in rxrpc_read()")
Reported-by: Tom Rix <trix@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/161046503122.2445787.16714129930607546635.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-13 10:38:00 -08:00
Chuck Lever
5f39d2713b SUNRPC: Move the svc_xdr_recvfrom tracepoint again
Commit 156708adf2 ("SUNRPC: Move the svc_xdr_recvfrom()
tracepoint") tried to capture the correct XID in the trace record,
but this line in svc_recv:

	rqstp->rq_xid = svc_getu32(&rqstp->rq_arg.head[0]);

alters the size of rq_arg.head[0].iov_len. The tracepoint records
the correct XID but an incorrect value for the length of the
xdr_buf's head.

To keep the trace callsites simple, I've created two trace classes.
One assumes the xdr_buf contains a full RPC message, and the XID
can be extracted from it. The other assumes the contents of the
xdr_buf are arbitrary, and the xid will be provided by the caller.

Currently there is only one user of each class, but I expect we will
need a few more tracepoints using each class as time goes on.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2021-01-13 09:13:20 -05:00
Jakub Kicinski
c8a8ead017 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Pass conntrack -f to specify family in netfilter conntrack helper
   selftests, from Chen Yi.

2) Honor hashsize modparam from nf_conntrack_buckets sysctl,
   from Jesper D. Brouer.

3) Fix memleak in nf_nat_init() error path, from Dinghao Liu.

* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
  netfilter: nf_nat: Fix memleak in nf_nat_init
  netfilter: conntrack: fix reading nf_conntrack_buckets
  selftests: netfilter: Pass family parameter "-f" to conntrack tool
====================

Link: https://lore.kernel.org/r/20210112222033.9732-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-12 20:25:29 -08:00
Guvenc Gulce
8a44653689 net/smc: use memcpy instead of snprintf to avoid out of bounds read
Using snprintf() to convert not null-terminated strings to null
terminated strings may cause out of bounds read in the source string.
Therefore use memcpy() and terminate the target string with a null
afterwards.

Fixes: a3db10efcc ("net/smc: Add support for obtaining SMCR device list")
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-12 20:22:01 -08:00
Jakub Kicinski
25fe2c9c4c smc: fix out of bound access in smc_nl_get_sys_info()
smc_clc_get_hostname() sets the host pointer to a buffer
which is not NULL-terminated (see smc_clc_init()).

Reported-by: syzbot+f4708c391121cfc58396@syzkaller.appspotmail.com
Fixes: 099b990bd1 ("net/smc: Add support for obtaining system information")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-12 20:22:01 -08:00
YANG LI
f50e2f9f79 hci: llc_shdlc: style: Simplify bool comparison
Fix the following coccicheck warning:
./net/nfc/hci/llc_shdlc.c:239:5-21: WARNING: Comparison to bool

Signed-off-by: YANG LI <abaci-bugfix@linux.alibaba.com>
Reported-by: Abaci Robot<abaci@linux.alibaba.com>
Link: https://lore.kernel.org/r/1610357063-57705-1-git-send-email-abaci-bugfix@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-12 20:18:30 -08:00
Oleksij Rempel
c2ec5f2ecf net: dsa: add optional stats64 support
Allow DSA drivers to export stats64

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-12 20:17:09 -08:00
Paolo Abeni
76e2a55d16 mptcp: better msk-level shutdown.
Instead of re-implementing most of inet_shutdown, re-use
such helper, and implement the MPTCP-specific bits at the
'proto' level.

The msk-level disconnect() can now be invoked, lets provide a
suitable implementation.

As a side effect, this fixes bad state management for listener
sockets. The latter could lead to division by 0 oops since
commit ea4ca586b1 ("mptcp: refine MPTCP-level ack scheduling").

Fixes: 43b54c6ee3 ("mptcp: Use full MPTCP-level disconnect state machine")
Fixes: ea4ca586b1 ("mptcp: refine MPTCP-level ack scheduling")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-12 20:09:19 -08:00
Paolo Abeni
20bc80b6f5 mptcp: more strict state checking for acks
Syzkaller found a way to trigger division by zero
in mptcp_subflow_cleanup_rbuf().

The current checks implemented into tcp_can_send_ack()
are too week, let's be more accurate.

Reported-by: Christoph Paasch <cpaasch@apple.com>
Fixes: ea4ca586b1 ("mptcp: refine MPTCP-level ack scheduling")
Fixes: fd8976790a ("mptcp: be careful on MPTCP-level ack.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-12 20:09:19 -08:00
Vladimir Oltean
91158e1680 net: dsa: clear devlink port type before unregistering slave netdevs
Florian reported a use-after-free bug in devlink_nl_port_fill found with
KASAN:

(devlink_nl_port_fill)
(devlink_port_notify)
(devlink_port_unregister)
(dsa_switch_teardown.part.3)
(dsa_tree_teardown_switches)
(dsa_unregister_switch)
(bcm_sf2_sw_remove)
(platform_remove)
(device_release_driver_internal)
(device_links_unbind_consumers)
(device_release_driver_internal)
(device_driver_detach)
(unbind_store)

Allocated by task 31:
 alloc_netdev_mqs+0x5c/0x50c
 dsa_slave_create+0x110/0x9c8
 dsa_register_switch+0xdb0/0x13a4
 b53_switch_register+0x47c/0x6dc
 bcm_sf2_sw_probe+0xaa4/0xc98
 platform_probe+0x90/0xf4
 really_probe+0x184/0x728
 driver_probe_device+0xa4/0x278
 __device_attach_driver+0xe8/0x148
 bus_for_each_drv+0x108/0x158

Freed by task 249:
 free_netdev+0x170/0x194
 dsa_slave_destroy+0xac/0xb0
 dsa_port_teardown.part.2+0xa0/0xb4
 dsa_tree_teardown_switches+0x50/0xc4
 dsa_unregister_switch+0x124/0x250
 bcm_sf2_sw_remove+0x98/0x13c
 platform_remove+0x44/0x5c
 device_release_driver_internal+0x150/0x254
 device_links_unbind_consumers+0xf8/0x12c
 device_release_driver_internal+0x84/0x254
 device_driver_detach+0x30/0x34
 unbind_store+0x90/0x134

What happens is that devlink_port_unregister emits a netlink
DEVLINK_CMD_PORT_DEL message which associates the devlink port that is
getting unregistered with the ifindex of its corresponding net_device.
Only trouble is, the net_device has already been unregistered.

It looks like we can stub out the search for a corresponding net_device
if we clear the devlink_port's type. This looks like a bit of a hack,
but also seems to be the reason why the devlink_port_type_clear function
exists in the first place.

Fixes: 3122433eb5 ("net: dsa: Register devlink ports before calling DSA driver setup()")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian fainelli <f.fainelli@gmail.com>
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210112004831.3778323-1-olteanv@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-12 18:48:50 -08:00
Vladimir Oltean
07b90056cb net: dsa: unbind all switches from tree when DSA master unbinds
Currently the following happens when a DSA master driver unbinds while
there are DSA switches attached to it:

$ echo 0000:00:00.5 > /sys/bus/pci/drivers/mscc_felix/unbind
------------[ cut here ]------------
WARNING: CPU: 0 PID: 392 at net/core/dev.c:9507
Call trace:
 rollback_registered_many+0x5fc/0x688
 unregister_netdevice_queue+0x98/0x120
 dsa_slave_destroy+0x4c/0x88
 dsa_port_teardown.part.16+0x78/0xb0
 dsa_tree_teardown_switches+0x58/0xc0
 dsa_unregister_switch+0x104/0x1b8
 felix_pci_remove+0x24/0x48
 pci_device_remove+0x48/0xf0
 device_release_driver_internal+0x118/0x1e8
 device_driver_detach+0x28/0x38
 unbind_store+0xd0/0x100

Located at the above location is this WARN_ON:

	/* Notifier chain MUST detach us all upper devices. */
	WARN_ON(netdev_has_any_upper_dev(dev));

Other stacked interfaces, like VLAN, do indeed listen for
NETDEV_UNREGISTER on the real_dev and also unregister themselves at that
time, which is clearly the behavior that rollback_registered_many
expects. But DSA interfaces are not VLAN. They have backing hardware
(platform devices, PCI devices, MDIO, SPI etc) which have a life cycle
of their own and we can't just trigger an unregister from the DSA
framework when we receive a netdev notifier that the master unregisters.

Luckily, there is something we can do, and that is to inform the driver
core that we have a runtime dependency to the DSA master interface's
device, and create a device link where that is the supplier and we are
the consumer. Having this device link will make the DSA switch unbind
before the DSA master unbinds, which is enough to avoid the WARN_ON from
rollback_registered_many.

Note that even before the blamed commit, DSA did nothing intelligent
when the master interface got unregistered either. See the discussion
here:
https://lore.kernel.org/netdev/20200505210253.20311-1-f.fainelli@gmail.com/
But this time, at least the WARN_ON is loud enough that the
upper_dev_link commit can be blamed.

The advantage with this approach vs dev_hold(master) in the attached
link is that the latter is not meant for long term reference counting.
With dev_hold, the only thing that will happen is that when the user
attempts an unbind of the DSA master, netdev_wait_allrefs will keep
waiting and waiting, due to DSA keeping the refcount forever. DSA would
not access freed memory corresponding to the master interface, but the
unbind would still result in a freeze. Whereas with device links,
graceful teardown is ensured. It even works with cascaded DSA trees.

$ echo 0000:00:00.2 > /sys/bus/pci/drivers/fsl_enetc/unbind
[ 1818.797546] device swp0 left promiscuous mode
[ 1819.301112] sja1105 spi2.0: Link is Down
[ 1819.307981] DSA: tree 1 torn down
[ 1819.312408] device eno2 left promiscuous mode
[ 1819.656803] mscc_felix 0000:00:00.5: Link is Down
[ 1819.667194] DSA: tree 0 torn down
[ 1819.711557] fsl_enetc 0000:00:00.2 eno2: Link is Down

This approach allows us to keep the DSA framework absolutely unchanged,
and the driver core will just know to unbind us first when the master
goes away - as opposed to the large (and probably impossible) rework
required if attempting to listen for NETDEV_UNREGISTER.

As per the documentation at Documentation/driver-api/device_link.rst,
specifying the DL_FLAG_AUTOREMOVE_CONSUMER flag causes the device link
to be automatically purged when the consumer fails to probe or later
unbinds. So we don't need to keep the consumer_link variable in struct
dsa_switch.

Fixes: 2f1e8ea726 ("net: dsa: link interfaces with the DSA master to get rid of lockdep warnings")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210111230943.3701806-1-olteanv@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-12 18:48:40 -08:00
Petr Machata
df85bc140a net: dcb: Accept RTM_GETDCB messages carrying set-like DCB commands
In commit 826f328e2b ("net: dcb: Validate netlink message in DCB
handler"), Linux started rejecting RTM_GETDCB netlink messages if they
contained a set-like DCB_CMD_ command.

The reason was that privileges were only verified for RTM_SETDCB messages,
but the value that determined the action to be taken is the command, not
the message type. And validation of message type against the DCB command
was the obvious missing piece.

Unfortunately it turns out that mlnx_qos, a somewhat widely deployed tool
for configuration of DCB, accesses the DCB set-like APIs through
RTM_GETDCB.

Therefore do not bounce the discrepancy between message type and command.
Instead, in addition to validating privileges based on the actual message
type, validate them also based on the expected message type. This closes
the loophole of allowing DCB configuration on non-admin accounts, while
maintaining backward compatibility.

Fixes: 2f90b8657e ("ixgbe: this patch adds support for DCB to the kernel and ixgbe driver")
Fixes: 826f328e2b ("net: dcb: Validate netlink message in DCB handler")
Signed-off-by: Petr Machata <petrm@nvidia.com>
Link: https://lore.kernel.org/r/a3edcfda0825f2aa2591801c5232f2bbf2d8a554.1610384801.git.me@pmachata.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-12 15:55:21 -08:00
Daniel Borkmann
bcd6f4a8be bpf: Allow to retrieve sol_socket opts from sock_addr progs
The _bpf_setsockopt() is able to set some of the SOL_SOCKET level options,
however, _bpf_getsockopt() has little support to actually retrieve them.
This small patch adds few misc options such as SO_MARK, SO_PRIORITY and
SO_BINDTOIFINDEX. For the latter getter and setter are added. The mark and
priority in particular allow to retrieve the options from BPF cgroup hooks
to then implement custom behavior / settings on the syscall hooks compared
to other sockets that stick to the defaults, for example.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/cba44439b801e5ddc1170e5be787f4dc93a2d7f9.1610406333.git.daniel@iogearbox.net
2021-01-12 14:44:53 -08:00
Linus Torvalds
e609571b5f NFS client bugfixes for Linux 5.11
Highlights include:
 
 Bugfixes:
 - Fix parsing of link-local IPv6 addresses
 - Fix confusing logging of mount errors that was introduced by the
   fsopen() patchset.
 - Fix a tracing use after free in _nfs4_do_setlk()
 - Layout return-on-close fixes when called from nfs4_evict_inode()
 - Layout segments were being leaked in pnfs_generic_clear_request_commit()
 - Don't leak DS commits in pnfs_generic_retry_commit()
 - Fix an Oopsable use-after-free when nfs_delegation_find_inode_server()
   calls iput() on an inode after the super block has gone away.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl/806IACgkQZwvnipYK
 APKIZA/+L+LvkMXflS9TQGGccpOPw+BBW5ixi2DabFYLqHz6WXNnIcUStU0NtF3q
 uHM2YrJT0XtWtQ8W6fWcsfdeS/1ixciXDS/5RH/o2e+fFMNg1lPWAeOc4brQSDFd
 DYEc7lSqw0D/pX8vY4dFIrpQorU2hnasjMK582JU7mDYXveRMLB/Bhcq9qBP2XgQ
 LVUpnHU/3dayvFGmr/sPzzZk/rIEfPaHU/J0YLbPfrEGFOo/mZKqstfS4ZkINAWp
 0yRD90s1hWTfRcxAiDaUoYPoxEw5AYjdbwC82owOaEa0zNWA2U7tD94UeVS51JCJ
 DtCn81znWaF4jVzes4VGzPlWirYoumthJwrKpKh04tEwo0a4V4AtsOAg2IbxfE/O
 CYsfwjwikzW4nOEerv22zOHICLNd2IP65kHAACaN0NVhS7dlLSuckwnMILdstD2Z
 x0LHxFhyRQe5c7bf6W6Jal2E/ThyD2qaUmSIxWweTq93OldD0mTLGHO7e2/chXwP
 3xkcuZLpU6bmg9QzmylWZWBB3ncDtC95VlRv/IV29mbN3a8XjJaugSOAwjx14JNT
 OFlJtLav2pvCwFLUutvgAMSgbshhfkwdUoUUHrcabXNL/4QBeeZB/pp9Ytr3NoBT
 xxC6nmB/Af7FtRnTrTpOSlH9s1NEB3JN4uMNx4kAKC+ZLySdMPQ=
 =08H3
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.11-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client fixes from Trond Myklebust:
 "Highlights include:

   - Fix parsing of link-local IPv6 addresses

   - Fix confusing logging of mount errors that was introduced by the
     fsopen() patchset.

   - Fix a tracing use after free in _nfs4_do_setlk()

   - Layout return-on-close fixes when called from nfs4_evict_inode()

   - Layout segments were being leaked in
     pnfs_generic_clear_request_commit()

   - Don't leak DS commits in pnfs_generic_retry_commit()

   - Fix an Oopsable use-after-free when nfs_delegation_find_inode_server()
     calls iput() on an inode after the super block has gone away"

* tag 'nfs-for-5.11-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS: nfs_igrab_and_active must first reference the superblock
  NFS: nfs_delegation_find_inode_server must first reference the superblock
  NFS/pNFS: Fix a leak of the layout 'plh_outstanding' counter
  NFS/pNFS: Don't leak DS commits in pnfs_generic_retry_commit()
  NFS/pNFS: Don't call pnfs_free_bucket_lseg() before removing the request
  pNFS: Stricter ordering of layoutget and layoutreturn
  pNFS: Clean up pnfs_layoutreturn_free_lsegs()
  pNFS: We want return-on-close to complete when evicting the inode
  pNFS: Mark layout for return if return-on-close was not sent
  net: sunrpc: interpret the return value of kstrtou32 correctly
  NFS: Adjust fs_context error logging
  NFS4: Fix use-after-free in trace_event_raw_event_nfs4_set_lock
2021-01-12 09:38:53 -08:00
Willem de Bruijn
9bd6b629c3 esp: avoid unneeded kmap_atomic call
esp(6)_output_head uses skb_page_frag_refill to allocate a buffer for
the esp trailer.

It accesses the page with kmap_atomic to handle highmem. But
skb_page_frag_refill can return compound pages, of which
kmap_atomic only maps the first underlying page.

skb_page_frag_refill does not return highmem, because flag
__GFP_HIGHMEM is not set. ESP uses it in the same manner as TCP.
That also does not call kmap_atomic, but directly uses page_address,
in skb_copy_to_page_nocache. Do the same for ESP.

This issue has become easier to trigger with recent kmap local
debugging feature CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP.

Fixes: cac2661c53 ("esp4: Avoid skb_cow_data whenever possible")
Fixes: 03e2a30f6a ("esp6: Avoid skb_cow_data whenever possible")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-11 18:20:09 -08:00
Willem de Bruijn
97550f6fa5 net: compound page support in skb_seq_read
skb_seq_read iterates over an skb, returning pointer and length of
the next data range with each call.

It relies on kmap_atomic to access highmem pages when needed.

An skb frag may be backed by a compound page, but kmap_atomic maps
only a single page. There are not enough kmap slots to always map all
pages concurrently.

Instead, if kmap_atomic is needed, iterate over each page.

As this increases the number of calls, avoid this unless needed.
The necessary condition is captured in skb_frag_must_loop.

I tried to make the change as obvious as possible. It should be easy
to verify that nothing changes if skb_frag_must_loop returns false.

Tested:
  On an x86 platform with
    CONFIG_HIGHMEM=y
    CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP=y
    CONFIG_NETFILTER_XT_MATCH_STRING=y

  Run
    ip link set dev lo mtu 1500
    iptables -A OUTPUT -m string --string 'badstring' -algo bm -j ACCEPT
    dd if=/dev/urandom of=in bs=1M count=20
    nc -l -p 8000 > /dev/null &
    nc -w 1 -q 0 localhost 8000 < in

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-11 18:20:09 -08:00
Vladimir Oltean
417b99bf75 net: dsa: remove obsolete comments about switchdev transactions
Now that all port object notifiers were converted to be non-transactional,
we can remove the comments that say otherwise.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-11 16:00:57 -08:00
Vladimir Oltean
1958d5815c net: dsa: remove the transactional logic from VLAN objects
It should be the driver's business to logically separate its VLAN
offloading into a preparation and a commit phase, and some drivers don't
need / can't do this.

So remove the transactional shim from DSA and let drivers propagate
errors directly from the .port_vlan_add callback.

It would appear that the code has worse error handling now than it had
before. DSA is the only in-kernel user of switchdev that offloads one
switchdev object to more than one port: for every VLAN object offloaded
to a user port, that VLAN is also offloaded to the CPU port. So the
"prepare for user port -> check for errors -> prepare for CPU port ->
check for errors -> commit for user port -> commit for CPU port"
sequence appears to make more sense than the one we are using now:
"offload to user port -> check for errors -> offload to CPU port ->
check for errors", but it is really a compromise. In the new way, we can
catch errors from the commit phase that we previously had to ignore.
But we have our hands tied and cannot do any rollback now: if we add a
VLAN on the CPU port and it fails, we can't do the rollback by simply
deleting it from the user port, because the switchdev API is not so nice
with us: it could have simply been there already, even with the same
flags. So we don't even attempt to rollback anything on addition error,
just leave whatever VLANs managed to get offloaded right where they are.
This should not be a problem at all in practice.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-11 16:00:57 -08:00
Vladimir Oltean
a52b2da778 net: dsa: remove the transactional logic from MDB entries
For many drivers, the .port_mdb_prepare callback was not a good opportunity
to avoid any error condition, and they would suppress errors found during
the actual commit phase.

Where a logical separation between the prepare and the commit phase
existed, the function that used to implement the .port_mdb_prepare
callback still exists, but now it is called directly from .port_mdb_add,
which was modified to return an int code.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
Reviewed-by: Linus Wallei <linus.walleij@linaro.org> # RTL8366
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-11 16:00:57 -08:00
Vladimir Oltean
77b61365ec net: dsa: remove the transactional logic from ageing time notifiers
Remove the shim introduced in DSA for offloading the bridge ageing time
from switchdev, by first checking whether the ageing time is within the
range limits requested by the driver.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-11 16:00:57 -08:00
Vladimir Oltean
bae33f2b5a net: switchdev: remove the transaction structure from port attributes
Since the introduction of the switchdev API, port attributes were
transmitted to drivers for offloading using a two-step transactional
model, with a prepare phase that was supposed to catch all errors, and a
commit phase that was supposed to never fail.

Some classes of failures can never be avoided, like hardware access, or
memory allocation. In the latter case, merely attempting to move the
memory allocation to the preparation phase makes it impossible to avoid
memory leaks, since commit 91cf8eceff ("switchdev: Remove unused
transaction item queue") which has removed the unused mechanism of
passing on the allocated memory between one phase and another.

It is time we admit that separating the preparation from the commit
phase is something that is best left for the driver to decide, and not
something that should be baked into the API, especially since there are
no switchdev callers that depend on this.

This patch removes the struct switchdev_trans member from switchdev port
attribute notifier structures, and converts drivers to not look at this
member.

In part, this patch contains a revert of my previous commit 2e554a7a5d
("net: dsa: propagate switchdev vlan_filtering prepare phase to
drivers").

For the most part, the conversion was trivial except for:
- Rocker's world implementation based on Broadcom OF-DPA had an odd
  implementation of ofdpa_port_attr_bridge_flags_set. The conversion was
  done mechanically, by pasting the implementation twice, then only
  keeping the code that would get executed during prepare phase on top,
  then only keeping the code that gets executed during the commit phase
  on bottom, then simplifying the resulting code until this was obtained.
- DSA's offloading of STP state, bridge flags, VLAN filtering and
  multicast router could be converted right away. But the ageing time
  could not, so a shim was introduced and this was left for a further
  commit.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
Reviewed-by: Linus Walleij <linus.walleij@linaro.org> # RTL8366RB
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-11 16:00:57 -08:00
Vladimir Oltean
cf6def51ba net: switchdev: delete switchdev_port_obj_add_now
After the removal of the transactional model inside
switchdev_port_obj_add_now, it has no added value and we can just call
switchdev_port_obj_notify directly, bypassing this function. Let's
delete it.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-11 16:00:56 -08:00
Vladimir Oltean
ffb68fc58e net: switchdev: remove the transaction structure from port object notifiers
Since the introduction of the switchdev API, port objects were
transmitted to drivers for offloading using a two-step transactional
model, with a prepare phase that was supposed to catch all errors, and a
commit phase that was supposed to never fail.

Some classes of failures can never be avoided, like hardware access, or
memory allocation. In the latter case, merely attempting to move the
memory allocation to the preparation phase makes it impossible to avoid
memory leaks, since commit 91cf8eceff ("switchdev: Remove unused
transaction item queue") which has removed the unused mechanism of
passing on the allocated memory between one phase and another.

It is time we admit that separating the preparation from the commit
phase is something that is best left for the driver to decide, and not
something that should be baked into the API, especially since there are
no switchdev callers that depend on this.

This patch removes the struct switchdev_trans member from switchdev port
object notifier structures, and converts drivers to not look at this
member.

Where driver conversion is trivial (like in the case of the Marvell
Prestera driver, NXP DPAA2 switch, TI CPSW, and Rocker drivers), it is
done in this patch.

Where driver conversion needs more attention (DSA, Mellanox Spectrum),
the conversion is left for subsequent patches and here we only fake the
prepare/commit phases at a lower level, just not in the switchdev
notifier itself.

Where the code has a natural structure that is best left alone as a
preparation and a commit phase (as in the case of the Ocelot switch),
that structure is left in place, just made to not depend upon the
switchdev transactional model.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Linus Walleij <linus.walleij@linaro.org>
Acked-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-11 16:00:56 -08:00
Vladimir Oltean
b7a9e0da2d net: switchdev: remove vid_begin -> vid_end range from VLAN objects
The call path of a switchdev VLAN addition to the bridge looks something
like this today:

        nbp_vlan_init
        |  __br_vlan_set_default_pvid
        |  |                       |
        |  |    br_afspec          |
        |  |        |              |
        |  |        v              |
        |  | br_process_vlan_info  |
        |  |        |              |
        |  |        v              |
        |  |   br_vlan_info        |
        |  |       / \            /
        |  |      /   \          /
        |  |     /     \        /
        |  |    /       \      /
        v  v   v         v    v
      nbp_vlan_add   br_vlan_add ------+
       |              ^      ^ |       |
       |             /       | |       |
       |            /       /  /       |
       \ br_vlan_get_master/  /        v
        \        ^        /  /  br_vlan_add_existing
         \       |       /  /          |
          \      |      /  /          /
           \     |     /  /          /
            \    |    /  /          /
             \   |   /  /          /
              v  |   | v          /
              __vlan_add         /
                 / |            /
                /  |           /
               v   |          /
   __vlan_vid_add  |         /
               \   |        /
                v  v        v
      br_switchdev_port_vlan_add

The ranges UAPI was introduced to the bridge in commit bdced7ef78
("bridge: support for multiple vlans and vlan ranges in setlink and
dellink requests") (Jan 10 2015). But the VLAN ranges (parsed in br_afspec)
have always been passed one by one, through struct bridge_vlan_info
tmp_vinfo, to br_vlan_info. So the range never went too far in depth.

Then Scott Feldman introduced the switchdev_port_bridge_setlink function
in commit 47f8328bb1 ("switchdev: add new switchdev bridge setlink").
That marked the introduction of the SWITCHDEV_OBJ_PORT_VLAN, which made
full use of the range. But switchdev_port_bridge_setlink was called like
this:

br_setlink
-> br_afspec
-> switchdev_port_bridge_setlink

Basically, the switchdev and the bridge code were not tightly integrated.
Then commit 41c498b935 ("bridge: restore br_setlink back to original")
came, and switchdev drivers were required to implement
.ndo_bridge_setlink = switchdev_port_bridge_setlink for a while.

In the meantime, commits such as 0944d6b5a2 ("bridge: try switchdev op
first in __vlan_vid_add/del") finally made switchdev penetrate the
br_vlan_info() barrier and start to develop the call path we have today.
But remember, br_vlan_info() still receives VLANs one by one.

Then Arkadi Sharshevsky refactored the switchdev API in 2017 in commit
29ab586c3d ("net: switchdev: Remove bridge bypass support from
switchdev") so that drivers would not implement .ndo_bridge_setlink any
longer. The switchdev_port_bridge_setlink also got deleted.
This refactoring removed the parallel bridge_setlink implementation from
switchdev, and left the only switchdev VLAN objects to be the ones
offloaded from __vlan_vid_add (basically RX filtering) and  __vlan_add
(the latter coming from commit 9c86ce2c1a ("net: bridge: Notify about
bridge VLANs")).

That is to say, today the switchdev VLAN object ranges are not used in
the kernel. Refactoring the above call path is a bit complicated, when
the bridge VLAN call path is already a bit complicated.

Let's go off and finish the job of commit 29ab586c3d by deleting the
bogus iteration through the VLAN ranges from the drivers. Some aspects
of this feature never made too much sense in the first place. For
example, what is a range of VLANs all having the BRIDGE_VLAN_INFO_PVID
flag supposed to mean, when a port can obviously have a single pvid?
This particular configuration _is_ denied as of commit 6623c60dc2
("bridge: vlan: enforce no pvid flag in vlan ranges"), but from an API
perspective, the driver still has to play pretend, and only offload the
vlan->vid_end as pvid. And the addition of a switchdev VLAN object can
modify the flags of another, completely unrelated, switchdev VLAN
object! (a VLAN that is PVID will invalidate the PVID flag from whatever
other VLAN had previously been offloaded with switchdev and had that
flag. Yet switchdev never notifies about that change, drivers are
supposed to guess).

Nonetheless, having a VLAN range in the API makes error handling look
scarier than it really is - unwinding on errors and all of that.
When in reality, no one really calls this API with more than one VLAN.
It is all unnecessary complexity.

And despite appearing pretentious (two-phase transactional model and
all), the switchdev API is really sloppy because the VLAN addition and
removal operations are not paired with one another (you can add a VLAN
100 times and delete it just once). The bridge notifies through
switchdev of a VLAN addition not only when the flags of an existing VLAN
change, but also when nothing changes. There are switchdev drivers out
there who don't like adding a VLAN that has already been added, and
those checks don't really belong at driver level. But the fact that the
API contains ranges is yet another factor that prevents this from being
addressed in the future.

Of the existing switchdev pieces of hardware, it appears that only
Mellanox Spectrum supports offloading more than one VLAN at a time,
through mlxsw_sp_port_vlan_set. I have kept that code internal to the
driver, because there is some more bookkeeping that makes use of it, but
I deleted it from the switchdev API. But since the switchdev support for
ranges has already been de facto deleted by a Mellanox employee and
nobody noticed for 4 years, I'm going to assume it's not a biggie.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com> # switchdev and mlxsw
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-11 16:00:56 -08:00
Linus Torvalds
c912fd05fa Fixes:
- Fix major TCP performance regression
 - Get NFSv4.2 READ_PLUS regression tests to pass
 - Improve NFSv4 COMPOUND memory allocation
 - Fix sparse warning
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAl/138wACgkQM2qzM29m
 f5c0QQ/+NkUxtmXd5lKXjzB0NcXsiQm9QxGvY52Oj75DHHprGmGkNEQAKczr/1Gu
 l+MArFXJTITrZRwbqQMA4uxwgCfup51atI12c27n1u5T9+bMicJIjT5yCtQ7rT2t
 U70VSZKgBlWWTcvfiEcFc1rloI3IY5c4ZYpeMxaXseegn6w3LYQfkZLcRdRleSz3
 P0IO59Eow8Wt/GxRXpeYv0sK2m8OK1OyknKAzbq9swrc0ARJzKIwuTDs7jPtlvg5
 SkDOTrXdSHwVvTrCqr9BwaNtQa76xR/Zo5UqKYgyzx3/NQ7h39hRTR5xLVst+Ynh
 3TgOPS0YDWlmRzjX0xhr5y+rwWFxRvS6uecaIMOSuqABQ1F0RwbfXE/XplQLhk1E
 kjL819y5MuUpOdjMx5SZEo0pC7VeAoqGmzvTunpf974ExTNvDiKf0fPFs74cYUzG
 /a4k3DYJQbzUgG1PzPElbKbPUwSk/W/M7p9Tw7R9dnX2huVa/2J6TllbnbUi6REf
 4qVqCe3WXFHE8Q9FCBuYEaTddToPqA4M98B8ba/pDYiqgfI8goWvGEQukuL7RES0
 0i3G5SMC5zScgk44RMewyNrzl8IzCJXITv39+YDQ9O4FVJJXTSAMoyQ5aXlzVhc6
 v+b4560cXoltEecFzooKjNbb+2FURKNgfeDk9xgG2DoydzelipU=
 =POBn
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.11-1' of git://git.linux-nfs.org/projects/cel/cel-2.6

Pull nfsd fixes from Chuck Lever:

 - Fix major TCP performance regression

 - Get NFSv4.2 READ_PLUS regression tests to pass

 - Improve NFSv4 COMPOUND memory allocation

 - Fix sparse warning

* tag 'nfsd-5.11-1' of git://git.linux-nfs.org/projects/cel/cel-2.6:
  NFSD: Restore NFSv4 decoding's SAVEMEM functionality
  SUNRPC: Handle TCP socket sends with kernel_sendpage() again
  NFSD: Fix sparse warning in nfssvc.c
  nfsd: Don't set eof on a truncated READ_PLUS
  nfsd: Fixes for nfsd4_encode_read_plus_data()
2021-01-11 11:35:46 -08:00
Dinghao Liu
869f4fdaf4 netfilter: nf_nat: Fix memleak in nf_nat_init
When register_pernet_subsys() fails, nf_nat_bysource
should be freed just like when nf_ct_extend_register()
fails.

Fixes: 1cd472bf03 ("netfilter: nf_nat: add nat hook register functions to nf_nat")
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-01-11 00:34:11 +01:00
j.nixdorf@avm.de
86b53fbf08 net: sunrpc: interpret the return value of kstrtou32 correctly
A return value of 0 means success. This is documented in lib/kstrtox.c.

This was found by trying to mount an NFS share from a link-local IPv6
address with the interface specified by its index:

  mount("[fe80::1%1]:/srv/nfs", "/mnt", "nfs", 0, "nolock,addr=fe80::1%1")

Before this commit this failed with EINVAL and also caused the following
message in dmesg:

  [...] NFS: bad IP address specified: addr=fe80::1%1

The syscall using the same address based on the interface name instead
of its index succeeds.

Credits for this patch go to my colleague Christian Speich, who traced
the origin of this bug to this line of code.

Signed-off-by: Johannes Nixdorf <j.nixdorf@avm.de>
Fixes: 00cfaa943e ("replace strict_strto calls")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-01-10 13:32:51 -05:00
Jesper Dangaard Brouer
f6351c3f1c netfilter: conntrack: fix reading nf_conntrack_buckets
The old way of changing the conntrack hashsize runtime was through changing
the module param via file /sys/module/nf_conntrack/parameters/hashsize. This
was extended to sysctl change in commit 3183ab8997 ("netfilter: conntrack:
allow increasing bucket size via sysctl too").

The commit introduced second "user" variable nf_conntrack_htable_size_user
which shadow actual variable nf_conntrack_htable_size. When hashsize is
changed via module param this "user" variable isn't updated. This results in
sysctl net/netfilter/nf_conntrack_buckets shows the wrong value when users
update via the old way.

This patch fix the issue by always updating "user" variable when reading the
proc file. This will take care of changes to the actual variable without
sysctl need to be aware.

Fixes: 3183ab8997 ("netfilter: conntrack: allow increasing bucket size via sysctl too")
Reported-by: Yoel Caspersen <yoel@kviknet.dk>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-01-10 09:39:22 +01:00
Geliang Tang
0be2ac287b mptcp: add the mibs for MP_PRIO
This patch added the mibs for MP_PRIO, MPTCP_MIB_MPPRIOTX for transmitting
of the MP_PRIO suboption, and MPTCP_MIB_MPPRIORX for receiving of it.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09 18:18:44 -08:00
Geliang Tang
0f9f696a50 mptcp: add set_flags command in PM netlink
This patch added a new command MPTCP_PM_CMD_SET_FLAGS in PM netlink:

In mptcp_nl_cmd_set_flags, parse the input address, get the backup value
according to whether the address's FLAG_BACKUP flag is set from the
user-space. Then check whether this address had been added in the local
address list. If it had been, then call mptcp_nl_addr_backup to deal with
this address.

In mptcp_nl_addr_backup, traverse all the existing msk sockets to find
the relevant sockets, and call mptcp_pm_nl_mp_prio_send_ack to send out
a MP_PRIO ACK packet.

Finally in mptcp_nl_cmd_set_flags, set or clear the address's FLAG_BACKUP
flag.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09 18:18:43 -08:00
Geliang Tang
40453a5c61 mptcp: add the incoming MP_PRIO support
This patch added the incoming MP_PRIO logic:

Added a flag named mp_prio in struct mptcp_options_received, to mark the
MP_PRIO is received, and save the priority value to struct
mptcp_options_received's backup member. Then invoke
mptcp_pm_mp_prio_received with the receiving subsocket and the backup
value.

In mptcp_pm_mp_prio_received, get the subflow context according the input
subsocket, and change the subflow's backup as the incoming priority value.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09 18:18:43 -08:00
Geliang Tang
067065422f mptcp: add the outgoing MP_PRIO support
This patch added the outgoing MP_PRIO logic:

In mptcp_pm_nl_mp_prio_send_ack, find the related subflow and subsocket
according to the input parameter addr. Save the input priority value to
suflow's backup, then set subflow's send_mp_prio flag to true, and save
the input priority value to suflow's request_bkup. Finally, send out a
pure ACK on the related subsocket.

In mptcp_established_options_mp_prio, check whether the subflow's
send_mp_prio is set. If it is, this is the packet for sending MP_PRIO.
So save subflow->request_bkup value to mptcp_out_options's backup, and
change the option type to OPTION_MPTCP_PRIO.

In mptcp_write_options, clear the send_mp_prio flag and send out the
MP_PRIO suboption with mptcp_out_options's backup value.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09 18:18:43 -08:00
Geliang Tang
efd5a4c04e mptcp: add the address ID assignment bitmap
Currently the address ID set by the netlink PM from user-space is
overridden by the kernel. This patch added the address ID assignment
bitmap to allow user-space to set the address ID.

Use a per netns bitmask id_bitmap (256 bits) to keep track of in-use IDs.
And use next_id to keep track of the highest ID currently in use. If the
user-space provides an ID at endpoint creation time, try to use it. If
already in use, endpoint creation fails. Otherwise pick the first ID
available after the highest currently in use, with wrap-around.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09 18:18:43 -08:00
Vladimir Oltean
4b9c935898 net: dsa: dsa_legacy_fdb_{add,del} can be static
Introduced in commit 37b8da1a3c ("net: dsa: Move FDB add/del
implementation inside DSA") in net/dsa/legacy.c, these functions were
moved again to slave.c as part of commit 2a93c1a365 ("net: dsa: Allow
compiling out legacy support"), before actually deleting net/dsa/slave.c
in 93e86b3bc8 ("net: dsa: Remove legacy probing support"). Along with
that movement there should have been a deletion of the prototypes from
dsa_priv.h, they are not useful.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210108233054.1222278-1-olteanv@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09 18:05:52 -08:00
Hoang Le
b774134464 tipc: fix NULL deref in tipc_link_xmit()
The buffer list can have zero skb as following path:
tipc_named_node_up()->tipc_node_xmit()->tipc_link_xmit(), so
we need to check the list before casting an &sk_buff.

Fault report:
 [] tipc: Bulk publication failure
 [] general protection fault, probably for non-canonical [#1] PREEMPT [...]
 [] KASAN: null-ptr-deref in range [0x00000000000000c8-0x00000000000000cf]
 [] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 5.10.0-rc4+ #2
 [] Hardware name: Bochs ..., BIOS Bochs 01/01/2011
 [] RIP: 0010:tipc_link_xmit+0xc1/0x2180
 [] Code: 24 b8 00 00 00 00 4d 39 ec 4c 0f 44 e8 e8 d7 0a 10 f9 48 [...]
 [] RSP: 0018:ffffc90000006ea0 EFLAGS: 00010202
 [] RAX: dffffc0000000000 RBX: ffff8880224da000 RCX: 1ffff11003d3cc0d
 [] RDX: 0000000000000019 RSI: ffffffff886007b9 RDI: 00000000000000c8
 [] RBP: ffffc90000007018 R08: 0000000000000001 R09: fffff52000000ded
 [] R10: 0000000000000003 R11: fffff52000000dec R12: ffffc90000007148
 [] R13: 0000000000000000 R14: 0000000000000000 R15: ffffc90000007018
 [] FS:  0000000000000000(0000) GS:ffff888037400000(0000) knlGS:000[...]
 [] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [] CR2: 00007fffd2db5000 CR3: 000000002b08f000 CR4: 00000000000006f0

Fixes: af9b028e27 ("tipc: make media xmit call outside node spinlock context")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Link: https://lore.kernel.org/r/20210108071337.3598-1-hoang.h.le@dektech.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09 14:44:47 -08:00
Eric Dumazet
1d11fa6967 net-gro: remove GRO_DROP
GRO_DROP can only be returned from napi_gro_frags()
if the skb has not been allocated by a prior napi_get_frags()

Since drivers must use napi_get_frags() and test its result
before populating the skb with metadata, we can safely remove
GRO_DROP since it offers no practical use.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09 14:24:26 -08:00
Aya Levin
b210de4f8c net: ipv6: Validate GSO SKB before finish IPv6 processing
There are cases where GSO segment's length exceeds the egress MTU:
 - Forwarding of a TCP GRO skb, when DF flag is not set.
 - Forwarding of an skb that arrived on a virtualisation interface
   (virtio-net/vhost/tap) with TSO/GSO size set by other network
   stack.
 - Local GSO skb transmitted on an NETIF_F_TSO tunnel stacked over an
   interface with a smaller MTU.
 - Arriving GRO skb (or GSO skb in a virtualised environment) that is
   bridged to a NETIF_F_TSO tunnel stacked over an interface with an
   insufficient MTU.

If so:
 - Consume the SKB and its segments.
 - Issue an ICMP packet with 'Packet Too Big' message containing the
   MTU, allowing the source host to reduce its Path MTU appropriately.

Note: These cases are handled in the same manner in IPv4 output finish.
This patch aligns the behavior of IPv6 and the one of IPv4.

Fixes: 9e50849054 ("netfilter: ipv6: move POSTROUTING invocation before fragmentation")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/1610027418-30438-1-git-send-email-ayal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09 14:06:32 -08:00
Menglong Dong
efb5b338da net: bridge: fix misspellings using codespell tool
Some typos are found out by codespell tool:

$ codespell ./net/bridge/
./net/bridge/br_stp.c:604: permanant  ==> permanent
./net/bridge/br_stp.c:605: persistance  ==> persistence
./net/bridge/br.c:125: underlaying  ==> underlying
./net/bridge/br_input.c:43: modue  ==> mode
./net/bridge/br_mrp.c:828: Determin  ==> Determine
./net/bridge/br_mrp.c:848: Determin  ==> Determine
./net/bridge/br_mrp.c:897: Determin  ==> Determine

Fix typos found by codespell.

Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20210108025332.52480-1-dong.menglong@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09 13:54:47 -08:00
Jakub Kicinski
766b0515d5 net: make sure devices go through netdev_wait_all_refs
If register_netdevice() fails at the very last stage - the
notifier call - some subsystems may have already seen it and
grabbed a reference. struct net_device can't be freed right
away without calling netdev_wait_all_refs().

Now that we have a clean interface in form of dev->needs_free_netdev
and lenient free_netdev() we can undo what commit 93ee31f14f ("[NET]:
Fix free_netdev on register_netdev failure.") has done and complete
the unregistration path by bringing the net_set_todo() call back.

After registration fails user is still expected to explicitly
free the net_device, so make sure ->needs_free_netdev is cleared,
otherwise rolling back the registration will cause the old double
free for callers who release rtnl_lock before the free.

This also solves the problem of priv_destructor not being called
on notifier error.

net_set_todo() will be moved back into unregister_netdevice_queue()
in a follow up.

Reported-by: Hulk Robot <hulkci@huawei.com>
Reported-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 19:27:41 -08:00
Jakub Kicinski
c269a24ce0 net: make free_netdev() more lenient with unregistering devices
There are two flavors of handling netdev registration:
 - ones called without holding rtnl_lock: register_netdev() and
   unregister_netdev(); and
 - those called with rtnl_lock held: register_netdevice() and
   unregister_netdevice().

While the semantics of the former are pretty clear, the same can't
be said about the latter. The netdev_todo mechanism is utilized to
perform some of the device unregistering tasks and it hooks into
rtnl_unlock() so the locked variants can't actually finish the work.
In general free_netdev() does not mix well with locked calls. Most
drivers operating under rtnl_lock set dev->needs_free_netdev to true
and expect core to make the free_netdev() call some time later.

The part where this becomes most problematic is error paths. There is
no way to unwind the state cleanly after a call to register_netdevice(),
since unreg can't be performed fully without dropping locks.

Make free_netdev() more lenient, and defer the freeing if device
is being unregistered. This allows error paths to simply call
free_netdev() both after register_netdevice() failed, and after
a call to unregister_netdevice() but before dropping rtnl_lock.

Simplify the error paths which are currently doing gymnastics
around free_netdev() handling.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 19:27:41 -08:00
Jakub Kicinski
2b446e650b docs: net: explain struct net_device lifetime
Explain the two basic flows of struct net_device's operation.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 19:27:41 -08:00
Julian Wiedmann
fda4fde297 net: ip_tunnel: clean up endianness conversions
sparse complains about some harmless endianness issues:

> net/ipv4/ip_tunnel_core.c:225:43: warning: cast to restricted __be16
> net/ipv4/ip_tunnel_core.c:225:43: warning: incorrect type in initializer (different base types)
> net/ipv4/ip_tunnel_core.c:225:43:    expected restricted __be16 [usertype] mtu
> net/ipv4/ip_tunnel_core.c:225:43:    got unsigned short [usertype]

iptunnel_pmtud_build_icmp() uses the wrong flavour of byte-order conversion
when storing the MTU into the ICMPv4 packet. Use htons(), just like
iptunnel_pmtud_build_icmpv6() does.

> net/ipv4/ip_tunnel_core.c:248:35: warning: cast from restricted __be16
> net/ipv4/ip_tunnel_core.c:248:35: warning: incorrect type in argument 3 (different base types)
> net/ipv4/ip_tunnel_core.c:248:35:    expected unsigned short type
> net/ipv4/ip_tunnel_core.c:248:35:    got restricted __be16 [usertype]
> net/ipv4/ip_tunnel_core.c:341:35: warning: cast from restricted __be16
> net/ipv4/ip_tunnel_core.c:341:35: warning: incorrect type in argument 3 (different base types)
> net/ipv4/ip_tunnel_core.c:341:35:    expected unsigned short type
> net/ipv4/ip_tunnel_core.c:341:35:    got restricted __be16 [usertype]

eth_header() wants the Ethertype in host-order, use the correct flavour of
byte-order conversion.

> net/ipv4/ip_tunnel_core.c:600:45: warning: restricted __be16 degrades to integer
> net/ipv4/ip_tunnel_core.c:609:30: warning: incorrect type in assignment (different base types)
> net/ipv4/ip_tunnel_core.c:609:30:    expected int type
> net/ipv4/ip_tunnel_core.c:609:30:    got restricted __be16 [usertype]
> net/ipv4/ip_tunnel_core.c:619:30: warning: incorrect type in assignment (different base types)
> net/ipv4/ip_tunnel_core.c:619:30:    expected int type
> net/ipv4/ip_tunnel_core.c:619:30:    got restricted __be16 [usertype]
> net/ipv4/ip_tunnel_core.c:629:30: warning: incorrect type in assignment (different base types)
> net/ipv4/ip_tunnel_core.c:629:30:    expected int type
> net/ipv4/ip_tunnel_core.c:629:30:    got restricted __be16 [usertype]

The TUNNEL_* types are big-endian, so adjust the type of the local
variable in ip_tun_parse_opts().

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Link: https://lore.kernel.org/r/20210107144008.25777-1-jwi@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 19:25:35 -08:00
Baptiste Lepers
fd2ddef043 udp: Prevent reuseport_select_sock from reading uninitialized socks
reuse->socks[] is modified concurrently by reuseport_add_sock. To
prevent reading values that have not been fully initialized, only read
the array up until the last known safe index instead of incorrectly
re-reading the last index of the array.

Fixes: acdcecc612 ("udp: correct reuseport selection with connected sockets")
Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20210107051110.12247-1-baptiste.lepers@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 19:15:40 -08:00
Dongseok Yi
53475c5dd8 net: fix use-after-free when UDP GRO with shared fraglist
skbs in fraglist could be shared by a BPF filter loaded at TC. If TC
writes, it will call skb_ensure_writable -> pskb_expand_head to create
a private linear section for the head_skb. And then call
skb_clone_fraglist -> skb_get on each skb in the fraglist.

skb_segment_list overwrites part of the skb linear section of each
fragment itself. Even after skb_clone, the frag_skbs share their
linear section with their clone in PF_PACKET.

Both sk_receive_queue of PF_PACKET and PF_INET (or PF_INET6) can have
a link for the same frag_skbs chain. If a new skb (not frags) is
queued to one of the sk_receive_queue, multiple ptypes can see and
release this. It causes use-after-free.

[ 4443.426215] ------------[ cut here ]------------
[ 4443.426222] refcount_t: underflow; use-after-free.
[ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
[ 4443.426808] Call trace:
[ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426823]  skb_release_data+0x144/0x264
[ 4443.426828]  kfree_skb+0x58/0xc4
[ 4443.426832]  skb_queue_purge+0x64/0x9c
[ 4443.426844]  packet_set_ring+0x5f0/0x820
[ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
[ 4443.426853]  __sys_setsockopt+0x188/0x278
[ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
[ 4443.426869]  el0_svc_common+0xf0/0x1d0
[ 4443.426873]  el0_svc_handler+0x74/0x98
[ 4443.426880]  el0_svc+0x8/0xc

Fixes: 3a1296a38d (net: Support GRO/GSO fraglist chaining.)
Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/1610072918-174177-1-git-send-email-dseok.yi@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 19:14:02 -08:00
Lorenzo Bianconi
be9df4aff6 net, xdp: Introduce xdp_prepare_buff utility routine
Introduce xdp_prepare_buff utility routine to initialize per-descriptor
xdp_buff fields (e.g. xdp_buff pointers). Rely on xdp_prepare_buff() in
all XDP capable drivers.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Shay Agroskin <shayagr@amazon.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Acked-by: Marcin Wojtas <mw@semihalf.com>
Link: https://lore.kernel.org/bpf/45f46f12295972a97da8ca01990b3e71501e9d89.1608670965.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-08 13:39:24 -08:00
Lorenzo Bianconi
43b5169d83 net, xdp: Introduce xdp_init_buff utility routine
Introduce xdp_init_buff utility routine to initialize xdp_buff fields
const over NAPI iterations (e.g. frame_sz or rxq pointer). Rely on
xdp_init_buff in all XDP capable drivers.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Shay Agroskin <shayagr@amazon.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Acked-by: Marcin Wojtas <mw@semihalf.com>
Link: https://lore.kernel.org/bpf/7f8329b6da1434dc2b05a77f2e800b29628a8913.1608670965.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-08 13:39:24 -08:00
Zheng Yongjun
ec24e11e08 bpf: Replace fput with sockfd_put in sock map
The function sockfd_lookup uses fget on the value that is stored in
the file field of the returned structure, so fput should ultimately
be applied to this value. This can be done directly, but it seems
better to use the specific macro sockfd_put, which does the same
thing.

The cleanup was done using the following semantic patch:
    (http://www.emn.fr/x-info/coccinelle/)

    // <smpl>
    @@
    expression s;
    @@

       s = sockfd_lookup(...)
       ...
    +  sockfd_put(s);
    ?- fput(s->file);
    // </smpl>

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201229134834.22962-1-zhengyongjun3@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-08 13:39:24 -08:00
Jakub Kicinski
833d22f2f9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Trivial conflict in CAN on file rename.

Conflicts:
	drivers/net/can/m_can/tcan4x5x-core.c

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 13:28:00 -08:00
Ilan Peer
51d62f2f2c cfg80211: Save the regulatory domain with a lock
Saving the regulatory domain while setting custom regulatory domain
was done while accessing a RCU protected pointer but without any
protection.

Fix this by using RTNL while accessing the pointer.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Reported-by: syzbot+27771d4abcd9b7a1f5d3@syzkaller.appspotmail.com
Reported-by: syzbot+db4035751c56c0079282@syzkaller.appspotmail.com
Reported-by: Hans de Goede <hdegoede@redhat.com>
Fixes: beee246951 ("cfg80211: Save the regulatory domain when setting custom regulatory")
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20210105165657.613e9a876829.Ia38d27dbebea28bf9c56d70691d243186ede70e7@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-08 21:03:27 +01:00