Commit Graph

64093 Commits

Author SHA1 Message Date
Geliang Tang
0f9f696a50 mptcp: add set_flags command in PM netlink
This patch added a new command MPTCP_PM_CMD_SET_FLAGS in PM netlink:

In mptcp_nl_cmd_set_flags, parse the input address, get the backup value
according to whether the address's FLAG_BACKUP flag is set from the
user-space. Then check whether this address had been added in the local
address list. If it had been, then call mptcp_nl_addr_backup to deal with
this address.

In mptcp_nl_addr_backup, traverse all the existing msk sockets to find
the relevant sockets, and call mptcp_pm_nl_mp_prio_send_ack to send out
a MP_PRIO ACK packet.

Finally in mptcp_nl_cmd_set_flags, set or clear the address's FLAG_BACKUP
flag.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09 18:18:43 -08:00
Geliang Tang
40453a5c61 mptcp: add the incoming MP_PRIO support
This patch added the incoming MP_PRIO logic:

Added a flag named mp_prio in struct mptcp_options_received, to mark the
MP_PRIO is received, and save the priority value to struct
mptcp_options_received's backup member. Then invoke
mptcp_pm_mp_prio_received with the receiving subsocket and the backup
value.

In mptcp_pm_mp_prio_received, get the subflow context according the input
subsocket, and change the subflow's backup as the incoming priority value.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09 18:18:43 -08:00
Geliang Tang
067065422f mptcp: add the outgoing MP_PRIO support
This patch added the outgoing MP_PRIO logic:

In mptcp_pm_nl_mp_prio_send_ack, find the related subflow and subsocket
according to the input parameter addr. Save the input priority value to
suflow's backup, then set subflow's send_mp_prio flag to true, and save
the input priority value to suflow's request_bkup. Finally, send out a
pure ACK on the related subsocket.

In mptcp_established_options_mp_prio, check whether the subflow's
send_mp_prio is set. If it is, this is the packet for sending MP_PRIO.
So save subflow->request_bkup value to mptcp_out_options's backup, and
change the option type to OPTION_MPTCP_PRIO.

In mptcp_write_options, clear the send_mp_prio flag and send out the
MP_PRIO suboption with mptcp_out_options's backup value.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09 18:18:43 -08:00
Geliang Tang
efd5a4c04e mptcp: add the address ID assignment bitmap
Currently the address ID set by the netlink PM from user-space is
overridden by the kernel. This patch added the address ID assignment
bitmap to allow user-space to set the address ID.

Use a per netns bitmask id_bitmap (256 bits) to keep track of in-use IDs.
And use next_id to keep track of the highest ID currently in use. If the
user-space provides an ID at endpoint creation time, try to use it. If
already in use, endpoint creation fails. Otherwise pick the first ID
available after the highest currently in use, with wrap-around.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09 18:18:43 -08:00
Vladimir Oltean
4b9c935898 net: dsa: dsa_legacy_fdb_{add,del} can be static
Introduced in commit 37b8da1a3c ("net: dsa: Move FDB add/del
implementation inside DSA") in net/dsa/legacy.c, these functions were
moved again to slave.c as part of commit 2a93c1a365 ("net: dsa: Allow
compiling out legacy support"), before actually deleting net/dsa/slave.c
in 93e86b3bc8 ("net: dsa: Remove legacy probing support"). Along with
that movement there should have been a deletion of the prototypes from
dsa_priv.h, they are not useful.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210108233054.1222278-1-olteanv@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09 18:05:52 -08:00
Hoang Le
b774134464 tipc: fix NULL deref in tipc_link_xmit()
The buffer list can have zero skb as following path:
tipc_named_node_up()->tipc_node_xmit()->tipc_link_xmit(), so
we need to check the list before casting an &sk_buff.

Fault report:
 [] tipc: Bulk publication failure
 [] general protection fault, probably for non-canonical [#1] PREEMPT [...]
 [] KASAN: null-ptr-deref in range [0x00000000000000c8-0x00000000000000cf]
 [] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 5.10.0-rc4+ #2
 [] Hardware name: Bochs ..., BIOS Bochs 01/01/2011
 [] RIP: 0010:tipc_link_xmit+0xc1/0x2180
 [] Code: 24 b8 00 00 00 00 4d 39 ec 4c 0f 44 e8 e8 d7 0a 10 f9 48 [...]
 [] RSP: 0018:ffffc90000006ea0 EFLAGS: 00010202
 [] RAX: dffffc0000000000 RBX: ffff8880224da000 RCX: 1ffff11003d3cc0d
 [] RDX: 0000000000000019 RSI: ffffffff886007b9 RDI: 00000000000000c8
 [] RBP: ffffc90000007018 R08: 0000000000000001 R09: fffff52000000ded
 [] R10: 0000000000000003 R11: fffff52000000dec R12: ffffc90000007148
 [] R13: 0000000000000000 R14: 0000000000000000 R15: ffffc90000007018
 [] FS:  0000000000000000(0000) GS:ffff888037400000(0000) knlGS:000[...]
 [] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [] CR2: 00007fffd2db5000 CR3: 000000002b08f000 CR4: 00000000000006f0

Fixes: af9b028e27 ("tipc: make media xmit call outside node spinlock context")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Link: https://lore.kernel.org/r/20210108071337.3598-1-hoang.h.le@dektech.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09 14:44:47 -08:00
Eric Dumazet
1d11fa6967 net-gro: remove GRO_DROP
GRO_DROP can only be returned from napi_gro_frags()
if the skb has not been allocated by a prior napi_get_frags()

Since drivers must use napi_get_frags() and test its result
before populating the skb with metadata, we can safely remove
GRO_DROP since it offers no practical use.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09 14:24:26 -08:00
Aya Levin
b210de4f8c net: ipv6: Validate GSO SKB before finish IPv6 processing
There are cases where GSO segment's length exceeds the egress MTU:
 - Forwarding of a TCP GRO skb, when DF flag is not set.
 - Forwarding of an skb that arrived on a virtualisation interface
   (virtio-net/vhost/tap) with TSO/GSO size set by other network
   stack.
 - Local GSO skb transmitted on an NETIF_F_TSO tunnel stacked over an
   interface with a smaller MTU.
 - Arriving GRO skb (or GSO skb in a virtualised environment) that is
   bridged to a NETIF_F_TSO tunnel stacked over an interface with an
   insufficient MTU.

If so:
 - Consume the SKB and its segments.
 - Issue an ICMP packet with 'Packet Too Big' message containing the
   MTU, allowing the source host to reduce its Path MTU appropriately.

Note: These cases are handled in the same manner in IPv4 output finish.
This patch aligns the behavior of IPv6 and the one of IPv4.

Fixes: 9e50849054 ("netfilter: ipv6: move POSTROUTING invocation before fragmentation")
Signed-off-by: Aya Levin <ayal@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/1610027418-30438-1-git-send-email-ayal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09 14:06:32 -08:00
Menglong Dong
efb5b338da net: bridge: fix misspellings using codespell tool
Some typos are found out by codespell tool:

$ codespell ./net/bridge/
./net/bridge/br_stp.c:604: permanant  ==> permanent
./net/bridge/br_stp.c:605: persistance  ==> persistence
./net/bridge/br.c:125: underlaying  ==> underlying
./net/bridge/br_input.c:43: modue  ==> mode
./net/bridge/br_mrp.c:828: Determin  ==> Determine
./net/bridge/br_mrp.c:848: Determin  ==> Determine
./net/bridge/br_mrp.c:897: Determin  ==> Determine

Fix typos found by codespell.

Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20210108025332.52480-1-dong.menglong@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09 13:54:47 -08:00
Jakub Kicinski
766b0515d5 net: make sure devices go through netdev_wait_all_refs
If register_netdevice() fails at the very last stage - the
notifier call - some subsystems may have already seen it and
grabbed a reference. struct net_device can't be freed right
away without calling netdev_wait_all_refs().

Now that we have a clean interface in form of dev->needs_free_netdev
and lenient free_netdev() we can undo what commit 93ee31f14f ("[NET]:
Fix free_netdev on register_netdev failure.") has done and complete
the unregistration path by bringing the net_set_todo() call back.

After registration fails user is still expected to explicitly
free the net_device, so make sure ->needs_free_netdev is cleared,
otherwise rolling back the registration will cause the old double
free for callers who release rtnl_lock before the free.

This also solves the problem of priv_destructor not being called
on notifier error.

net_set_todo() will be moved back into unregister_netdevice_queue()
in a follow up.

Reported-by: Hulk Robot <hulkci@huawei.com>
Reported-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 19:27:41 -08:00
Jakub Kicinski
c269a24ce0 net: make free_netdev() more lenient with unregistering devices
There are two flavors of handling netdev registration:
 - ones called without holding rtnl_lock: register_netdev() and
   unregister_netdev(); and
 - those called with rtnl_lock held: register_netdevice() and
   unregister_netdevice().

While the semantics of the former are pretty clear, the same can't
be said about the latter. The netdev_todo mechanism is utilized to
perform some of the device unregistering tasks and it hooks into
rtnl_unlock() so the locked variants can't actually finish the work.
In general free_netdev() does not mix well with locked calls. Most
drivers operating under rtnl_lock set dev->needs_free_netdev to true
and expect core to make the free_netdev() call some time later.

The part where this becomes most problematic is error paths. There is
no way to unwind the state cleanly after a call to register_netdevice(),
since unreg can't be performed fully without dropping locks.

Make free_netdev() more lenient, and defer the freeing if device
is being unregistered. This allows error paths to simply call
free_netdev() both after register_netdevice() failed, and after
a call to unregister_netdevice() but before dropping rtnl_lock.

Simplify the error paths which are currently doing gymnastics
around free_netdev() handling.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 19:27:41 -08:00
Jakub Kicinski
2b446e650b docs: net: explain struct net_device lifetime
Explain the two basic flows of struct net_device's operation.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 19:27:41 -08:00
Julian Wiedmann
fda4fde297 net: ip_tunnel: clean up endianness conversions
sparse complains about some harmless endianness issues:

> net/ipv4/ip_tunnel_core.c:225:43: warning: cast to restricted __be16
> net/ipv4/ip_tunnel_core.c:225:43: warning: incorrect type in initializer (different base types)
> net/ipv4/ip_tunnel_core.c:225:43:    expected restricted __be16 [usertype] mtu
> net/ipv4/ip_tunnel_core.c:225:43:    got unsigned short [usertype]

iptunnel_pmtud_build_icmp() uses the wrong flavour of byte-order conversion
when storing the MTU into the ICMPv4 packet. Use htons(), just like
iptunnel_pmtud_build_icmpv6() does.

> net/ipv4/ip_tunnel_core.c:248:35: warning: cast from restricted __be16
> net/ipv4/ip_tunnel_core.c:248:35: warning: incorrect type in argument 3 (different base types)
> net/ipv4/ip_tunnel_core.c:248:35:    expected unsigned short type
> net/ipv4/ip_tunnel_core.c:248:35:    got restricted __be16 [usertype]
> net/ipv4/ip_tunnel_core.c:341:35: warning: cast from restricted __be16
> net/ipv4/ip_tunnel_core.c:341:35: warning: incorrect type in argument 3 (different base types)
> net/ipv4/ip_tunnel_core.c:341:35:    expected unsigned short type
> net/ipv4/ip_tunnel_core.c:341:35:    got restricted __be16 [usertype]

eth_header() wants the Ethertype in host-order, use the correct flavour of
byte-order conversion.

> net/ipv4/ip_tunnel_core.c:600:45: warning: restricted __be16 degrades to integer
> net/ipv4/ip_tunnel_core.c:609:30: warning: incorrect type in assignment (different base types)
> net/ipv4/ip_tunnel_core.c:609:30:    expected int type
> net/ipv4/ip_tunnel_core.c:609:30:    got restricted __be16 [usertype]
> net/ipv4/ip_tunnel_core.c:619:30: warning: incorrect type in assignment (different base types)
> net/ipv4/ip_tunnel_core.c:619:30:    expected int type
> net/ipv4/ip_tunnel_core.c:619:30:    got restricted __be16 [usertype]
> net/ipv4/ip_tunnel_core.c:629:30: warning: incorrect type in assignment (different base types)
> net/ipv4/ip_tunnel_core.c:629:30:    expected int type
> net/ipv4/ip_tunnel_core.c:629:30:    got restricted __be16 [usertype]

The TUNNEL_* types are big-endian, so adjust the type of the local
variable in ip_tun_parse_opts().

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Link: https://lore.kernel.org/r/20210107144008.25777-1-jwi@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 19:25:35 -08:00
Baptiste Lepers
fd2ddef043 udp: Prevent reuseport_select_sock from reading uninitialized socks
reuse->socks[] is modified concurrently by reuseport_add_sock. To
prevent reading values that have not been fully initialized, only read
the array up until the last known safe index instead of incorrectly
re-reading the last index of the array.

Fixes: acdcecc612 ("udp: correct reuseport selection with connected sockets")
Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20210107051110.12247-1-baptiste.lepers@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 19:15:40 -08:00
Dongseok Yi
53475c5dd8 net: fix use-after-free when UDP GRO with shared fraglist
skbs in fraglist could be shared by a BPF filter loaded at TC. If TC
writes, it will call skb_ensure_writable -> pskb_expand_head to create
a private linear section for the head_skb. And then call
skb_clone_fraglist -> skb_get on each skb in the fraglist.

skb_segment_list overwrites part of the skb linear section of each
fragment itself. Even after skb_clone, the frag_skbs share their
linear section with their clone in PF_PACKET.

Both sk_receive_queue of PF_PACKET and PF_INET (or PF_INET6) can have
a link for the same frag_skbs chain. If a new skb (not frags) is
queued to one of the sk_receive_queue, multiple ptypes can see and
release this. It causes use-after-free.

[ 4443.426215] ------------[ cut here ]------------
[ 4443.426222] refcount_t: underflow; use-after-free.
[ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
[ 4443.426808] Call trace:
[ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426823]  skb_release_data+0x144/0x264
[ 4443.426828]  kfree_skb+0x58/0xc4
[ 4443.426832]  skb_queue_purge+0x64/0x9c
[ 4443.426844]  packet_set_ring+0x5f0/0x820
[ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
[ 4443.426853]  __sys_setsockopt+0x188/0x278
[ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
[ 4443.426869]  el0_svc_common+0xf0/0x1d0
[ 4443.426873]  el0_svc_handler+0x74/0x98
[ 4443.426880]  el0_svc+0x8/0xc

Fixes: 3a1296a38d (net: Support GRO/GSO fraglist chaining.)
Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/1610072918-174177-1-git-send-email-dseok.yi@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 19:14:02 -08:00
Lorenzo Bianconi
be9df4aff6 net, xdp: Introduce xdp_prepare_buff utility routine
Introduce xdp_prepare_buff utility routine to initialize per-descriptor
xdp_buff fields (e.g. xdp_buff pointers). Rely on xdp_prepare_buff() in
all XDP capable drivers.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Shay Agroskin <shayagr@amazon.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Acked-by: Marcin Wojtas <mw@semihalf.com>
Link: https://lore.kernel.org/bpf/45f46f12295972a97da8ca01990b3e71501e9d89.1608670965.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-08 13:39:24 -08:00
Lorenzo Bianconi
43b5169d83 net, xdp: Introduce xdp_init_buff utility routine
Introduce xdp_init_buff utility routine to initialize xdp_buff fields
const over NAPI iterations (e.g. frame_sz or rxq pointer). Rely on
xdp_init_buff in all XDP capable drivers.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Shay Agroskin <shayagr@amazon.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Acked-by: Marcin Wojtas <mw@semihalf.com>
Link: https://lore.kernel.org/bpf/7f8329b6da1434dc2b05a77f2e800b29628a8913.1608670965.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-08 13:39:24 -08:00
Zheng Yongjun
ec24e11e08 bpf: Replace fput with sockfd_put in sock map
The function sockfd_lookup uses fget on the value that is stored in
the file field of the returned structure, so fput should ultimately
be applied to this value. This can be done directly, but it seems
better to use the specific macro sockfd_put, which does the same
thing.

The cleanup was done using the following semantic patch:
    (http://www.emn.fr/x-info/coccinelle/)

    // <smpl>
    @@
    expression s;
    @@

       s = sockfd_lookup(...)
       ...
    +  sockfd_put(s);
    ?- fput(s->file);
    // </smpl>

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201229134834.22962-1-zhengyongjun3@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-08 13:39:24 -08:00
Jakub Kicinski
833d22f2f9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Trivial conflict in CAN on file rename.

Conflicts:
	drivers/net/can/m_can/tcan4x5x-core.c

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 13:28:00 -08:00
Ilan Peer
51d62f2f2c cfg80211: Save the regulatory domain with a lock
Saving the regulatory domain while setting custom regulatory domain
was done while accessing a RCU protected pointer but without any
protection.

Fix this by using RTNL while accessing the pointer.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Reported-by: syzbot+27771d4abcd9b7a1f5d3@syzkaller.appspotmail.com
Reported-by: syzbot+db4035751c56c0079282@syzkaller.appspotmail.com
Reported-by: Hans de Goede <hdegoede@redhat.com>
Fixes: beee246951 ("cfg80211: Save the regulatory domain when setting custom regulatory")
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20210105165657.613e9a876829.Ia38d27dbebea28bf9c56d70691d243186ede70e7@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2021-01-08 21:03:27 +01:00
Eyal Birger
18f976960b xfrm: interface: enable TSO on xfrm interfaces
Underlying xfrm output supports gso packets.
Declare support in hw_features and adapt the xmit MTU check to pass GSO
packets.

Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-01-08 12:59:34 +01:00
Petr Machata
b19218b27f nexthop: Bounce NHA_GATEWAY in FDB nexthop groups
The function nh_check_attr_group() is called to validate nexthop groups.
The intention of that code seems to have been to bounce all attributes
above NHA_GROUP_TYPE except for NHA_FDB. However instead it bounces all
these attributes except when NHA_FDB attribute is present--then it accepts
them.

NHA_FDB validation that takes place before, in rtm_to_nh_config(), already
bounces NHA_OIF, NHA_BLACKHOLE, NHA_ENCAP and NHA_ENCAP_TYPE. Yet further
back, NHA_GROUPS and NHA_MASTER are bounced unconditionally.

But that still leaves NHA_GATEWAY as an attribute that would be accepted in
FDB nexthop groups (with no meaning), so long as it keeps the address
family as unspecified:

 # ip nexthop add id 1 fdb via 127.0.0.1
 # ip nexthop add id 10 fdb via default group 1

The nexthop code is still relatively new and likely not used very broadly,
and the FDB bits are newer still. Even though there is a reproducer out
there, it relies on an improbable gateway arguments "via default", "via
all" or "via any". Given all this, I believe it is OK to reformulate the
condition to do the right thing and bounce NHA_GATEWAY.

Fixes: 38428d6871 ("nexthop: support for fdb ecmp nexthops")
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 18:47:18 -08:00
Ido Schimmel
7b01e53eee nexthop: Unlink nexthop group entry in error path
In case of error, remove the nexthop group entry from the list to which
it was previously added.

Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 18:47:18 -08:00
Ido Schimmel
07e61a979c nexthop: Fix off-by-one error in error path
A reference was not taken for the current nexthop entry, so do not try
to put it in the error path.

Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 18:47:18 -08:00
Jonathan Lemon
8e04491724 skbuff: Rename skb_zcopy_{get|put} to net_zcopy_{get|put}
Unlike the rest of the skb_zcopy_ functions, these routines
operate on a 'struct ubuf', not a skb.  Remove the 'skb_'
prefix from the naming to make things clearer.

Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:08:37 -08:00
Jonathan Lemon
04c2d33eab skbuff: add flags to ubuf_info for ubuf setup
Currently, when an ubuf is attached to a new skb, the shared
flags word is initialized to a fixed value.  Instead of doing
this, set the default flags in the ubuf, and have new skbs
inherit from this default.

This is needed when setting up different zerocopy types.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:08:37 -08:00
Jonathan Lemon
06b4feb37e net: group skb_shinfo zerocopy related bits together.
In preparation for expanded zerocopy (TX and RX), move
the zerocopy related bits out of tx_flags into their own
flag word.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:08:37 -08:00
Jonathan Lemon
8c793822c5 skbuff: rename sock_zerocopy_* to msg_zerocopy_*
At Willem's suggestion, rename the sock_zerocopy_* functions
so that they match the MSG_ZEROCOPY flag, which makes it clear
they are specific to this zerocopy implementation.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:08:35 -08:00
Jonathan Lemon
70c4316749 skbuff: Call skb_zcopy_clear() before unref'ing fragments
RX zerocopy fragment pages which are not allocated from the
system page pool require special handling.  Give the callback
in skb_zcopy_clear() a chance to process them first.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:06:38 -08:00
Jonathan Lemon
236a6b1cd5 skbuff: Call sock_zerocopy_put_abort from skb_zcopy_put_abort
The sock_zerocopy_put_abort function contains logic which is
specific to the current zerocopy implementation.  Add a wrapper
which checks the callback and dispatches apppropriately.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:06:37 -08:00
Jonathan Lemon
36177832f4 skbuff: Add skb parameter to the ubuf zerocopy callback
Add an optional skb parameter to the zerocopy callback parameter,
which is passed down from skb_zcopy_clear().  This gives access
to the original skb, which is needed for upcoming RX zero-copy
error handling.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:06:37 -08:00
Jonathan Lemon
e76d46cfff skbuff: replace sock_zerocopy_get with skb_zcopy_get
Rename the get routines for consistency.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:06:37 -08:00
Jonathan Lemon
59776362b1 skbuff: replace sock_zerocopy_put() with skb_zcopy_put()
Replace sock_zerocopy_put with the generic skb_zcopy_put()
function.  Pass 'true' as the success argument, as this
is identical to no change.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:06:37 -08:00
Jonathan Lemon
75518851a2 skbuff: Push status and refcounts into sock_zerocopy_callback
Before this change, the caller of sock_zerocopy_callback would
need to save the zerocopy status, decrement and check the refcount,
and then call the callback function - the callback was only invoked
when the refcount reached zero.

Now, the caller just passes the status into the callback function,
which saves the status and handles its own refcounts.

This makes the behavior of the sock_zerocopy_callback identical
to the tpacket and vhost callbacks.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:06:37 -08:00
Jonathan Lemon
d6adf1b103 skbuff: simplify sock_zerocopy_put
All 'struct ubuf_info' users should have a callback defined
as of commit 0a4a060bb2 ("sock: fix zerocopy_success regression
with msg_zerocopy").

Remove the dead code path to consume_skb(), which makes
assumptions about how the structure was allocated.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:06:37 -08:00
Vladimir Oltean
1dbb130281 net: dsa: remove the DSA specific notifiers
This effectively reverts commit 60724d4bae ("net: dsa: Add support for
DSA specific notifiers"). The reason is that since commit 2f1e8ea726
("net: dsa: link interfaces with the DSA master to get rid of lockdep
warnings"), it appears that there is a generic way to achieve the same
purpose. The only user thus far, the Broadcom SYSTEMPORT driver, was
converted to use the generic notifiers.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 15:42:07 -08:00
Vladimir Oltean
a5e3c9ba92 net: dsa: export dsa_slave_dev_check
Using the NETDEV_CHANGEUPPER notifications, drivers can be aware when
they are enslaved to e.g. a bridge by calling netif_is_bridge_master().

Export this helper from DSA to get the equivalent functionality of
determining whether the upper interface of a CHANGEUPPER notifier is a
DSA switch interface or not.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 15:42:07 -08:00
Vladimir Oltean
f46b9b8ee8 net: dsa: move the Broadcom tag information in a separate header file
It is a bit strange to see something as specific as Broadcom SYSTEMPORT
bits in the main DSA include file. Move these away into a separate
header, and have the tagger and the SYSTEMPORT driver include them.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 15:42:07 -08:00
Vladimir Oltean
d5f19486ce net: dsa: listen for SWITCHDEV_{FDB,DEL}_ADD_TO_DEVICE on foreign bridge neighbors
Some DSA switches (and not only) cannot learn source MAC addresses from
packets injected from the CPU. They only perform hardware address
learning from inbound traffic.

This can be problematic when we have a bridge spanning some DSA switch
ports and some non-DSA ports (which we'll call "foreign interfaces" from
DSA's perspective).

There are 2 classes of problems created by the lack of learning on
CPU-injected traffic:
- excessive flooding, due to the fact that DSA treats those addresses as
  unknown
- the risk of stale routes, which can lead to temporary packet loss

To illustrate the second class, consider the following situation, which
is common in production equipment (wireless access points, where there
is a WLAN interface and an Ethernet switch, and these form a single
bridging domain).

 AP 1:
 +------------------------------------------------------------------------+
 |                                          br0                           |
 +------------------------------------------------------------------------+
 +------------+ +------------+ +------------+ +------------+ +------------+
 |    swp0    | |    swp1    | |    swp2    | |    swp3    | |    wlan0   |
 +------------+ +------------+ +------------+ +------------+ +------------+
       |                                                       ^        ^
       |                                                       |        |
       |                                                       |        |
       |                                                    Client A  Client B
       |
       |
       |
 +------------+ +------------+ +------------+ +------------+ +------------+
 |    swp0    | |    swp1    | |    swp2    | |    swp3    | |    wlan0   |
 +------------+ +------------+ +------------+ +------------+ +------------+
 +------------------------------------------------------------------------+
 |                                          br0                           |
 +------------------------------------------------------------------------+
 AP 2

- br0 of AP 1 will know that Clients A and B are reachable via wlan0
- the hardware fdb of a DSA switch driver today is not kept in sync with
  the software entries on other bridge ports, so it will not know that
  clients A and B are reachable via the CPU port UNLESS the hardware
  switch itself performs SA learning from traffic injected from the CPU.
  Nonetheless, a substantial number of switches don't.
- the hardware fdb of the DSA switch on AP 2 may autonomously learn that
  Client A and B are reachable through swp0. Therefore, the software br0
  of AP 2 also may or may not learn this. In the example we're
  illustrating, some Ethernet traffic has been going on, and br0 from AP
  2 has indeed learnt that it can reach Client B through swp0.

One of the wireless clients, say Client B, disconnects from AP 1 and
roams to AP 2. The topology now looks like this:

 AP 1:
 +------------------------------------------------------------------------+
 |                                          br0                           |
 +------------------------------------------------------------------------+
 +------------+ +------------+ +------------+ +------------+ +------------+
 |    swp0    | |    swp1    | |    swp2    | |    swp3    | |    wlan0   |
 +------------+ +------------+ +------------+ +------------+ +------------+
       |                                                            ^
       |                                                            |
       |                                                         Client A
       |
       |
       |                                                         Client B
       |                                                            |
       |                                                            v
 +------------+ +------------+ +------------+ +------------+ +------------+
 |    swp0    | |    swp1    | |    swp2    | |    swp3    | |    wlan0   |
 +------------+ +------------+ +------------+ +------------+ +------------+
 +------------------------------------------------------------------------+
 |                                          br0                           |
 +------------------------------------------------------------------------+
 AP 2

- br0 of AP 1 still knows that Client A is reachable via wlan0 (no change)
- br0 of AP 1 will (possibly) know that Client B has left wlan0. There
  are cases where it might never find out though. Either way, DSA today
  does not process that notification in any way.
- the hardware FDB of the DSA switch on AP 1 may learn autonomously that
  Client B can be reached via swp0, if it receives any packet with
  Client 1's source MAC address over Ethernet.
- the hardware FDB of the DSA switch on AP 2 still thinks that Client B
  can be reached via swp0. It does not know that it has roamed to wlan0,
  because it doesn't perform SA learning from the CPU port.

Now Client A contacts Client B.
AP 1 routes the packet fine towards swp0 and delivers it on the Ethernet
segment.
AP 2 sees a frame on swp0 and its fdb says that the destination is swp0.
Hairpinning is disabled => drop.

This problem comes from the fact that these switches have a 'blind spot'
for addresses coming from software bridging. The generic solution is not
to assume that hardware learning can be enabled somehow, but to listen
to more bridge learning events. It turns out that the bridge driver does
learn in software from all inbound frames, in __br_handle_local_finish.
A proper SWITCHDEV_FDB_ADD_TO_DEVICE notification is emitted for the
addresses serviced by the bridge on 'foreign' interfaces. The software
bridge also does the right thing on migration, by notifying that the old
entry is deleted, so that does not need to be special-cased in DSA. When
it is deleted, we just need to delete our static FDB entry towards the
CPU too, and wait.

The problem is that DSA currently only cares about SWITCHDEV_FDB_ADD_TO_DEVICE
events received on its own interfaces, such as static FDB entries.

Luckily we can change that, and DSA can listen to all switchdev FDB
add/del events in the system and figure out if those events were emitted
by a bridge that spans at least one of DSA's own ports. In case that is
true, DSA will also offload that address towards its own CPU port, in
the eventuality that there might be bridge clients attached to the DSA
switch who want to talk to the station connected to the foreign
interface.

In terms of implementation, we need to keep the fdb_info->added_by_user
check for the case where the switchdev event was targeted directly at a
DSA switch port. But we don't need to look at that flag for snooped
events. So the check is currently too late, we need to move it earlier.
This also simplifies the code a bit, since we avoid uselessly allocating
and freeing switchdev_work.

We could probably do some improvements in the future. For example,
multi-bridge support is rudimentary at the moment. If there are two
bridges spanning a DSA switch's ports, and both of them need to service
the same MAC address, then what will happen is that the migration of one
of those stations will trigger the deletion of the FDB entry from the
CPU port while it is still used by other bridge. That could be improved
with reference counting but is left for another time.

This behavior needs to be enabled at driver level by setting
ds->assisted_learning_on_cpu_port = true. This is because we don't want
to inflict a potential performance penalty (accesses through
MDIO/I2C/SPI are expensive) to hardware that really doesn't need it
because address learning on the CPU port works there.

Reported-by: DENG Qingfang <dqfext@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 15:34:46 -08:00
Vladimir Oltean
5fb4a451a8 net: dsa: exit early in dsa_slave_switchdev_event if we can't program the FDB
Right now, the following would happen for a switch driver that does not
implement .port_fdb_add or .port_fdb_del.

dsa_slave_switchdev_event returns NOTIFY_OK and schedules:
-> dsa_slave_switchdev_event_work
   -> dsa_port_fdb_add
      -> dsa_port_notify(DSA_NOTIFIER_FDB_ADD)
         -> dsa_switch_fdb_add
            -> if (!ds->ops->port_fdb_add) return -EOPNOTSUPP;
   -> an error is printed with dev_dbg, and
      dsa_fdb_offload_notify(switchdev_work) is not called.

We can avoid scheduling the worker for nothing and say NOTIFY_DONE.
Because we don't call dsa_fdb_offload_notify, the static FDB entry will
remain just in the software bridge.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 15:34:46 -08:00
Vladimir Oltean
447d290a58 net: dsa: move switchdev event implementation under the same switch/case statement
We'll need to start listening to SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE
events even for interfaces where dsa_slave_dev_check returns false, so
we need that check inside the switch-case statement for SWITCHDEV_FDB_*.

This movement also avoids a useless allocation / free of switchdev_work
on the untreated "default event" case.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 15:34:46 -08:00
Vladimir Oltean
c4bb76a9a0 net: dsa: don't use switchdev_notifier_fdb_info in dsa_switchdev_event_work
Currently DSA doesn't add FDB entries on the CPU port, because it only
does so through switchdev, which is associated with a net_device, and
there are none of those for the CPU port.

But actually FDB addresses on the CPU port have some use cases of their
own, if the switchdev operations are initiated from within the DSA
layer. There is just one problem with the existing code: it passes a
structure in dsa_switchdev_event_work which was retrieved directly from
switchdev, so it contains a net_device. We need to generalize the
contents to something that covers the CPU port as well: the "ds, port"
tuple is fine for that.

Note that the new procedure for notifying the successful FDB offload is
inspired from the rocker model.

Also, nothing was being done if added_by_user was false. Let's check for
that a lot earlier, and don't actually bother to schedule the worker
for nothing.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 15:34:45 -08:00
Vladimir Oltean
2fd186501b net: dsa: be louder when a non-legacy FDB operation fails
The dev_close() call was added in commit c9eb3e0f87 ("net: dsa: Add
support for learning FDB through notification") "to indicate inconsistent
situation" when we could not delete an FDB entry from the port.

bridge fdb del d8:58:d7:00:ca:6d dev swp0 self master

It is a bit drastic and at the same time not helpful if the above fails
to only print with netdev_dbg log level, but on the other hand to bring
the interface down.

So increase the verbosity of the error message, and drop dev_close().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 15:34:45 -08:00
Vladimir Oltean
90dc8fd360 net: bridge: notify switchdev of disappearance of old FDB entry upon migration
Currently the bridge emits atomic switchdev notifications for
dynamically learnt FDB entries. Monitoring these notifications works
wonders for switchdev drivers that want to keep their hardware FDB in
sync with the bridge's FDB.

For example station A wants to talk to station B in the diagram below,
and we are concerned with the behavior of the bridge on the DUT device:

                   DUT
 +-------------------------------------+
 |                 br0                 |
 | +------+ +------+ +------+ +------+ |
 | |      | |      | |      | |      | |
 | | swp0 | | swp1 | | swp2 | | eth0 | |
 +-------------------------------------+
      |        |                  |
  Station A    |                  |
               |                  |
         +--+------+--+    +--+------+--+
         |  |      |  |    |  |      |  |
         |  | swp0 |  |    |  | swp0 |  |
 Another |  +------+  |    |  +------+  | Another
  switch |     br0    |    |     br0    | switch
         |  +------+  |    |  +------+  |
         |  |      |  |    |  |      |  |
         |  | swp1 |  |    |  | swp1 |  |
         +--+------+--+    +--+------+--+
                                  |
                              Station B

Interfaces swp0, swp1, swp2 are handled by a switchdev driver that has
the following property: frames injected from its control interface bypass
the internal address analyzer logic, and therefore, this hardware does
not learn from the source address of packets transmitted by the network
stack through it. So, since bridging between eth0 (where Station B is
attached) and swp0 (where Station A is attached) is done in software,
the switchdev hardware will never learn the source address of Station B.
So the traffic towards that destination will be treated as unknown, i.e.
flooded.

This is where the bridge notifications come in handy. When br0 on the
DUT sees frames with Station B's MAC address on eth0, the switchdev
driver gets these notifications and can install a rule to send frames
towards Station B's address that are incoming from swp0, swp1, swp2,
only towards the control interface. This is all switchdev driver private
business, which the notification makes possible.

All is fine until someone unplugs Station B's cable and moves it to the
other switch:

                   DUT
 +-------------------------------------+
 |                 br0                 |
 | +------+ +------+ +------+ +------+ |
 | |      | |      | |      | |      | |
 | | swp0 | | swp1 | | swp2 | | eth0 | |
 +-------------------------------------+
      |        |                  |
  Station A    |                  |
               |                  |
         +--+------+--+    +--+------+--+
         |  |      |  |    |  |      |  |
         |  | swp0 |  |    |  | swp0 |  |
 Another |  +------+  |    |  +------+  | Another
  switch |     br0    |    |     br0    | switch
         |  +------+  |    |  +------+  |
         |  |      |  |    |  |      |  |
         |  | swp1 |  |    |  | swp1 |  |
         +--+------+--+    +--+------+--+
               |
           Station B

Luckily for the use cases we care about, Station B is noisy enough that
the DUT hears it (on swp1 this time). swp1 receives the frames and
delivers them to the bridge, who enters the unlikely path in br_fdb_update
of updating an existing entry. It moves the entry in the software bridge
to swp1 and emits an addition notification towards that.

As far as the switchdev driver is concerned, all that it needs to ensure
is that traffic between Station A and Station B is not forever broken.
If it does nothing, then the stale rule to send frames for Station B
towards the control interface remains in place. But Station B is no
longer reachable via the control interface, but via a port that can
offload the bridge port learning attribute. It's just that the port is
prevented from learning this address, since the rule overrides FDB
updates. So the rule needs to go. The question is via what mechanism.

It sure would be possible for this switchdev driver to keep track of all
addresses which are sent to the control interface, and then also listen
for bridge notifier events on its own ports, searching for the ones that
have a MAC address which was previously sent to the control interface.
But this is cumbersome and inefficient. Instead, with one small change,
the bridge could notify of the address deletion from the old port, in a
symmetrical manner with how it did for the insertion. Then the switchdev
driver would not be required to monitor learn/forget events for its own
ports. It could just delete the rule towards the control interface upon
bridge entry migration. This would make hardware address learning be
possible again. Then it would take a few more packets until the hardware
and software FDB would be in sync again.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 15:34:45 -08:00
Florian Westphal
bb4cc1a188 net: ip: always refragment ip defragmented packets
Conntrack reassembly records the largest fragment size seen in IPCB.
However, when this gets forwarded/transmitted, fragmentation will only
be forced if one of the fragmented packets had the DF bit set.

In that case, a flag in IPCB will force fragmentation even if the
MTU is large enough.

This should work fine, but this breaks with ip tunnels.
Consider client that sends a UDP datagram of size X to another host.

The client fragments the datagram, so two packets, of size y and z, are
sent. DF bit is not set on any of these packets.

Middlebox netfilter reassembles those packets back to single size-X
packet, before routing decision.

packet-size-vs-mtu checks in ip_forward are irrelevant, because DF bit
isn't set.  At output time, ip refragmentation is skipped as well
because x is still smaller than the mtu of the output device.

If ttransmit device is an ip tunnel, the packet size increases to
x+overhead.

Also, tunnel might be configured to force DF bit on outer header.

In this case, packet will be dropped (exceeds MTU) and an ICMP error is
generated back to sender.

But sender already respects the announced MTU, all the packets that
it sent did fit the announced mtu.

Force refragmentation as per original sizes unconditionally so ip tunnel
will encapsulate the fragments instead.

The only other solution I see is to place ip refragmentation in
the ip_tunnel code to handle this case.

Fixes: d6b915e29f ("ip_fragment: don't forward defragmented DF packet")
Reported-by: Christian Perle <christian.perle@secunet.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 14:42:36 -08:00
Florian Westphal
50c661670f net: fix pmtu check in nopmtudisc mode
For some reason ip_tunnel insist on setting the DF bit anyway when the
inner header has the DF bit set, EVEN if the tunnel was configured with
'nopmtudisc'.

This means that the script added in the previous commit
cannot be made to work by adding the 'nopmtudisc' flag to the
ip tunnel configuration. Doing so breaks connectivity even for the
without-conntrack/netfilter scenario.

When nopmtudisc is set, the tunnel will skip the mtu check, so no
icmp error is sent to client. Then, because inner header has DF set,
the outer header gets added with DF bit set as well.

IP stack then sends an error to itself because the packet exceeds
the device MTU.

Fixes: 23a3647bc4 ("ip_tunnels: Use skb-len to PMTU check.")
Cc: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 14:42:36 -08:00
Jakub Kicinski
b9ef3fecd1 udp_tunnel: reshuffle NETIF_F_RX_UDP_TUNNEL_PORT checks
Move the NETIF_F_RX_UDP_TUNNEL_PORT feature check into
udp_tunnel_nic_*_port() helpers, since they're always
done right before the call.

Add similar checks before calling the notifier.
udp_tunnel_nic invokes the notifier without checking
features which could result in some wasted cycles.

Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 12:53:29 -08:00
Jakub Kicinski
876c4384ae udp_tunnel: hard-wire NDOs to udp_tunnel_nic_*_port() helpers
All drivers use udp_tunnel_nic_*_port() helpers, prepare for
NDO removal by invoking those helpers directly.

The helpers are safe to call on all devices, they check if
device has the UDP tunnel state initialized.

Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 12:53:29 -08:00
Sean Tranchetti
d8f5c29653 net: ipv6: fib: flush exceptions when purging route
Route removal is handled by two code paths. The main removal path is via
fib6_del_route() which will handle purging any PMTU exceptions from the
cache, removing all per-cpu copies of the DST entry used by the route, and
releasing the fib6_info struct.

The second removal location is during fib6_add_rt2node() during a route
replacement operation. This path also calls fib6_purge_rt() to handle
cleaning up the per-cpu copies of the DST entries and releasing the
fib6_info associated with the older route, but it does not flush any PMTU
exceptions that the older route had. Since the older route is removed from
the tree during the replacement, we lose any way of accessing it again.

As these lingering DSTs and the fib6_info struct are holding references to
the underlying netdevice struct as well, unregistering that device from the
kernel can never complete.

Fixes: 2b760fcf5c ("ipv6: hook up exception table to store dst cache")
Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/1609892546-11389-1-git-send-email-stranche@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 12:03:16 -08:00
Jakub Kicinski
c10b377ff6 linux-can-next-for-5.12-20210106
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEK3kIWJt9yTYMP3ehqclaivrt76kFAl/12xATHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRCpyVqK+u3vqcB6B/4xTRQCOLF1BN59AdixqGf9gdXjIpQs
 R5h4WJXqh9OPS4XSckSVBVGYniT8KdfBLL0IY4NV8q2s6R6Yo70G3GTiaYXI4cX/
 BaKMRcvX6p1i2qZxTtHDGYxucnKCd+yRwfWkL4GLsCNcmTWCkqUL+nGgx6UmS6eB
 9fMHcLXiP9vSfJQUfAOLR33Bu2C0QslI5fX5hi8PIdKEz9cHGf3PAuaRvtWEWi3a
 1Zi/iXrv/0XDcT+OE5/2tAU6n8slIA2MU/80r7KdgUrWXOfuueH/pqepYgVU0Nf2
 eFH+gDYBlZucd4ef37TPIAdPTw3z4F9irOVI1iwRIWdK3aQ3jyinsVWF
 =0lW0
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-next-for-5.12-20210106' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2021-01-06

The first 16 patches are by me and target the tcan4x5x SPI glue driver for the
m_can CAN driver. First there are a several cleanup commits, then the SPI
regmap part is converted to 8 bits per word, to make it possible to use that
driver on SPI controllers that only support the 8 bit per word mode (such as
the SPI cores on the raspberry pi).

Oliver Hartkopp contributes a patch for the CAN_RAW protocol. The getsockopt()
for CAN_RAW_FILTER is changed to return -ERANGE if the filterset does not fit
into the provided user space buffer.

The last two patches are by Joakim Zhang and add wakeup support to the flexcan
driver for the i.MX8QM SoC. The dt-bindings docs are extended to describe the
added property.

* tag 'linux-can-next-for-5.12-20210106' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
  can: flexcan: add CAN wakeup function for i.MX8QM
  dt-bindings: can: fsl,flexcan: add fsl,scu-index property to indicate a resource
  can: raw: return -ERANGE when filterset does not fit into user space buffer
  can: tcan4x5x: add support for half-duplex controllers
  can: tcan4x5x: rework SPI access
  can: tcan4x5x: add {wr,rd}_table
  can: tcan4x5x: add max_raw_{read,write} of 256
  can: tcan4x5x: tcan4x5x_regmap: set reg_stride to 4
  can: tcan4x5x: fix max register value
  can: tcan4x5x: tcan4x5x_regmap_init(): use spi as context pointer
  can: tcan4x5x: tcan4x5x_regmap_write(): remove not needed casts and replace 4 by sizeof
  can: tcan4x5x: rename regmap_spi_gather_write() -> tcan4x5x_regmap_gather_write()
  can: tcan4x5x: remove regmap async support
  can: tcan4x5x: tcan4x5x_bus: remove not needed read_flag_mask
  can: tcan4x5x: mark struct regmap_bus tcan4x5x_bus as constant
  can: tcan4x5x: move regmap code into seperate file
  can: tcan4x5x: rename tcan4x5x.c -> tcan4x5x-core.c
  can: tcan4x5x: beautify indention of tcan4x5x_of_match and tcan4x5x_id_table
  can: tcan4x5x: replace DEVICE_NAME by KBUILD_MODNAME
====================

Link: https://lore.kernel.org/r/20210107094900.173046-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 10:53:32 -08:00
Rafał Miłecki
8209f5bc3b net: dsa: print error on invalid port index
Looking for an -EINVAL all over the dsa code could take hours for
inexperienced DSA users.

Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20210106090915.21439-1-zajec5@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-06 16:21:08 -08:00
Oliver Hartkopp
0de70e287b can: raw: return -ERANGE when filterset does not fit into user space buffer
Multiple filters (struct can_filter) can be set with the setsockopt()
function, which was originally intended as a write-only operation.

As getsockopt() also provides a CAN_RAW_FILTER option to read back the
given filters, the caller has to provide an appropriate user space buffer.
In the case this buffer is too small the getsockopt() silently truncates
the filter information and gives no information about the needed space.
This is safe but not convenient for the programmer.

In net/core/sock.c the SO_PEERGROUPS sockopt had a similar requirement
and solved it by returning -ERANGE in the case that the provided data
does not fit into the given user space buffer and fills the required size
into optlen, so that the caller can retry with a matching buffer length.

This patch adopts this approach for CAN_RAW_FILTER getsockopt().

Reported-by: Phillip Schichtel <phillip@schich.tel>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Tested-By: Phillip Schichtel <phillip@schich.tel>
Link: https://lore.kernel.org/r/20201216174928.21663-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-01-06 15:15:41 +01:00
Ard Biesheuvel
ef0bb5adc1 Bluetooth: avoid u128_xor() on potentially misaligned inputs
u128_xor() takes pointers to quantities that are assumed to be at least
64-bit aligned, which is not guaranteed to be the case in the smp_c1()
routine. So switch to crypto_xor() instead.

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2021-01-06 08:41:48 +01:00
Qinglang Miao
4beb17e553 net: qrtr: fix null-ptr-deref in qrtr_ns_remove
A null-ptr-deref bug is reported by Hulk Robot like this:
--------------
KASAN: null-ptr-deref in range [0x0000000000000128-0x000000000000012f]
Call Trace:
qrtr_ns_remove+0x22/0x40 [ns]
qrtr_proto_fini+0xa/0x31 [qrtr]
__x64_sys_delete_module+0x337/0x4e0
do_syscall_64+0x34/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x468ded
--------------

When qrtr_ns_init fails in qrtr_proto_init, qrtr_ns_remove which would
be called later on would raise a null-ptr-deref because qrtr_ns.workqueue
has been destroyed.

Fix it by making qrtr_ns_init have a return value and adding a check in
qrtr_proto_init.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-01-05 16:50:09 -08:00
Jakub Kicinski
55b7ab1178 net: vlan: avoid leaks on register_vlan_dev() failures
VLAN checks for NETREG_UNINITIALIZED to distinguish between
registration failure and unregistration in progress.

Since commit cb626bf566 ("net-sysfs: Fix reference count leak")
registration failure may, however, result in NETREG_UNREGISTERED
as well as NETREG_UNINITIALIZED.

This fix is similer to cebb69754f ("rtnetlink: Fix
memory(net_device) leak when ->newlink fails")

Fixes: cb626bf566 ("net-sysfs: Fix reference count leak")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-01-05 16:25:31 -08:00
Bongsu Jeon
f011539e72 net: nfc: nci: Change the NCI close sequence
If there is a NCI command in work queue after closing the NCI device at
nci_unregister_device, The NCI command timer starts at flush_workqueue
function and then NCI command timeout handler would be called 5 second
after flushing the NCI command work queue and destroying the queue.
At that time, the timeout handler would try to use NCI command work queue
that is destroyed already. it will causes the problem. To avoid this
abnormal situation, change the sequence to prevent the NCI command timeout
handler from being called after destroying the NCI command work queue.

Signed-off-by: Bongsu Jeon <bongsu.jeon@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-01-05 16:06:34 -08:00
Zheng Yongjun
8dc879a1bf net: kcm: Replace fput with sockfd_put
The function sockfd_lookup uses fget on the value that is stored in
the file field of the returned structure, so fput should ultimately be
applied to this value.  This can be done directly, but it seems better
to use the specific macro sockfd_put, which does the same thing.

Perform a source code refactoring by using the following semantic patch.

    // <smpl>
    @@
    expression s;
    @@

       s = sockfd_lookup(...)
       ...
    + sockfd_put(s);
    - fput(s->file);
    // </smpl>

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-01-05 16:02:32 -08:00
Arnd Bergmann
152a8a6c01 cfg80211: select CONFIG_CRC32
Without crc32 support, this fails to link:

arm-linux-gnueabi-ld: net/wireless/scan.o: in function `cfg80211_scan_6ghz':
scan.c:(.text+0x928): undefined reference to `crc32_le'

Fixes: c8cb5b854b ("nl80211/cfg80211: support 6 GHz scanning")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-01-05 15:50:36 -08:00
Zheng Yongjun
520ec34385 net: tipc: Replace expression with offsetof()
Use the existing offsetof() macro instead of duplicating code.

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-01-05 15:43:41 -08:00
Linus Torvalds
aa35e45cd4 Networking fixes for 5.11-rc3, including fixes from netfilter, wireless
and bpf trees.
 
 Current release - regressions:
 
  - mt76: - usb: fix NULL pointer dereference in mt76u_status_worker
          - sdio: fix NULL pointer dereference in mt76s_process_tx_queue
 
  - net: ipa: fix interconnect enable bug
 
 Current release - always broken:
 
  - netfilter: ipset: fixes possible oops in mtype_resize
 
  - ath11k: fix number of coding issues found by static analysis tools
            and spurious error messages
 
 Previous releases - regressions:
 
  - e1000e: re-enable s0ix power saving flows for systems with
            the Intel i219-LM Ethernet controllers to fix power
 	   use regression
 
  - virtio_net: fix recursive call to cpus_read_lock() to avoid
                a deadlock
 
  - ipv4: ignore ECN bits for fib lookups in fib_compute_spec_dst()
 
  - net-sysfs: take the rtnl lock around XPS configuration
 
  - xsk: - fix memory leak for failed bind
         - rollback reservation at NETDEV_TX_BUSY
 
  - r8169: work around power-saving bug on some chip versions
 
 Previous releases - always broken:
 
  - dcb: validate netlink message in DCB handler
 
  - tun: fix return value when the number of iovs exceeds MAX_SKB_FRAGS
         to prevent unnecessary retries
 
  - vhost_net: fix ubuf refcount when sendmsg fails
 
  - bpf: save correct stopping point in file seq iteration
 
  - ncsi: use real net-device for response handler
 
  - neighbor: fix div by zero caused by a data race (TOCTOU)
 
  - bareudp: - fix use of incorrect min_headroom size
             - fix false positive lockdep splat from the TX lock
 
  - net: mvpp2: - clear force link UP during port init procedure
                  in case bootloader had set it
                - add TCAM entry to drop flow control pause frames
 	       - fix PPPoE with ipv6 packet parsing
 	       - fix GoP Networking Complex Control config of port 3
 	       - fix pkt coalescing IRQ-threshold configuration
 
  - xsk: fix race in SKB mode transmit with shared cq
 
  - ionic: account for vlan tag len in rx buffer len
 
  - net: stmmac: ignore the second clock input, current clock framework
                 does not handle exclusive clock use well, other drivers
 		may reconfigure the second clock
 Misc:
 
  - ppp: change PPPIOCUNBRIDGECHAN ioctl request number to follow
         existing scheme
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl/zsqQACgkQMUZtbf5S
 IrvfqA/+MbjN9TRccZRgYVzPVzlP5jswi7VZIjikPrNxCdwgQd8bDMfeaD6I1PcX
 WHf35vtD8zh729qz9DheWXFp7kDQ1fY0Z59KA25xf/ulFEkZPl3RBg70rSgv4rc+
 T82dVo6x33DPe6NkspDC+Uhjz2IxcS/P7F9N7DtbavrfNuDyX8+0U/FFQIL0xOyG
 DuhwecCh0vJFGcWXTWtK1vP1CPD98L28KS2Od+EZsUUZOKt1WMyGrAgNcT6uYXmO
 NIYNy+FPyvvIwTLupoFE7oU4LA0sZozyvzcTDugXBF5EKoR8BwBFk0FfWzN9Oxge
 LrmhNBSTeYyiw8XMOwSIfxwZnBm7mJFQqTHR1+Y83Qw1SR6PfSUZgkEkW2SYgprL
 9CzE3O3P3Ci7TSx7fvZUn8B1q5J0DfZR6ZYyor9zl55e+ikraRYtXsk47bf9AGXl
 owpHXEYWHFmgOP+LVdf1BUjuiE3vnCBJBsHlMbRkxiNPKravWtPSiM2yTu6fEbpT
 pMXCgFQBL/IqwzX01zuw7teg40YLVaFnmFdQbYDwA5p9VODlQvHzn2K4GyuktswX
 wxHYU5WRWtCkBfE+nbAROKzE7MuH9jtPtV1ZeuseTqYGBRuvEvudX8ypEvKS45pP
 OWkzFsSXd9q7M6cxftipwjcyLiIO+UGdizNHvDUyEQOPAyYPKb4=
 =N4/x
 -----END PGP SIGNATURE-----

Merge tag 'net-5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Networking fixes, including fixes from netfilter, wireless and bpf
  trees.

  Current release - regressions:

   - mt76: fix NULL pointer dereference in mt76u_status_worker and
     mt76s_process_tx_queue

   - net: ipa: fix interconnect enable bug

  Current release - always broken:

   - netfilter: fixes possible oops in mtype_resize in ipset

   - ath11k: fix number of coding issues found by static analysis tools
     and spurious error messages

  Previous releases - regressions:

   - e1000e: re-enable s0ix power saving flows for systems with the
     Intel i219-LM Ethernet controllers to fix power use regression

   - virtio_net: fix recursive call to cpus_read_lock() to avoid a
     deadlock

   - ipv4: ignore ECN bits for fib lookups in fib_compute_spec_dst()

   - sysfs: take the rtnl lock around XPS configuration

   - xsk: fix memory leak for failed bind and rollback reservation at
     NETDEV_TX_BUSY

   - r8169: work around power-saving bug on some chip versions

  Previous releases - always broken:

   - dcb: validate netlink message in DCB handler

   - tun: fix return value when the number of iovs exceeds MAX_SKB_FRAGS
     to prevent unnecessary retries

   - vhost_net: fix ubuf refcount when sendmsg fails

   - bpf: save correct stopping point in file seq iteration

   - ncsi: use real net-device for response handler

   - neighbor: fix div by zero caused by a data race (TOCTOU)

   - bareudp: fix use of incorrect min_headroom size and a false
     positive lockdep splat from the TX lock

   - mvpp2:
      - clear force link UP during port init procedure in case
        bootloader had set it
      - add TCAM entry to drop flow control pause frames
      - fix PPPoE with ipv6 packet parsing
      - fix GoP Networking Complex Control config of port 3
      - fix pkt coalescing IRQ-threshold configuration

   - xsk: fix race in SKB mode transmit with shared cq

   - ionic: account for vlan tag len in rx buffer len

   - stmmac: ignore the second clock input, current clock framework does
     not handle exclusive clock use well, other drivers may reconfigure
     the second clock

  Misc:

   - ppp: change PPPIOCUNBRIDGECHAN ioctl request number to follow
     existing scheme"

* tag 'net-5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
  net: dsa: lantiq_gswip: Fix GSWIP_MII_CFG(p) register access
  net: dsa: lantiq_gswip: Enable GSWIP_MII_CFG_EN also for internal PHYs
  net: lapb: Decrease the refcount of "struct lapb_cb" in lapb_device_event
  r8169: work around power-saving bug on some chip versions
  net: usb: qmi_wwan: add Quectel EM160R-GL
  selftests: mlxsw: Set headroom size of correct port
  net: macb: Correct usage of MACB_CAPS_CLK_HW_CHG flag
  ibmvnic: fix: NULL pointer dereference.
  docs: networking: packet_mmap: fix old config reference
  docs: networking: packet_mmap: fix formatting for C macros
  vhost_net: fix ubuf refcount incorrectly when sendmsg fails
  bareudp: Fix use of incorrect min_headroom size
  bareudp: set NETIF_F_LLTX flag
  net: hdlc_ppp: Fix issues when mod_timer is called while timer is running
  atlantic: remove architecture depends
  erspan: fix version 1 check in gre_parse_header()
  net: hns: fix return value check in __lb_other_process()
  net: sched: prevent invalid Scell_log shift count
  net: neighbor: fix a crash caused by mod zero
  ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst()
  ...
2021-01-05 12:38:56 -08:00
Jakub Kicinski
a8f33c038f Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Missing sanitization of rateest userspace string, bug has been
   triggered by syzbot, patch from Florian Westphal.

2) Report EOPNOTSUPP on missing set features in nft_dynset, otherwise
   error reporting to userspace via EINVAL is misleading since this is
   reserved for malformed netlink requests.

3) New binaries with old kernels might silently accept several set
   element expressions. New binaries set on the NFT_SET_EXPR and
   NFT_DYNSET_F_EXPR flags to request for several expressions per
   element, hence old kernels which do not support for this bail out
   with EOPNOTSUPP.

* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
  netfilter: nftables: add set expression flags
  netfilter: nft_dynset: report EOPNOTSUPP on missing set feature
  netfilter: xt_RATEEST: reject non-null terminated string from userspace
====================

Link: https://lore.kernel.org/r/20210103192920.18639-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-04 14:02:02 -08:00
Xie He
b40f97b91a net: lapb: Decrease the refcount of "struct lapb_cb" in lapb_device_event
In lapb_device_event, lapb_devtostruct is called to get a reference to
an object of "struct lapb_cb". lapb_devtostruct increases the refcount
of the object and returns a pointer to it. However, we didn't decrease
the refcount after we finished using the pointer. This patch fixes this
problem.

Fixes: a4989fa911 ("net/lapb: support netdev events")
Cc: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Link: https://lore.kernel.org/r/20201231174331.64539-1-xie.he.0141@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-04 13:42:41 -08:00
Ilya Dryomov
4972cf605f libceph, ceph: disambiguate ceph_connection_operations handlers
Since a few years, kernel addresses are no longer included in oops
dumps, at least on x86.  All we get is a symbol name with offset and
size.

This is a problem for ceph_connection_operations handlers, especially
con->ops->dispatch().  All three handlers have the same name and there
is little context to disambiguate between e.g. monitor and OSD clients
because almost everything is inlined.  gdb sneakily stops at the first
matching symbol, so one has to resort to nm and addr2line.

Some of these are already prefixed with mon_, osd_ or mds_.  Let's do
the same for all others.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
2021-01-04 17:31:32 +01:00
Ilya Dryomov
10f42b3e64 libceph: zero out session key and connection secret
Try and avoid leaving bits and pieces of session key and connection
secret (gets split into GCM key and a pair of GCM IVs) around.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2021-01-04 17:31:32 +01:00
Visa Hankala
da64ae2d35 xfrm: Fix wraparound in xfrm_policy_addr_delta()
Use three-way comparison for address components to avoid integer
wraparound in the result of xfrm_policy_addr_delta(). This ensures
that the search trees are built and traversed correctly.

Treat IPv4 and IPv6 similarly by returning 0 when prefixlen == 0.
Prefix /0 has only one equivalence class.

Fixes: 9cf545ebd5 ("xfrm: policy: store inexact policies in a tree ordered by destination address")
Signed-off-by: Visa Hankala <visa@hankala.org>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-01-04 10:35:09 +01:00
Cong Wang
afbc293add af_key: relax availability checks for skb size calculation
xfrm_probe_algs() probes kernel crypto modules and changes the
availability of struct xfrm_algo_desc. But there is a small window
where ealg->available and aalg->available get changed between
count_ah_combs()/count_esp_combs() and dump_ah_combs()/dump_esp_combs(),
in this case we may allocate a smaller skb but later put a larger
amount of data and trigger the panic in skb_put().

Fix this by relaxing the checks when counting the size, that is,
skipping the test of ->available. We may waste some memory for a few
of sizeof(struct sadb_comb), but it is still much better than a panic.

Reported-by: syzbot+b2bf2652983d23734c5c@syzkaller.appspotmail.com
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-01-04 10:05:50 +01:00
Eyal Birger
9f8550e4bd xfrm: fix disable_xfrm sysctl when used on xfrm interfaces
The disable_xfrm flag signals that xfrm should not be performed during
routing towards a device before reaching device xmit.

For xfrm interfaces this is usually desired as they perform the outbound
policy lookup as part of their xmit using their if_id.

Before this change enabling this flag on xfrm interfaces prevented them
from xmitting as xfrm_lookup_with_ifid() would not perform a policy lookup
in case the original dst had the DST_NOXFRM flag.

This optimization is incorrect when the lookup is done by the xfrm
interface xmit logic.

Fix by performing policy lookup when invoked by xfrmi as if_id != 0.

Similarly it's unlikely for the 'no policy exists on net' check to yield
any performance benefits when invoked from xfrmi.

Fixes: f203b76d78 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2021-01-04 10:04:05 +01:00
David S. Miller
4bfc471484 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-12-28

The following pull-request contains BPF updates for your *net* tree.

There is a small merge conflict between bpf tree commit 69ca310f34
("bpf: Save correct stopping point in file seq iteration") and net tree
commit 66ed594409 ("bpf/task_iter: In task_file_seq_get_next use
task_lookup_next_fd_rcu"). The get_files_struct() does not exist anymore
in net, so take the hunk in HEAD and add the `info->tid = curr_tid` to
the error path:

  [...]
                curr_task = task_seq_get_next(ns, &curr_tid, true);
                if (!curr_task) {
                        info->task = NULL;
                        info->tid = curr_tid;
                        return NULL;
                }

                /* set info->task and info->tid */
  [...]

We've added 10 non-merge commits during the last 9 day(s) which contain
a total of 11 files changed, 75 insertions(+), 20 deletions(-).

The main changes are:

1) Various AF_XDP fixes such as fill/completion ring leak on failed bind and
   fixing a race in skb mode's backpressure mechanism, from Magnus Karlsson.

2) Fix latency spikes on lockdep enabled kernels by adding a rescheduling
   point to BPF hashtab initialization, from Eric Dumazet.

3) Fix a splat in task iterator by saving the correct stopping point in the
   seq file iteration, from Jonathan Lemon.

4) Fix BPF maps selftest by adding retries in case hashtab returns EBUSY
   errors on update/deletes, from Andrii Nakryiko.

5) Fix BPF selftest error reporting to something more user friendly if the
   vmlinux BTF cannot be found, from Kamal Mostafa.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-28 15:26:11 -08:00
Cong Wang
085c7c4e1c erspan: fix version 1 check in gre_parse_header()
Both version 0 and version 1 use ETH_P_ERSPAN, but version 0 does not
have an erspan header. So the check in gre_parse_header() is wrong,
we have to distinguish version 1 from version 0.

We can just check the gre header length like is_erspan_type1().

Fixes: cb73ee40b1 ("net: ip_gre: use erspan key field for tunnel lookup")
Reported-by: syzbot+f583ce3d4ddf9836b27a@syzkaller.appspotmail.com
Cc: William Tu <u9012063@gmail.com>
Cc: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-28 15:00:00 -08:00
Randy Dunlap
bd1248f1dd net: sched: prevent invalid Scell_log shift count
Check Scell_log shift size in red_check_params() and modify all callers
of red_check_params() to pass Scell_log.

This prevents a shift out-of-bounds as detected by UBSAN:
  UBSAN: shift-out-of-bounds in ./include/net/red.h:252:22
  shift exponent 72 is too large for 32-bit type 'int'

Fixes: 8afa10cbe2 ("net_sched: red: Avoid illegal values")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: syzbot+97c5bd9cc81eca63d36e@syzkaller.appspotmail.com
Cc: Nogah Frankel <nogahf@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: netdev@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-28 14:52:54 -08:00
weichenchen
a533b70a65 net: neighbor: fix a crash caused by mod zero
pneigh_enqueue() tries to obtain a random delay by mod
NEIGH_VAR(p, PROXY_DELAY). However, NEIGH_VAR(p, PROXY_DELAY)
migth be zero at that point because someone could write zero
to /proc/sys/net/ipv4/neigh/[device]/proxy_delay after the
callers check it.

This patch uses prandom_u32_max() to get a random delay instead
which avoids potential division by zero.

Signed-off-by: weichenchen <weichen.chen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-28 14:49:48 -08:00
Guillaume Nault
21fdca22eb ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst()
RT_TOS() only clears one of the ECN bits. Therefore, when
fib_compute_spec_dst() resorts to a fib lookup, it can return
different results depending on the value of the second ECN bit.

For example, ECT(0) and ECT(1) packets could be treated differently.

  $ ip netns add ns0
  $ ip netns add ns1
  $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
  $ ip -netns ns0 link set dev lo up
  $ ip -netns ns1 link set dev lo up
  $ ip -netns ns0 link set dev veth01 up
  $ ip -netns ns1 link set dev veth10 up

  $ ip -netns ns0 address add 192.0.2.10/24 dev veth01
  $ ip -netns ns1 address add 192.0.2.11/24 dev veth10

  $ ip -netns ns1 address add 192.0.2.21/32 dev lo
  $ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10 src 192.0.2.21
  $ ip netns exec ns1 sysctl -wq net.ipv4.icmp_echo_ignore_broadcasts=0

With TOS 4 and ECT(1), ns1 replies using source address 192.0.2.21
(ping uses -Q to set all TOS and ECN bits):

  $ ip netns exec ns0 ping -c 1 -b -Q 5 192.0.2.255
  [...]
  64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.544 ms

But with TOS 4 and ECT(0), ns1 replies using source address 192.0.2.11
because the "tos 4" route isn't matched:

  $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
  [...]
  64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.597 ms

After this patch the ECN bits don't affect the result anymore:

  $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
  [...]
  64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.591 ms

Fixes: 35ebf65e85 ("ipv4: Create and use fib_compute_spec_dst() helper.")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-28 14:44:32 -08:00
Davide Caratti
e7579d5d5b net: mptcp: cap forward allocation to 1M
the following syzkaller reproducer:

 r0 = socket$inet_mptcp(0x2, 0x1, 0x106)
 bind$inet(r0, &(0x7f0000000080)={0x2, 0x4e24, @multicast2}, 0x10)
 connect$inet(r0, &(0x7f0000000480)={0x2, 0x4e24, @local}, 0x10)
 sendto$inet(r0, &(0x7f0000000100)="f6", 0xffffffe7, 0xc000, 0x0, 0x0)

systematically triggers the following warning:

 WARNING: CPU: 2 PID: 8618 at net/core/stream.c:208 sk_stream_kill_queues+0x3fa/0x580
 Modules linked in:
 CPU: 2 PID: 8618 Comm: syz-executor Not tainted 5.10.0+ #334
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/04
 RIP: 0010:sk_stream_kill_queues+0x3fa/0x580
 Code: df 48 c1 ea 03 0f b6 04 02 84 c0 74 04 3c 03 7e 40 8b ab 20 02 00 00 e9 64 ff ff ff e8 df f0 81 2
 RSP: 0018:ffffc9000290fcb0 EFLAGS: 00010293
 RAX: ffff888011cb8000 RBX: 0000000000000000 RCX: ffffffff86eecf0e
 RDX: 0000000000000000 RSI: ffffffff86eecf6a RDI: 0000000000000005
 RBP: 0000000000000e28 R08: ffff888011cb8000 R09: fffffbfff1f48139
 R10: ffffffff8fa409c7 R11: fffffbfff1f48138 R12: ffff8880215e6220
 R13: ffffffff8fa409c0 R14: ffffc9000290fd30 R15: 1ffff92000521fa2
 FS:  00007f41c78f4800(0000) GS:ffff88802d000000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f95c803d088 CR3: 0000000025ed2000 CR4: 00000000000006f0
 Call Trace:
  __mptcp_destroy_sock+0x4f5/0x8e0
   mptcp_close+0x5e2/0x7f0
  inet_release+0x12b/0x270
  __sock_release+0xc8/0x270
  sock_close+0x18/0x20
  __fput+0x272/0x8e0
  task_work_run+0xe0/0x1a0
  exit_to_user_mode_prepare+0x1df/0x200
  syscall_exit_to_user_mode+0x19/0x50
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

userspace programs provide arbitrarily high values of 'len' in sendmsg():
this is causing integer overflow of 'amount'. Cap forward allocation to 1
megabyte: higher values are not really useful.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Fixes: e93da92896 ("mptcp: implement wmem reservation")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/3334d00d8b2faecafdfab9aa593efcbf61442756.1608584474.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-28 13:53:57 -08:00
Antoine Tenart
4ae2bb8164 net-sysfs: take the rtnl lock when accessing xps_rxqs_map and num_tc
Accesses to dev->xps_rxqs_map (when using dev->num_tc) should be
protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
see an actual bug being triggered, but let's be safe here and take the
rtnl lock while accessing the map in sysfs.

Fixes: 8af2c06ff4 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-28 13:26:46 -08:00
Antoine Tenart
2d57b4f142 net-sysfs: take the rtnl lock when storing xps_rxqs
Two race conditions can be triggered when storing xps rxqs, resulting in
various oops and invalid memory accesses:

1. Calling netdev_set_num_tc while netif_set_xps_queue:

   - netif_set_xps_queue uses dev->tc_num as one of the parameters to
     compute the size of new_dev_maps when allocating it. dev->tc_num is
     also used to access the map, and the compiler may generate code to
     retrieve this field multiple times in the function.

   - netdev_set_num_tc sets dev->tc_num.

   If new_dev_maps is allocated using dev->tc_num and then dev->tc_num
   is set to a higher value through netdev_set_num_tc, later accesses to
   new_dev_maps in netif_set_xps_queue could lead to accessing memory
   outside of new_dev_maps; triggering an oops.

2. Calling netif_set_xps_queue while netdev_set_num_tc is running:

   2.1. netdev_set_num_tc starts by resetting the xps queues,
        dev->tc_num isn't updated yet.

   2.2. netif_set_xps_queue is called, setting up the map with the
        *old* dev->num_tc.

   2.3. netdev_set_num_tc updates dev->tc_num.

   2.4. Later accesses to the map lead to out of bound accesses and
        oops.

   A similar issue can be found with netdev_reset_tc.

One way of triggering this is to set an iface up (for which the driver
uses netdev_set_num_tc in the open path, such as bnx2x) and writing to
xps_rxqs in a concurrent thread. With the right timing an oops is
triggered.

Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
and netdev_reset_tc should be mutually exclusive. We do that by taking
the rtnl lock in xps_rxqs_store.

Fixes: 8af2c06ff4 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-28 13:26:46 -08:00
Antoine Tenart
fb25038586 net-sysfs: take the rtnl lock when accessing xps_cpus_map and num_tc
Accesses to dev->xps_cpus_map (when using dev->num_tc) should be
protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
see an actual bug being triggered, but let's be safe here and take the
rtnl lock while accessing the map in sysfs.

Fixes: 184c449f91 ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-28 13:26:46 -08:00
Antoine Tenart
1ad58225db net-sysfs: take the rtnl lock when storing xps_cpus
Two race conditions can be triggered when storing xps cpus, resulting in
various oops and invalid memory accesses:

1. Calling netdev_set_num_tc while netif_set_xps_queue:

   - netif_set_xps_queue uses dev->tc_num as one of the parameters to
     compute the size of new_dev_maps when allocating it. dev->tc_num is
     also used to access the map, and the compiler may generate code to
     retrieve this field multiple times in the function.

   - netdev_set_num_tc sets dev->tc_num.

   If new_dev_maps is allocated using dev->tc_num and then dev->tc_num
   is set to a higher value through netdev_set_num_tc, later accesses to
   new_dev_maps in netif_set_xps_queue could lead to accessing memory
   outside of new_dev_maps; triggering an oops.

2. Calling netif_set_xps_queue while netdev_set_num_tc is running:

   2.1. netdev_set_num_tc starts by resetting the xps queues,
        dev->tc_num isn't updated yet.

   2.2. netif_set_xps_queue is called, setting up the map with the
        *old* dev->num_tc.

   2.3. netdev_set_num_tc updates dev->tc_num.

   2.4. Later accesses to the map lead to out of bound accesses and
        oops.

   A similar issue can be found with netdev_reset_tc.

One way of triggering this is to set an iface up (for which the driver
uses netdev_set_num_tc in the open path, such as bnx2x) and writing to
xps_cpus in a concurrent thread. With the right timing an oops is
triggered.

Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
and netdev_reset_tc should be mutually exclusive. We do that by taking
the rtnl lock in xps_cpus_store.

Fixes: 184c449f91 ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-28 13:26:46 -08:00
Ilya Dryomov
f5f2c9a0e3 libceph: align session_key and con_secret to 16 bytes
crypto_shash_setkey() and crypto_aead_setkey() will do a (small)
GFP_ATOMIC allocation to align the key if it isn't suitably aligned.
It's not a big deal, but at the same time easy to avoid.

The actual alignment requirement is dynamic, queryable with
crypto_shash_alignmask() and crypto_aead_alignmask(), but shouldn't
be stricter than 16 bytes for our algorithms.

Fixes: cd1a677cad ("libceph, ceph: implement msgr2.1 protocol (crc and secure modes)")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-28 20:34:33 +01:00
Ilya Dryomov
ad32fe8801 libceph: fix auth_signature buffer allocation in secure mode
auth_signature frame is 68 bytes in plain mode and 96 bytes in
secure mode but we are requesting 68 bytes in both modes.  By luck,
this doesn't actually result in any invalid memory accesses because
the allocation is satisfied out of kmalloc-96 slab and so exactly
96 bytes are allocated, but KASAN rightfully complains.

Fixes: cd1a677cad ("libceph, ceph: implement msgr2.1 protocol (crc and secure modes)")
Reported-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-28 20:34:32 +01:00
Pablo Neira Ayuso
b4e70d8dd9 netfilter: nftables: add set expression flags
The set flag NFT_SET_EXPR provides a hint to the kernel that userspace
supports for multiple expressions per set element. In the same
direction, NFT_DYNSET_F_EXPR specifies that dynset expression defines
multiple expressions per set element.

This allows new userspace software with old kernels to bail out with
EOPNOTSUPP. This update is similar to ef516e8625 ("netfilter:
nf_tables: reintroduce the NFT_SET_CONCAT flag"). The NFT_SET_EXPR flag
needs to be set on when the NFTA_SET_EXPRESSIONS attribute is specified.
The NFT_SET_EXPR flag is not set on with NFTA_SET_EXPR to retain
backward compatibility in old userspace binaries.

Fixes: 48b0ae046e ("netfilter: nftables: netlink support for several set element expressions")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-28 10:50:26 +01:00
Pablo Neira Ayuso
95cd4bca7b netfilter: nft_dynset: report EOPNOTSUPP on missing set feature
If userspace requests a feature which is not available the original set
definition, then bail out with EOPNOTSUPP. If userspace sends
unsupported dynset flags (new feature not supported by this kernel),
then report EOPNOTSUPP to userspace. EINVAL should be only used to
report malformed netlink messages from userspace.

Fixes: 22fe54d5fe ("netfilter: nf_tables: add support for dynamic set updates")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-28 10:50:16 +01:00
Christopher William Snowhill
a5687c6440 Bluetooth: Fix initializing response id after clearing struct
Looks like this was missed when patching the source to clear the structures
throughout, causing this one instance to clear the struct after the response
id is assigned.

Fixes: eddb773211 ("Bluetooth: A2MP: Fix not initializing all members")
Signed-off-by: Christopher William Snowhill <chris@kode54.net>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-12-27 21:35:59 +01:00
Florian Westphal
6cb56218ad netfilter: xt_RATEEST: reject non-null terminated string from userspace
syzbot reports:
detected buffer overflow in strlen
[..]
Call Trace:
 strlen include/linux/string.h:325 [inline]
 strlcpy include/linux/string.h:348 [inline]
 xt_rateest_tg_checkentry+0x2a5/0x6b0 net/netfilter/xt_RATEEST.c:143

strlcpy assumes src is a c-string. Check info->name before its used.

Reported-by: syzbot+e86f7c428c8c50db65b4@syzkaller.appspotmail.com
Fixes: 5859034d7e ("[NETFILTER]: x_tables: add RATEEST target")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-27 11:52:26 +01:00
John Wang
427c940558 net/ncsi: Use real net-device for response handler
When aggregating ncsi interfaces and dedicated interfaces to bond
interfaces, the ncsi response handler will use the wrong net device to
find ncsi_dev, so that the ncsi interface will not work properly.
Here, we use the original net device to fix it.

Fixes: 138635cc27 ("net/ncsi: NCSI response packet handler")
Signed-off-by: John Wang <wangzhiqiang.bj@bytedance.com>
Link: https://lore.kernel.org/r/20201223055523.2069-1-wangzhiqiang.bj@bytedance.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-23 12:22:23 -08:00
Petr Machata
826f328e2b net: dcb: Validate netlink message in DCB handler
DCB uses the same handler function for both RTM_GETDCB and RTM_SETDCB
messages. dcb_doit() bounces RTM_SETDCB mesasges if the user does not have
the CAP_NET_ADMIN capability.

However, the operation to be performed is not decided from the DCB message
type, but from the DCB command. Thus DCB_CMD_*_GET commands are used for
reading DCB objects, the corresponding SET and DEL commands are used for
manipulation.

The assumption is that set-like commands will be sent via an RTM_SETDCB
message, and get-like ones via RTM_GETDCB. However, this assumption is not
enforced.

It is therefore possible to manipulate DCB objects without CAP_NET_ADMIN
capability by sending the corresponding command in an RTM_GETDCB message.
That is a bug. Fix it by validating the type of the request message against
the type used for the response.

Fixes: 2f90b8657e ("ixgbe: this patch adds support for DCB to the kernel and ixgbe driver")
Signed-off-by: Petr Machata <me@pmachata.org>
Link: https://lore.kernel.org/r/a2a9b88418f3a58ef211b718f2970128ef9e3793.1608673640.git.me@pmachata.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-23 12:19:48 -08:00
Linus Torvalds
70990afa34 9p for 5.11-rc1
- fix long-standing limitation on open-unlink-fop pattern
 - add refcount to p9_fid (fixes the above and will allow for more
 cleanups and simplifications in the future)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAl/gWOYACgkQq06b7GqY
 5nBcHxAArtfxZ4wGa+OMoWt19UvF0bUeYUYdy75rp6awpXmQyMfMZY9oOrZ9L+St
 mS3oLd8Mq7MeJQ+iGKPPOX+085aNhRtxjQOiHlv02fi09zVyNqj1vptq+rTRkKZL
 8KTs+wYSIX2x1WensDBdohYFiWf87vCWyywpQ/1Vm0585mI8B88N3/H5EpkLbPZn
 1bgXkakgd2hS+rKu6vJGV8lCJ8eKOSpQ3WUK0WQhY6ysJEL9dbl00qyNQ4kq2UgI
 i3CpWC/86297KJOehPmegpVwqTIL5wrCkhQy7sNxDkR9E4Q3VoKTNSrnRDtBI0x/
 mRCEnc/2Pf76Zv3LpGXnLFSnDLTQaO842Hla92RZJCcHaDtGDwV9Znq6sZgn+pdV
 wlEpT383s+7AC/X1EFvLPMr5PHrEM1KW25YEYQI78X2hh8G+utW5p3+vu70b0PyT
 Z7P9yBW4aU4cZb3XcZZihatZG2P1LxHyjNMmn4YBdUeQuitdiJGc2KwjVyKXi9PX
 +t6xxI+YAIhu/5I9fRtOoIDQNTjsRRE+4FzRKd2kvxW5Z5sV/bZCf6CHnQ2uKYVP
 fEMYd0GhjrUABbJPLLdFlBXaV+yigvwKtABDnJ5KSA9NlDSVBGewTRa2I+9DoclL
 yQS4hELuvQsUNCo4AkICixo9vABt0j1OhXlo+B0WVk8Ilh9Wxpc=
 =DKlc
 -----END PGP SIGNATURE-----

Merge tag '9p-for-5.11-rc1' of git://github.com/martinetd/linux

Pull 9p update from Dominique Martinet:

 - fix long-standing limitation on open-unlink-fop pattern

 - add refcount to p9_fid (fixes the above and will allow for more
   cleanups and simplifications in the future)

* tag '9p-for-5.11-rc1' of git://github.com/martinetd/linux:
  9p: Remove unnecessary IS_ERR() check
  9p: Uninitialized variable in v9fs_writeback_fid()
  9p: Fix writeback fid incorrectly being attached to dentry
  9p: apply review requests for fid refcounting
  9p: add refcount to p9_fid struct
  fs/9p: search open fids first
  fs/9p: track open fids
  fs/9p: fix create-unlink-getattr idiom
2020-12-21 10:28:02 -08:00
Shmulik Ladkani
56ce7c25ae xfrm: Fix oops in xfrm_replay_advance_bmp
When setting xfrm replay_window to values higher than 32, a rare
page-fault occurs in xfrm_replay_advance_bmp:

  BUG: unable to handle page fault for address: ffff8af350ad7920
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0002) - not-present page
  PGD ad001067 P4D ad001067 PUD 0
  Oops: 0002 [#1] SMP PTI
  CPU: 3 PID: 30 Comm: ksoftirqd/3 Kdump: loaded Not tainted 5.4.52-050452-generic #202007160732
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
  RIP: 0010:xfrm_replay_advance_bmp+0xbb/0x130
  RSP: 0018:ffffa1304013ba40 EFLAGS: 00010206
  RAX: 000000000000010d RBX: 0000000000000002 RCX: 00000000ffffff4b
  RDX: 0000000000000018 RSI: 00000000004c234c RDI: 00000000ffb3dbff
  RBP: ffffa1304013ba50 R08: ffff8af330ad7920 R09: 0000000007fffffa
  R10: 0000000000000800 R11: 0000000000000010 R12: ffff8af29d6258c0
  R13: ffff8af28b95c700 R14: 0000000000000000 R15: ffff8af29d6258fc
  FS:  0000000000000000(0000) GS:ffff8af339ac0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffff8af350ad7920 CR3: 0000000015ee4000 CR4: 00000000001406e0
  Call Trace:
   xfrm_input+0x4e5/0xa10
   xfrm4_rcv_encap+0xb5/0xe0
   xfrm4_udp_encap_rcv+0x140/0x1c0

Analysis revealed offending code is when accessing:

	replay_esn->bmp[nr] |= (1U << bitnr);

with 'nr' being 0x07fffffa.

This happened in an SMP system when reordering of packets was present;
A packet arrived with a "too old" sequence number (outside the window,
i.e 'diff > replay_window'), and therefore the following calculation:

			bitnr = replay_esn->replay_window - (diff - pos);

yields a negative result, but since bitnr is u32 we get a large unsigned
quantity (in crash dump above: 0xffffff4b seen in ecx).

This was supposed to be protected by xfrm_input()'s former call to:

		if (x->repl->check(x, skb, seq)) {

However, the state's spinlock x->lock is *released* after '->check()'
is performed, and gets re-acquired before '->advance()' - which gives a
chance for a different core to update the xfrm state, e.g. by advancing
'replay_esn->seq' when it encounters more packets - leading to a
'diff > replay_window' situation when original core continues to
xfrm_replay_advance_bmp().

An attempt to fix this issue was suggested in commit bcf66bf54a
("xfrm: Perform a replay check after return from async codepaths"),
by calling 'x->repl->recheck()' after lock is re-acquired, but fix
applied only to asyncronous crypto algorithms.

Augment the fix, by *always* calling 'recheck()' - irrespective if we're
using async crypto.

Fixes: 0ebea8ef35 ("[IPSEC]: Move state lock into x->type->input")
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-12-19 08:12:17 +01:00
Jakub Kicinski
1e72faedcd Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Incorrect loop in error path of nft_set_elem_expr_clone(),
   from Colin Ian King.

2) Missing xt_table_get_private_protected() to access table
   private data in x_tables, from Subash Abhinov Kasiviswanathan.

3) Possible oops in ipset hash type resize, from Vasily Averin.

4) Fix shift-out-of-bounds in ipset hash type, also from Vasily.

* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
  netfilter: ipset: fix shift-out-of-bounds in htable_bits()
  netfilter: ipset: fixes possible oops in mtype_resize
  netfilter: x_tables: Update remaining dereference to RCU
  netfilter: nftables: fix incorrect increment of loop counter
====================

Link: https://lore.kernel.org/r/20201218120409.3659-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-18 18:07:14 -08:00
Davide Caratti
698285da79 net/sched: sch_taprio: ensure to reset/destroy all child qdiscs
taprio_graft() can insert a NULL element in the array of child qdiscs. As
a consquence, taprio_reset() might not reset child qdiscs completely, and
taprio_destroy() might leak resources. Fix it by ensuring that loops that
iterate over q->qdiscs[] don't end when they find the first NULL item.

Fixes: 44d4775ca5 ("net/sched: sch_taprio: reset child qdiscs before freeing them")
Fixes: 5a781ccbd1 ("tc: Add support for configuring the taprio scheduler")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/13edef6778fef03adc751582562fba4a13e06d6a.1608240532.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-18 16:43:29 -08:00
Sonny Sasaka
89e65975fe Bluetooth: Cancel Inquiry before Create Connection
Many controllers do not allow HCI Create Connection while it is doing
Inquiry. This patch adds Inquiry Cancel before Create Connection in this
case to allow the controller to do Create Connection. User space will be
aware of this Inquiry cancellation and they may issue another discovery
request afterwards.

Sample Command Disallowed response of HCI Create Connection:
< HCI Command: Inquiry (0x01|0x0001) plen 5
        Access code: 0x9e8b33 (General Inquiry)
        Length: 10.24s (0x08)
        Num responses: 0
> HCI Event: Command Status (0x0f) plen 4
      Inquiry (0x01|0x0001) ncmd 2
        Status: Success (0x00)
< HCI Command: Create Connection (0x01|0x0005) plen 13
        Address: XX:XX:XX:XX:XX:XX
        Packet type: 0xcc18
        Page scan repetition mode: R2 (0x02)
        Page scan mode: Mandatory (0x00)
        Clock offset: 0x0000
        Role switch: Allow slave (0x01)
> HCI Event: Command Status (0x0f) plen 4
      Create Connection (0x01|0x0005) ncmd 1
        Status: Success (0x00)
> HCI Event: Connect Complete (0x03) plen 11
        Status: Command Disallowed (0x0c)
        Handle: 65535
        Address: XX:XX:XX:XX:XX:XX
        Link type: ACL (0x01)
        Encryption: Disabled (0x00)

Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-12-18 22:46:42 +01:00
Abhishek Pandit-Subedi
36211f7fc1 Bluetooth: Pause service discovery for suspend
Just like MGMT_OP_START_DISCOVERY, we should reject
MGMT_OP_START_SERVICE_DISCOVERY with MGMT_STATUS_BUSY when we are paused
for suspend.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-12-18 22:31:04 +01:00
Abhishek Pandit-Subedi
295fa2a564 Bluetooth: Remove hci_req_le_suspend_config
Add a missing SUSPEND_SCAN_ENABLE in passive scan, remove the separate
function for configuring le scan during suspend and update the request
complete function to clear both enable and disable tasks.

Fixes: dce0a4be80 ("Bluetooth: Set missing suspend task bits")
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-12-18 22:26:59 +01:00
Baruch Siach
abdcd06c4d net: af_packet: fix procfs header for 64-bit pointers
On 64-bit systems the packet procfs header field names following 'sk'
are not aligned correctly:

sk       RefCnt Type Proto  Iface R Rmem   User   Inode
00000000605d2c64 3      3    0003   7     1 450880 0      16643
00000000080e9b80 2      2    0000   0     0 0      0      17404
00000000b23b8a00 2      2    0000   0     0 0      0      17421
...

With this change field names are correctly aligned:

sk               RefCnt Type Proto  Iface R Rmem   User   Inode
000000005c3b1d97 3      3    0003   7     1 21568  0      16178
000000007be55bb7 3      3    fbce   8     1 0      0      16250
00000000be62127d 3      3    fbcd   8     1 0      0      16254
...

Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Link: https://lore.kernel.org/r/54917251d8433735d9a24e935a6cb8eb88b4058a.1608103684.git.baruch@tkos.co.il
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-18 12:17:23 -08:00
Chuck Lever
4a85a6a332 SUNRPC: Handle TCP socket sends with kernel_sendpage() again
Daire Byrne reports a ~50% aggregrate throughput regression on his
Linux NFS server after commit da1661b93b ("SUNRPC: Teach server to
use xprt_sock_sendmsg for socket sends"), which replaced
kernel_send_page() calls in NFSD's socket send path with calls to
sock_sendmsg() using iov_iter.

Investigation showed that tcp_sendmsg() was not using zero-copy to
send the xdr_buf's bvec pages, but instead was relying on memcpy.
This means copying every byte of a large NFS READ payload.

It looks like TLS sockets do indeed support a ->sendpage method,
so it's really not necessary to use xprt_sock_sendmsg() to support
TLS fully on the server. A mechanical reversion of da1661b93b is
not possible at this point, but we can re-implement the server's
TCP socket sendmsg path using kernel_sendpage().

Reported-by: Daire Byrne <daire@dneg.com>
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209439
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-12-18 12:28:41 -05:00
Magnus Karlsson
b1b95cb5c0 xsk: Rollback reservation at NETDEV_TX_BUSY
Rollback the reservation in the completion ring when we get a
NETDEV_TX_BUSY. When this error is received from the driver, we are
supposed to let the user application retry the transmit again. And in
order to do this, we need to roll back the failed send so it can be
retried. Unfortunately, we did not cancel the reservation we had made
in the completion ring. By not doing this, we actually make the
completion ring one entry smaller per NETDEV_TX_BUSY error we get, and
after enough of these errors the completion ring will be of size zero
and transmit will stop working.

Fix this by cancelling the reservation when we get a NETDEV_TX_BUSY
error.

Fixes: 642e450b6b ("xsk: Do not discard packet when NETDEV_TX_BUSY")
Reported-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20201218134525.13119-3-magnus.karlsson@gmail.com
2020-12-18 16:10:21 +01:00
Magnus Karlsson
f09ced4053 xsk: Fix race in SKB mode transmit with shared cq
Fix a race when multiple sockets are simultaneously calling sendto()
when the completion ring is shared in the SKB case. This is the case
when you share the same netdev and queue id through the
XDP_SHARED_UMEM bind flag. The problem is that multiple processes can
be in xsk_generic_xmit() and call the backpressure mechanism in
xskq_prod_reserve(xs->pool->cq). As this is a shared resource in this
specific scenario, a race might occur since the rings are
single-producer single-consumer.

Fix this by moving the tx_completion_lock from the socket to the pool
as the pool is shared between the sockets that share the completion
ring. (The pool is not shared when this is not the case.) And then
protect the accesses to xskq_prod_reserve() with this lock. The
tx_completion_lock is renamed cq_lock to better reflect that it
protects accesses to the potentially shared completion ring.

Fixes: 35fcde7f8d ("xsk: support for Tx")
Reported-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20201218134525.13119-2-magnus.karlsson@gmail.com
2020-12-18 16:10:21 +01:00
Magnus Karlsson
8bee683384 xsk: Fix memory leak for failed bind
Fix a possible memory leak when a bind of an AF_XDP socket fails. When
the fill and completion rings are created, they are tied to the
socket. But when the buffer pool is later created at bind time, the
ownership of these two rings are transferred to the buffer pool as
they might be shared between sockets (and the buffer pool cannot be
created until we know what we are binding to). So, before the buffer
pool is created, these two rings are cleaned up with the socket, and
after they have been transferred they are cleaned up together with
the buffer pool.

The problem is that ownership was transferred before it was absolutely
certain that the buffer pool could be created and initialized
correctly and when one of these errors occurred, the fill and
completion rings did neither belong to the socket nor the pool and
where therefore leaked. Solve this by moving the ownership transfer
to the point where the buffer pool has been completely set up and
there is no way it can fail.

Fixes: 7361f9c3d7 ("xsk: Move fill and completion rings to buffer pool")
Reported-by: syzbot+cfa88ddd0655afa88763@syzkaller.appspotmail.com
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20201214085127.3960-1-magnus.karlsson@gmail.com
2020-12-17 22:48:55 +01:00
Linus Torvalds
d64c6f96ba Networking fixes for 5.11-rc1.
Current release - always broken:
 
  - net/smc: fix access to parent of an ib device
 
  - devlink: use _BITUL() macro instead of BIT() in the UAPI header
 
  - handful of mptcp fixes
 
 Previous release - regressions:
 
  - intel: AF_XDP: clear the status bits for the next_to_use descriptor
 
  - dpaa2-eth: fix the size of the mapped SGT buffer
 
 Previous release - always broken:
 
  - mptcp: fix security context on server socket
 
  - ethtool: fix string set id check
 
  - ethtool: fix error paths in ethnl_set_channels()
 
  - lan743x: fix rx_napi_poll/interrupt ping-pong
 
  - qca: ar9331: fix sleeping function called from invalid context bug
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl/bxLMACgkQMUZtbf5S
 IrtTkRAAoxZFOsuXeTejJqUPlUwxCfAWb4HhND9k2xLNqi2H85b16v56Fy/R6AzK
 AKVHYFn0a0BetcicpjNZmV8bC7Xeu7YaEF8OmzBZ0TQ2OiCSHdMWFUOlo/2Q9ogN
 xc0q4umaq10SekUEeanOyHta5Y9YEVaiRk/y6Eue5FGo9jbCbgDXKoxBvFzwlNkr
 yuXFZOvY9CwlzKuBWFnpBNP6tizdG2q42JDif/v1nF0wLSEt7KkwKPOButusWbjd
 WdqeZ6vazg2z6/PLJ/GCgr8vGrF5Ublp8uccz0kfTW3Cmu2jZH8SxsdaUcL2WRiR
 rqnu0sYe6grQCF78lUZIpJp5K5TMpXI5UtjN2Kwv1J/yo8rSXMFqTNgdWNLxUkTF
 VzhoufTrjQIR1ERAzEWUdz2JA2VyIoJ5YrqHFPwJworlpKDKWHVjn4rMmsxValoN
 G+F3BzEBUlPbsV8IQa0uD9tW2qxtZ1g51uueKNkX9s5m/7mXdRoIFD6JWqoY4vg6
 8oWhc3xpGry3u+pOSOWHMM1FVDpgiclF6ybuERcTHFcn821RdiLtfIaRKQQG72z2
 oYoadbbwrR3CnYmYY0l/7LExYSuW0kaPPiuAPKRZiPb1vK3qv0X50GXylyFYdSfT
 JPytadfHz5xo9bVE8x1sDCTkwNGPdWs+w6fyHyHNGp2EZVSJyno=
 =nTjM
 -----END PGP SIGNATURE-----

Merge tag 'net-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Current release - always broken:

   - net/smc: fix access to parent of an ib device

   - devlink: use _BITUL() macro instead of BIT() in the UAPI header

   - handful of mptcp fixes

  Previous release - regressions:

   - intel: AF_XDP: clear the status bits for the next_to_use descriptor

   - dpaa2-eth: fix the size of the mapped SGT buffer

  Previous release - always broken:

   - mptcp: fix security context on server socket

   - ethtool: fix string set id check

   - ethtool: fix error paths in ethnl_set_channels()

   - lan743x: fix rx_napi_poll/interrupt ping-pong

   - qca: ar9331: fix sleeping function called from invalid context bug"

* tag 'net-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (32 commits)
  net/sched: sch_taprio: reset child qdiscs before freeing them
  nfp: move indirect block cleanup to flower app stop callback
  octeontx2-af: Fix undetected unmap PF error check
  net: nixge: fix spelling mistake in Kconfig: "Instuments" -> "Instruments"
  qlcnic: Fix error code in probe
  mptcp: fix pending data accounting
  mptcp: push pending frames when subflow has free space
  mptcp: properly annotate nested lock
  mptcp: fix security context on server socket
  net/mlx5: Fix compilation warning for 32-bit platform
  mptcp: clear use_ack and use_map when dropping other suboptions
  devlink: use _BITUL() macro instead of BIT() in the UAPI header
  net: korina: fix return value
  net/smc: fix access to parent of an ib device
  ethtool: fix error paths in ethnl_set_channels()
  nfc: s3fwrn5: Remove unused NCI prop commands
  nfc: s3fwrn5: Remove the delay for NFC sleep
  phy: fix kdoc warning
  tipc: do sanity check payload of a netlink message
  use __netdev_notify_peers in hyperv
  ...
2020-12-17 13:45:24 -08:00
Linus Torvalds
74f602dc96 NFS client updates for Linux 5.11
Highlights include:
 
 Features:
 - NFSv3: Add emulation of lookupp() to improve open_by_filehandle()
   support.
 - A series of patches to improve readdir performance, particularly with
   large directories.
 - Basic support for using NFS/RDMA with the pNFS files and flexfiles
   drivers.
 - Micro-optimisations for RDMA.
 - RDMA tracing improvements.
 
 Bugfixes:
 - Fix a long standing bug with xs_read_xdr_buf() when receiving partial
   pages (Dan Aloni).
 - Various fixes for getxattr and listxattr, when used over non-TCP
   transports.
 - Fixes for containerised NFS from Sargun Dhillon.
 - switch nfsiod to be an UNBOUND workqueue (Neil Brown).
 - READDIR should not ask for security label information if there is no
   LSM policy. (Olga Kornievskaia)
 - Avoid using interval-based rebinding with TCP in lockd (Calum Mackay).
 - A series of RPC and NFS layer fixes to support the NFSv4.2 READ_PLUS code.
 - A couple of fixes for pnfs/flexfiles read failover
 
 Cleanups:
 - Various cleanups for the SUNRPC xdr code in conjunction with the
   READ_PLUS fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl/aiaIACgkQZwvnipYK
 APIOihAAvONscxrFSaGRh2ICNv9I/zXW/A5+R3qnkESPVLTqTPJVphoN7FlINAr1
 B74pg6n4T4viycbvsogU2+kHrlJZO7B8lTkJL7ynm9Wgyw8+2Ga4QEn1bsAoqmuY
 b91p/+LfOLKrYeeojoH31PC73uOYYG1WHXJhjq0l9b5CTgThWpj6O3gDaFEbFvmz
 A7V3yqSp04sV70YxUhwelBHZ5BXdiXIKsPnIwvXXHuY7IcamrE4EA3wGCwtxkBnu
 4dwbOtRXURNSev0r3n6FsH4wZl+/nvp9UpnGdPtVv94F1zm2JKLwkhoJejS/vpjq
 eyKc7ZXBQ0uHbTWI2Yj1YjA61VIUO0R0EDuyTAnRKDeaarID42n5kMG7J8cIglZR
 jQfyx99xm0eSrdwxC09tcRL/lBzYcOfc6pJo5P9BtaFtRvbp9iFIHuFKlrXbULd4
 WrZzDMhiKVYGSTcTpfQyVoK2rCvn6W1Ida4iYeI0gkJ1v9X90UhbtJOyggn/bxyL
 DV/Qy40+l48n7CZfPU2eDv4WXqjKGRibpDoWMBLwUH20dDEX6kKYv3BfApFYGqyO
 /GTPAFUZarCy8BENvzZv/Jb9mt5pDQM5p9ZXpdUOhydLMMA+pauaT/Gr+pAHPIPx
 MPj546Gh2cEaT883xvRrJmQTG0nw/WscPNcHaJcgL5oYltmuwck=
 =IKWG
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.11-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Features:

   - NFSv3: Add emulation of lookupp() to improve open_by_filehandle()
     support

   - A series of patches to improve readdir performance, particularly
     with large directories

   - Basic support for using NFS/RDMA with the pNFS files and flexfiles
     drivers

   - Micro-optimisations for RDMA

   - RDMA tracing improvements

  Bugfixes:

   - Fix a long standing bug with xs_read_xdr_buf() when receiving
     partial pages (Dan Aloni)

   - Various fixes for getxattr and listxattr, when used over non-TCP
     transports

   - Fixes for containerised NFS from Sargun Dhillon

   - switch nfsiod to be an UNBOUND workqueue (Neil Brown)

   - READDIR should not ask for security label information if there is
     no LSM policy (Olga Kornievskaia)

   - Avoid using interval-based rebinding with TCP in lockd (Calum
     Mackay)

   - A series of RPC and NFS layer fixes to support the NFSv4.2
     READ_PLUS code

   - A couple of fixes for pnfs/flexfiles read failover

  Cleanups:

   - Various cleanups for the SUNRPC xdr code in conjunction with the
     READ_PLUS fixes"

* tag 'nfs-for-5.11-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (90 commits)
  NFS/pNFS: Fix a typo in ff_layout_resend_pnfs_read()
  pNFS/flexfiles: Avoid spurious layout returns in ff_layout_choose_ds_for_read
  NFSv4/pnfs: Add tracing for the deviceid cache
  fs/lockd: convert comma to semicolon
  NFSv4.2: fix error return on memory allocation failure
  NFSv4.2/pnfs: Don't use READ_PLUS with pNFS yet
  NFSv4.2: Deal with potential READ_PLUS data extent buffer overflow
  NFSv4.2: Don't error when exiting early on a READ_PLUS buffer overflow
  NFSv4.2: Handle hole lengths that exceed the READ_PLUS read buffer
  NFSv4.2: decode_read_plus_hole() needs to check the extent offset
  NFSv4.2: decode_read_plus_data() must skip padding after data segment
  NFSv4.2: Ensure we always reset the result->count in decode_read_plus()
  SUNRPC: When expanding the buffer, we may need grow the sparse pages
  SUNRPC: Cleanup - constify a number of xdr_buf helpers
  SUNRPC: Clean up open coded setting of the xdr_stream 'nwords' field
  SUNRPC: _copy_to/from_pages() now check for zero length
  SUNRPC: Cleanup xdr_shrink_bufhead()
  SUNRPC: Fix xdr_expand_hole()
  SUNRPC: Fixes for xdr_align_data()
  SUNRPC: _shift_data_left/right_pages should check the shift length
  ...
2020-12-17 12:15:03 -08:00
Linus Torvalds
be695ee29e The big ticket item here is support for msgr2 on-wire protocol, which
adds the option of full in-transit encryption using AES-GCM algorithm
 (myself).  On top of that we have a series to avoid intermittent
 errors during recovery with recover_session=clean and some MDS request
 encoding work from Jeff, a cap handling fix and assorted observability
 improvements from Luis and Xiubo and a good number of cleanups.  Luis
 also ran into a corner case with quotas which sadly means that we are
 back to denying cross-quota-realm renames.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl/beWITHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi4i0CACnvd87l2n7dndig7p5d5lVsmo8tAFs
 wHYHaIVisWKMcqKoT+YJajSgzaonxjzvYiyCzwLxV7s7vI7cswAwjEfYT7tTDRp2
 pnO1+4N/1ftznnTk/1QdqwOQLUg5UtdgWvFCaXQF+Vr/YroZomKJPaK8fXK882pC
 9FBjoLNy1HWySsoXPCxJktmDzpEEyYRNJg0vquxm7mxwTgQErupWlwEFjNg5LBkm
 gC0UoKhCE3DeUrXnoq21Ga62RIajxHofTooNx7dg+JiSVgluW+nORaWDYJXNzwLC
 j5puSe4pWIah+gmcwIFuyNz4ddkvVL4URvsYPGkVFYXlEefQjErc10Jh
 =6b9f
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.11-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The big ticket item here is support for msgr2 on-wire protocol, which
  adds the option of full in-transit encryption using AES-GCM algorithm
  (myself).

  On top of that we have a series to avoid intermittent errors during
  recovery with recover_session=clean and some MDS request encoding work
  from Jeff, a cap handling fix and assorted observability improvements
  from Luis and Xiubo and a good number of cleanups.

  Luis also ran into a corner case with quotas which sadly means that we
  are back to denying cross-quota-realm renames"

* tag 'ceph-for-5.11-rc1' of git://github.com/ceph/ceph-client: (59 commits)
  libceph: drop ceph_auth_{create,update}_authorizer()
  libceph, ceph: make use of __ceph_auth_get_authorizer() in msgr1
  libceph, ceph: implement msgr2.1 protocol (crc and secure modes)
  libceph: introduce connection modes and ms_mode option
  libceph, rbd: ignore addr->type while comparing in some cases
  libceph, ceph: get and handle cluster maps with addrvecs
  libceph: factor out finish_auth()
  libceph: drop ac->ops->name field
  libceph: amend cephx init_protocol() and build_request()
  libceph, ceph: incorporate nautilus cephx changes
  libceph: safer en/decoding of cephx requests and replies
  libceph: more insight into ticket expiry and invalidation
  libceph: move msgr1 protocol specific fields to its own struct
  libceph: move msgr1 protocol implementation to its own file
  libceph: separate msgr1 protocol implementation
  libceph: export remaining protocol independent infrastructure
  libceph: export zero_page
  libceph: rename and export con->flags bits
  libceph: rename and export con->state states
  libceph: make con->state an int
  ...
2020-12-17 11:53:52 -08:00
Davide Caratti
44d4775ca5 net/sched: sch_taprio: reset child qdiscs before freeing them
syzkaller shows that packets can still be dequeued while taprio_destroy()
is running. Let sch_taprio use the reset() function to cancel the advance
timer and drop all skbs from the child qdiscs.

Fixes: 5a781ccbd1 ("tc: Add support for configuring the taprio scheduler")
Link: https://syzkaller.appspot.com/bug?id=f362872379bf8f0017fb667c1ab158f2d1e764ae
Reported-by: syzbot+8971da381fb5a31f542d@syzkaller.appspotmail.com
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Link: https://lore.kernel.org/r/63b6d79b0e830ebb0283e020db4df3cdfdfb2b94.1608142843.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-17 10:57:57 -08:00
Vasily Averin
5c8193f568 netfilter: ipset: fix shift-out-of-bounds in htable_bits()
htable_bits() can call jhash_size(32) and trigger shift-out-of-bounds

UBSAN: shift-out-of-bounds in net/netfilter/ipset/ip_set_hash_gen.h:151:6
shift exponent 32 is too large for 32-bit type 'unsigned int'
CPU: 0 PID: 8498 Comm: syz-executor519
 Not tainted 5.10.0-rc7-next-20201208-syzkaller #0
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0x107/0x163 lib/dump_stack.c:120
 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395
 htable_bits net/netfilter/ipset/ip_set_hash_gen.h:151 [inline]
 hash_mac_create.cold+0x58/0x9b net/netfilter/ipset/ip_set_hash_gen.h:1524
 ip_set_create+0x610/0x1380 net/netfilter/ipset/ip_set_core.c:1115
 nfnetlink_rcv_msg+0xecc/0x1180 net/netfilter/nfnetlink.c:252
 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
 nfnetlink_rcv+0x1ac/0x420 net/netfilter/nfnetlink.c:600
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0x907/0xe40 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:672
 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2345
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2399
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2432
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

This patch replaces htable_bits() by simple fls(hashsize - 1) call:
it alone returns valid nbits both for round and non-round hashsizes.
It is normal to set any nbits here because it is validated inside
following htable_size() call which returns 0 for nbits>31.

Fixes: 1feab10d7e6d("netfilter: ipset: Unified hash type generation")
Reported-by: syzbot+d66bfadebca46cf61a2b@syzkaller.appspotmail.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-17 19:44:52 +01:00
Vasily Averin
2b33d6ffa9 netfilter: ipset: fixes possible oops in mtype_resize
currently mtype_resize() can cause oops

        t = ip_set_alloc(htable_size(htable_bits));
        if (!t) {
                ret = -ENOMEM;
                goto out;
        }
        t->hregion = ip_set_alloc(ahash_sizeof_regions(htable_bits));

Increased htable_bits can force htable_size() to return 0.
In own turn ip_set_alloc(0) returns not 0 but ZERO_SIZE_PTR,
so follwoing access to t->hregion should trigger an OOPS.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-17 19:44:52 +01:00
Subash Abhinov Kasiviswanathan
443d6e86f8 netfilter: x_tables: Update remaining dereference to RCU
This fixes the dereference to fetch the RCU pointer when holding
the appropriate xtables lock.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: cc00bcaa58 ("netfilter: x_tables: Switch synchronization to RCU")
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-17 19:44:52 +01:00
Paolo Abeni
13e1603739 mptcp: fix pending data accounting
When sendmsg() needs to wait for memory, the pending data
is not updated. That causes a drift in forward memory allocation,
leading to stall and/or warnings at socket close time.

This change addresses the above issue moving the pending data
counter update inside the sendmsg() main loop.

Fixes: 6e628cd3a8 ("mptcp: use mptcp release_cb for delayed tasks")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-17 10:24:47 -08:00
Paolo Abeni
219d04992b mptcp: push pending frames when subflow has free space
When multiple subflows are active, we can receive a
window update on subflow with no write space available.
MPTCP will try to push frames on such subflow and will
fail. Pending frames will be pushed only after receiving
a window update on a subflow with some wspace available.

Overall the above could lead to suboptimal aggregate
bandwidth usage.

Instead, we should try to push pending frames as soon as
the subflow reaches both conditions mentioned above.

We can finally enable self-tests with asymmetric links,
as the above makes them finally pass.

Fixes: 6f8a612a33 ("mptcp: keep track of advertised windows right edge")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-17 10:24:47 -08:00
Paolo Abeni
3f8b2667f2 mptcp: properly annotate nested lock
MPTCP closes the subflows while holding the msk-level lock.
While acquiring the subflow socket lock we need to use the
correct nested annotation, or we can hit a lockdep splat
at runtime.

Reported-and-tested-by: Geliang Tang <geliangtang@gmail.com>
Fixes: e16163b6e2 ("mptcp: refactor shutdown and close")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-17 10:24:47 -08:00
Paolo Abeni
0c14846032 mptcp: fix security context on server socket
Currently MPTCP is not propagating the security context
from the ingress request socket to newly created msk
at clone time.

Address the issue invoking the missing security helper.

Fixes: cf7da0d66c ("mptcp: Create SUBFLOW socket for incoming connections")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-17 10:24:47 -08:00
Geliang Tang
3ae32c0781 mptcp: clear use_ack and use_map when dropping other suboptions
This patch cleared use_ack and use_map when dropping other suboptions to
fix the following syzkaller BUG:

[   15.223006] BUG: unable to handle page fault for address: 0000000000223b10
[   15.223700] #PF: supervisor read access in kernel mode
[   15.224209] #PF: error_code(0x0000) - not-present page
[   15.224724] PGD b8d5067 P4D b8d5067 PUD c0a5067 PMD 0
[   15.225237] Oops: 0000 [#1] SMP
[   15.225556] CPU: 0 PID: 7747 Comm: syz-executor Not tainted 5.10.0-rc6+ #24
[   15.226281] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[   15.227292] RIP: 0010:skb_release_data+0x89/0x1e0
[   15.227816] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f e9 02 06 8a ff e8 fd 05 8a ff 45 31 ed 80 7d 02 00 4c 8d 65 30 74 55 e8 eb 05 8a ff 49 8b 1c 24 <4c> 8b 7b 08 41 f6 c7 01 0f 85 18 01 00 00 e8 d4 05 8a ff 8b 43 34
[   15.229669] RSP: 0018:ffffc900019c7c08 EFLAGS: 00010293
[   15.230188] RAX: ffff88800daad900 RBX: 0000000000223b08 RCX: 0000000000000006
[   15.230895] RDX: 0000000000000000 RSI: ffffffff818e06c5 RDI: ffff88807f6dc700
[   15.231593] RBP: ffff88807f71a4c0 R08: 0000000000000001 R09: 0000000000000001
[   15.232299] R10: ffffc900019c7c18 R11: 0000000000000000 R12: ffff88807f71a4f0
[   15.233007] R13: 0000000000000000 R14: ffff88807f6dc700 R15: 0000000000000002
[   15.233714] FS:  00007f65d9b5f700(0000) GS:ffff88807c400000(0000) knlGS:0000000000000000
[   15.234509] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.235081] CR2: 0000000000223b10 CR3: 000000000b883000 CR4: 00000000000006f0
[   15.235788] Call Trace:
[   15.236042]  skb_release_all+0x28/0x30
[   15.236419]  __kfree_skb+0x11/0x20
[   15.236768]  tcp_data_queue+0x270/0x1240
[   15.237161]  ? tcp_urg+0x50/0x2a0
[   15.237496]  tcp_rcv_established+0x39a/0x890
[   15.237997]  ? mark_held_locks+0x49/0x70
[   15.238467]  tcp_v4_do_rcv+0xb9/0x270
[   15.238915]  __release_sock+0x8a/0x160
[   15.239365]  release_sock+0x32/0xd0
[   15.239793]  __inet_stream_connect+0x1d2/0x400
[   15.240313]  ? do_wait_intr_irq+0x80/0x80
[   15.240791]  inet_stream_connect+0x36/0x50
[   15.241275]  mptcp_stream_connect+0x69/0x1b0
[   15.241787]  __sys_connect+0x122/0x140
[   15.242236]  ? syscall_enter_from_user_mode+0x17/0x50
[   15.242836]  ? lockdep_hardirqs_on_prepare+0xd4/0x170
[   15.243436]  __x64_sys_connect+0x1a/0x20
[   15.243924]  do_syscall_64+0x33/0x40
[   15.244313]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   15.244821] RIP: 0033:0x7f65d946e469
[   15.245183] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ff 49 2b 00 f7 d8 64 89 01 48
[   15.247019] RSP: 002b:00007f65d9b5eda8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
[   15.247770] RAX: ffffffffffffffda RBX: 000000000049bf00 RCX: 00007f65d946e469
[   15.248471] RDX: 0000000000000010 RSI: 00000000200000c0 RDI: 0000000000000005
[   15.249205] RBP: 000000000049bf00 R08: 0000000000000000 R09: 0000000000000000
[   15.249908] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000049bf0c
[   15.250603] R13: 00007fffe8a25cef R14: 00007f65d9b3f000 R15: 0000000000000003
[   15.251312] Modules linked in:
[   15.251626] CR2: 0000000000223b10
[   15.251965] BUG: kernel NULL pointer dereference, address: 0000000000000048
[   15.252005] ---[ end trace f5c51fe19123c773 ]---
[   15.252822] #PF: supervisor read access in kernel mode
[   15.252823] #PF: error_code(0x0000) - not-present page
[   15.252825] PGD c6c6067 P4D c6c6067 PUD c0d8067
[   15.253294] RIP: 0010:skb_release_data+0x89/0x1e0
[   15.253910] PMD 0
[   15.253914] Oops: 0000 [#2] SMP
[   15.253917] CPU: 1 PID: 7746 Comm: syz-executor Tainted: G      D           5.10.0-rc6+ #24
[   15.253920] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[   15.254435] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f e9 02 06 8a ff e8 fd 05 8a ff 45 31 ed 80 7d 02 00 4c 8d 65 30 74 55 e8 eb 05 8a ff 49 8b 1c 24 <4c> 8b 7b 08 41 f6 c7 01 0f 85 18 01 00 00 e8 d4 05 8a ff 8b 43 34
[   15.254899] RIP: 0010:skb_release_data+0x89/0x1e0
[   15.254902] Code: 5b 5d 41 5c 41 5d 41 5e 41 5f e9 02 06 8a ff e8 fd 05 8a ff 45 31 ed 80 7d 02 00 4c 8d 65 30 74 55 e8 eb 05 8a ff 49 8b 1c 24 <4c> 8b 7b 08 41 f6 c7 01 0f 85 18 01 00 00 e8 d4 05 8a ff 8b 43 34
[   15.254905] RSP: 0018:ffffc900019bfc08 EFLAGS: 00010293
[   15.255376] RSP: 0018:ffffc900019c7c08 EFLAGS: 00010293
[   15.255580]
[   15.255583] RAX: ffff888004a7ac80 RBX: 0000000000000040 RCX: 0000000000000000
[   15.255912]
[   15.256724] RDX: 0000000000000000 RSI: ffffffff818e06c5 RDI: ffff88807f6ddd00
[   15.257620] RAX: ffff88800daad900 RBX: 0000000000223b08 RCX: 0000000000000006
[   15.259817] RBP: ffff88800e9006c0 R08: 0000000000000000 R09: 0000000000000000
[   15.259818] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88800e9006f0
[   15.259820] R13: 0000000000000000 R14: ffff88807f6ddd00 R15: 0000000000000002
[   15.259822] FS:  00007fae4a60a700(0000) GS:ffff88807c500000(0000) knlGS:0000000000000000
[   15.259826] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.260296] RDX: 0000000000000000 RSI: ffffffff818e06c5 RDI: ffff88807f6dc700
[   15.262514] CR2: 0000000000000048 CR3: 000000000b89c000 CR4: 00000000000006e0
[   15.262515] Call Trace:
[   15.262519]  skb_release_all+0x28/0x30
[   15.262523]  __kfree_skb+0x11/0x20
[   15.263054] RBP: ffff88807f71a4c0 R08: 0000000000000001 R09: 0000000000000001
[   15.263680]  tcp_data_queue+0x270/0x1240
[   15.263843] R10: ffffc900019c7c18 R11: 0000000000000000 R12: ffff88807f71a4f0
[   15.264693]  ? tcp_urg+0x50/0x2a0
[   15.264856] R13: 0000000000000000 R14: ffff88807f6dc700 R15: 0000000000000002
[   15.265720]  tcp_rcv_established+0x39a/0x890
[   15.266438] FS:  00007f65d9b5f700(0000) GS:ffff88807c400000(0000) knlGS:0000000000000000
[   15.267283]  ? __schedule+0x3fa/0x880
[   15.267287]  tcp_v4_do_rcv+0xb9/0x270
[   15.267290]  __release_sock+0x8a/0x160
[   15.268049] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.268788]  release_sock+0x32/0xd0
[   15.268791]  __inet_stream_connect+0x1d2/0x400
[   15.268795]  ? do_wait_intr_irq+0x80/0x80
[   15.269593] CR2: 0000000000223b10 CR3: 000000000b883000 CR4: 00000000000006f0
[   15.270246]  inet_stream_connect+0x36/0x50
[   15.270250]  mptcp_stream_connect+0x69/0x1b0
[   15.270253]  __sys_connect+0x122/0x140
[   15.271097] Kernel panic - not syncing: Fatal exception
[   15.271820]  ? syscall_enter_from_user_mode+0x17/0x50
[   15.283542]  ? lockdep_hardirqs_on_prepare+0xd4/0x170
[   15.284275]  __x64_sys_connect+0x1a/0x20
[   15.284853]  do_syscall_64+0x33/0x40
[   15.285369]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   15.286105] RIP: 0033:0x7fae49f19469
[   15.286638] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ff 49 2b 00 f7 d8 64 89 01 48
[   15.289295] RSP: 002b:00007fae4a609da8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
[   15.290375] RAX: ffffffffffffffda RBX: 000000000049bf00 RCX: 00007fae49f19469
[   15.291403] RDX: 0000000000000010 RSI: 00000000200000c0 RDI: 0000000000000005
[   15.292437] RBP: 000000000049bf00 R08: 0000000000000000 R09: 0000000000000000
[   15.293456] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000049bf0c
[   15.294473] R13: 00007fff0004b6bf R14: 00007fae4a5ea000 R15: 0000000000000003
[   15.295492] Modules linked in:
[   15.295944] CR2: 0000000000000048
[   15.296567] Kernel Offset: disabled
[   15.296941] ---[ end Kernel panic - not syncing: Fatal exception ]---

Reported-by: Christoph Paasch <cpaasch@apple.com>
Fixes: 84dfe3677a (mptcp: send out dedicated ADD_ADDR packet)
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Link: https://lore.kernel.org/r/ccca4e8f01457a1b495c5d612ed16c5f7a585706.1608010058.git.geliangtang@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-16 16:20:38 -08:00
Linus Torvalds
009bd55dfc RDMA 5.11 pull request
A smaller set of patches, nothing stands out as being particularly major
 this cycle:
 
 - Driver bug fixes and updates: bnxt_re, cxgb4, rxe, hns, i40iw, cxgb4,
   mlx4 and mlx5
 
 - Bug fixes and polishing for the new rts ULP
 
 - Cleanup of uverbs checking for allowed driver operations
 
 - Use sysfs_emit all over the place
 
 - Lots of bug fixes and clarity improvements for hns
 
 - hip09 support for hns
 
 - NDR and 50/100Gb signaling rates
 
 - Remove dma_virt_ops and go back to using the IB DMA wrappers
 
 - mlx5 optimizations for contiguous DMA regions
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl/aNXUACgkQOG33FX4g
 mxqlMQ/+O6UhxKnDAnMB+HzDGvOm+KXNHOQBuzxz4ZWXqtUrW8WU5ca3PhXovc4z
 /QX0HhMhQmVsva5mjp1OGVATxQ2E+yasqFLg4QXAFWFR3N7s0u/sikE9i1DoPvOC
 lsmLTeRauCFaE4mJD5nvYwm+riECX0GmyVVW7v6V05xwAp0hwdhyU7Kb6Yh3lxsE
 umTz+onPNJcD6Tc4snziyC5QEp5ebEjAaj4dVI1YPR5X0c2RwC5E1CIDI6u4OQ2k
 j7/+Kvo8LNdYNERGiR169x6c1L7WS6dYnGMMeXRgyy0BVbVdRGDnvCV9VRmF66w5
 99fHfDjNMNmqbGNt/4/gwNdVrR9aI4jMZWCh7SmsguX6XwNOlhYldy3x3WnlkfkQ
 e4O0huJceJqcB2Uya70GqufnAetRXsbjzcvWxpR5YAwRmcRkm1f6aGK3BxPjWEbr
 BbYRpiKMxxT4yTe65BuuThzx6g4pNQHe0z3BM/dzMJQAX+PZcs1CPQR8F8PbCrZR
 Ad7qw4HJ587PoSxPi3toVMpYZRP6cISh1zx9q/JCj8cxH9Ri4MovUCS3cF63Ny3B
 1LJ2q0x8FuLLjgZJogKUyEkS8OO6q7NL8WumjvrYWWx19+jcYsV81jTRGSkH3bfY
 F7Esv5K2T1F2gVsCe1ZFFplQg6ja1afIcc+LEl8cMJSyTdoSub4=
 =9t8b
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "A smaller set of patches, nothing stands out as being particularly
  major this cycle. The biggest item would be the new HIP09 HW support
  from HNS, otherwise it was pretty quiet for new work here:

   - Driver bug fixes and updates: bnxt_re, cxgb4, rxe, hns, i40iw,
     cxgb4, mlx4 and mlx5

   - Bug fixes and polishing for the new rts ULP

   - Cleanup of uverbs checking for allowed driver operations

   - Use sysfs_emit all over the place

   - Lots of bug fixes and clarity improvements for hns

   - hip09 support for hns

   - NDR and 50/100Gb signaling rates

   - Remove dma_virt_ops and go back to using the IB DMA wrappers

   - mlx5 optimizations for contiguous DMA regions"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (147 commits)
  RDMA/cma: Don't overwrite sgid_attr after device is released
  RDMA/mlx5: Fix MR cache memory leak
  RDMA/rxe: Use acquire/release for memory ordering
  RDMA/hns: Simplify AEQE process for different types of queue
  RDMA/hns: Fix inaccurate prints
  RDMA/hns: Fix incorrect symbol types
  RDMA/hns: Clear redundant variable initialization
  RDMA/hns: Fix coding style issues
  RDMA/hns: Remove unnecessary access right set during INIT2INIT
  RDMA/hns: WARN_ON if get a reserved sl from users
  RDMA/hns: Avoid filling sl in high 3 bits of vlan_id
  RDMA/hns: Do shift on traffic class when using RoCEv2
  RDMA/hns: Normalization the judgment of some features
  RDMA/hns: Limit the length of data copied between kernel and userspace
  RDMA/mlx4: Remove bogus dev_base_lock usage
  RDMA/uverbs: Fix incorrect variable type
  RDMA/core: Do not indicate device ready when device enablement fails
  RDMA/core: Clean up cq pool mechanism
  RDMA/core: Update kernel documentation for ib_create_named_qp()
  MAINTAINERS: SOFT-ROCE: Change Zhu Yanjun's email address
  ...
2020-12-16 13:42:26 -08:00
Karsten Graul
995433b795 net/smc: fix access to parent of an ib device
The parent of an ib device is used to retrieve the PCI device
attributes. It turns out that there are possible cases when an ib device
has no parent set in the device structure, which may lead to page
faults when trying to access this memory.
Fix that by checking the parent pointer and consolidate the pci device
specific processing in a new function.

Fixes: a3db10efcc ("net/smc: Add support for obtaining SMCR device list")
Reported-by: syzbot+600fef7c414ee7e2d71b@syzkaller.appspotmail.com
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/20201215091058.49354-2-kgraul@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-16 13:33:47 -08:00
Ivan Vecera
ef72cd3c5c ethtool: fix error paths in ethnl_set_channels()
Fix two error paths in ethnl_set_channels() to avoid lock-up caused
but unreleased RTNL.

Fixes: e19c591eaf ("ethtool: set device channel counts with CHANNELS_SET request")
Reported-by: LiLiang <liali@redhat.com>
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Link: https://lore.kernel.org/r/20201215090810.801777-1-ivecera@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-16 13:27:17 -08:00
Hoang Le
c32c928d29 tipc: do sanity check payload of a netlink message
When we initialize nlmsghdr with no payload inside tipc_nl_compat_dumpit()
the parsing function returns -EINVAL. We fix it by making the parsing call
conditional.

Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Link: https://lore.kernel.org/r/20201215033151.76139-1-hoang.h.le@dektech.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-16 12:45:02 -08:00
Linus Torvalds
48aba79bcf for-5.11/io_uring-2020-12-14
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/XeDUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnF9D/4+l1r1G5AcsSsgEvu1aCjP83LLWrHIAA5+
 ca3OY6vwOjBvqI7oOoPcYJeYJ9uuGGQc31tDFJtP6Sl6Gk31AB4iSddyrowaX+t+
 UJyJNfsgWKiLjY48EyQJ0gIqjuvPq8hPGMGClJb1A7+w87fqBC5UwCWEnJmE7MaX
 401kIw0CRVWYTnDEOYxToss6D6gQ30E8UZjdJ0cG4g8xVQBY2kKwYR3F9tDlAwsY
 CF+RCKpibcKwnaNZJBL67ClWjj1hC0ivg0O0G+W1UYysesKKdWFRI2rmxvH55K5T
 7tHlfVuVPladNmlLVNZnCvyqBrFHyAZPmOsdv3xQOvJ7pZPaxKV9xIYryQKZW4H4
 9tKkj3T1aop/fDGqIMxgymZsWW+1vvxAmM+7WkdOPHwHRSakJ5wGIj6Ekpton+5y
 aixJUFq390o/o+S8PDO7mgzdvYrasv3iLl5UxnIcU3rq30wxnRKit4vUZny8DlzF
 gOTw7QSocximhGYci+Uz4d4/XdK2CHc6eZDkQDltgJXxIrdsrN0qKxMCEsMKgCR1
 RMiDv+52MP6kp/wpXiOHQF25YRnUOW0qfEjWKK6Ye28DGuKPPuIXtN/BUD3rjdIc
 IJX3lDfOI3PgXNX24nOarucrF+ootyRmE6tGTVZhCVBhUXGR+MGatGfkeCqnmNzZ
 gny2+UrGIQ==
 =ly9V
 -----END PGP SIGNATURE-----

Merge tag 'for-5.11/io_uring-2020-12-14' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
 "Fairly light set of changes this time around, and mostly some bits
  that were pushed out to 5.11 instead of 5.10, fixes/cleanups, and a
  few features. In particular:

   - Cleanups around iovec import (David Laight, Pavel)

   - Add timeout support for io_uring_enter(2), which enables us to
     clean up liburing and avoid a timeout sqe submission in the
     completion path.

     The big win here is that it allows setups that split SQ and CQ
     handling into separate threads to avoid locking, as the CQ side
     will no longer submit when timeouts are needed when waiting for
     events (Hao Xu)

   - Add support for socket shutdown, and renameat/unlinkat.

   - SQPOLL cleanups and improvements (Xiaoguang Wang)

   - Allow SQPOLL setups for CAP_SYS_NICE, and enable regular
     (non-fixed) files to be used.

   - Cancelation improvements (Pavel)

   - Fixed file reference improvements (Pavel)

   - IOPOLL related race fixes (Pavel)

   - Lots of other little fixes and cleanups (mostly Pavel)"

* tag 'for-5.11/io_uring-2020-12-14' of git://git.kernel.dk/linux-block: (43 commits)
  io_uring: fix io_cqring_events()'s noflush
  io_uring: fix racy IOPOLL flush overflow
  io_uring: fix racy IOPOLL completions
  io_uring: always let io_iopoll_complete() complete polled io
  io_uring: add timeout update
  io_uring: restructure io_timeout_cancel()
  io_uring: fix files cancellation
  io_uring: use bottom half safe lock for fixed file data
  io_uring: fix miscounting ios_left
  io_uring: change submit file state invariant
  io_uring: check kthread stopped flag when sq thread is unparked
  io_uring: share fixed_file_refs b/w multiple rsrcs
  io_uring: replace inflight_wait with tctx->wait
  io_uring: don't take fs for recvmsg/sendmsg
  io_uring: only wake up sq thread while current task is in io worker context
  io_uring: don't acquire uring_lock twice
  io_uring: initialize 'timeout' properly in io_sq_thread()
  io_uring: refactor io_sq_thread() handling
  io_uring: always batch cancel in *cancel_files()
  io_uring: pass files into kill timeouts/poll
  ...
2020-12-16 12:44:05 -08:00
Lijun Pan
7061eb8cfa net: core: introduce __netdev_notify_peers
There are some use cases for netdev_notify_peers in the context
when rtnl lock is already held. Introduce lockless version
of netdev_notify_peers call to save the extra code to call
	call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, dev);
	call_netdevice_notifiers(NETDEV_RESEND_IGMP, dev);
After that, convert netdev_notify_peers to call the new helper.

Suggested-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Lijun Pan <ljp@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-16 11:43:25 -08:00
Michal Kubecek
efb796f557 ethtool: fix string set id check
Syzbot reported a shift of a u32 by more than 31 in strset_parse_request()
which is undefined behavior. This is caused by range check of string set id
using variable ret (which is always 0 at this point) instead of id (string
set id from request).

Fixes: 71921690f9 ("ethtool: provide string sets with STRSET_GET request")
Reported-by: syzbot+96523fb438937cd01220@syzkaller.appspotmail.com
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Link: https://lore.kernel.org/r/b54ed5c5fd972a59afea3e1badfb36d86df68799.1607952208.git.mkubecek@suse.cz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-16 11:15:11 -08:00
Linus Torvalds
ca5b877b6c selinux/stable-5.11 PR 20201214
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl/YBtEUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNnwA/9Ek8DG/1t8CEoJxpoRvwovQxNo+bi
 0rCT9vqvx9PeCwoZi/0Vp6oKmpE1HADvbeB/+e00VrbLYnzE3oRY6VkpjoZRofKS
 vc0/MzHSFxFUR1OTHwCefcXlPLK+bfitQbX5jEMeVyQCXNXXIrN7CnJf1LmCeLTR
 kQBPlEN9lt7HyNVAi34FhOD/TQbWnFHgl2z5puffgri6cWnc+TALKMYytUZ+rYex
 NYndDJW5b3g5kTat2eErn0FruxfzloGs0xMIiWb+z2i9kl41D+dkKPdAN7idqCSC
 Jv0nJP/bDftzA0wOe9szmGaLQzu7YnCN5kiWcSspatZVnon42Cy/tp9tiuPGLRFU
 XtelDfpyX6o3CLN0tX7LQEO+GYxPzvM6iaR2OrsChWPozUIIR3TLQg7jJN4bvNKl
 TR6gCGZCoAeS5JLNGjzVKxT/oKQY+tCLLlYXQdQY6swNFi3EKmPr+K1D9lgm98fO
 f3d1QmWiZZNmtxxoVogT0qoQYjkfgpnm3dVx813Vt+lwHlVpHGMEPpO27iD3/RYb
 w2yWOJaGKwMD8iL0l+Cm6CPW0/nE5FFISQjWgC8b4Vgxlyan6+L9eViqGICkrUQ2
 Edo0i1YFFZ4utHYkDf1VYBbJ+36KyCtdktgLAcbgnePiPB3E1XBsXTIIStSUIbVQ
 iEbTkBlsCG4GIeU=
 =6Cqb
 -----END PGP SIGNATURE-----

Merge tag 'selinux-pr-20201214' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux

Pull selinux updates from Paul Moore:
 "While we have a small number of SELinux patches for v5.11, there are a
  few changes worth highlighting:

   - Change the LSM network hooks to pass flowi_common structs instead
     of the parent flowi struct as the LSMs do not currently need the
     full flowi struct and they do not have enough information to use it
     safely (missing information on the address family).

     This patch was discussed both with Herbert Xu (representing team
     netdev) and James Morris (representing team
     LSMs-other-than-SELinux).

   - Fix how we handle errors in inode_doinit_with_dentry() so that we
     attempt to properly label the inode on following lookups instead of
     continuing to treat it as unlabeled.

   - Tweak the kernel logic around allowx, auditallowx, and dontauditx
     SELinux policy statements such that the auditx/dontauditx are
     effective even without the allowx statement.

  Everything passes our test suite"

* tag 'selinux-pr-20201214' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
  lsm,selinux: pass flowi_common instead of flowi to the LSM hooks
  selinux: Fix fall-through warnings for Clang
  selinux: drop super_block backpointer from superblock_security_struct
  selinux: fix inode_doinit_with_dentry() LABEL_INVALID error handling
  selinux: allow dontauditx and auditallowx rules to take effect without allowx
  selinux: fix error initialization in inode_doinit_with_dentry()
2020-12-16 11:01:04 -08:00
Colin Ian King
161b838e25 netfilter: nftables: fix incorrect increment of loop counter
The intention of the err_expr cleanup path is to iterate over the
allocated expr_array objects and free them, starting from i - 1 and
working down to the start of the array. Currently the loop counter
is being incremented instead of decremented and also the index i is
being used instead of k, repeatedly destroying the same expr_array
element.  Fix this by decrementing k and using k as the index into
expr_array.

Addresses-Coverity: ("Infinite loop")
Fixes: 8cfd9b0f85 ("netfilter: nftables: generalize set expressions support")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-16 11:02:10 +01:00
Linus Torvalds
1a50ede2b3 Highlights:
- Improve support for re-exporting NFS mounts
 - Replace NFSv4 XDR decoding C macros with xdr_stream helpers
 - Support for multiple RPC/RDMA chunks per RPC transaction
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAl/Q4dIACgkQM2qzM29m
 f5fInw//eDrmXBEhxbzcgeqNilGU5Qkn4INJtAcOGwPcw5Kjp4UVNGFpZNPqIDSf
 FP0Yw0d/rW7UggwCviPcs/adLTasU9skq1jgAv8d0ig4DtPbeqFo6BvbY+G2JxVF
 EfTeHzr6w6er8HRqyuLN4hjm1rQIpQlDHaYU4QcMs4fjPVv88eYLiwnYGYf3X46i
 vBYstu1IRxHhg2x4O833xmiL6VbkZDQoWwDjGICylxUBcNUtAmq/sETjTa4JVEJj
 4vgXdcJmAFjNgAOrmoR3DISsr9mvCvKN9g3C0+hHiRERTGEon//HzvscWH74wT48
 o0LUW0ZWgpmunTcmiSNeeiHNsUXJyy3A/xyEdteqqnvSxulxlqkQzb15Eb+92+6n
 BHGT/sOz1zz+/l9NCpdeEl5AkSA9plV8Iqd/kzwFwe1KwHMjldeMw/mhMut8EM2j
 b54EMsp40ipITAwBHvcygCXiWAn/mPex6bCr17Dijo6MsNLsyd+cDsazntbNzwz3
 RMGMf2TPOi8tWswrTUS9J5xKk5LAEWX/6Z/hTA1YlsB3PfrhXO97ztrytxvoO/bp
 M0NREA+NNMn/JyyL8FT3ID5peaLVHhA1GHw9CcUw3C7OVzmsEg29D4zNo02dF1TC
 LIyekp0kbSGGY1jLOeMLsa6Jr+2+40CcctsooVkRA+3rN0tJQvw=
 =1uP3
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.11' of git://git.linux-nfs.org/projects/cel/cel-2.6

Pull nfsd updates from Chuck Lever:
 "Several substantial changes this time around:

   - Previously, exporting an NFS mount via NFSD was considered to be an
     unsupported feature. With v5.11, the community has attempted to
     make re-exporting a first-class feature of NFSD.

     This would enable the Linux in-kernel NFS server to be used as an
     intermediate cache for a remotely-located primary NFS server, for
     example, even with other NFS server implementations, like a NetApp
     filer, as the primary.

   - A short series of patches brings support for multiple RPC/RDMA data
     chunks per RPC transaction to the Linux NFS server's RPC/RDMA
     transport implementation.

     This is a part of the RPC/RDMA spec that the other premiere
     NFS/RDMA implementation (Solaris) has had for a very long time, and
     completes the implementation of RPC/RDMA version 1 in the Linux
     kernel's NFS server.

   - Long ago, NFSv4 support was introduced to NFSD using a series of C
     macros that hid dprintk's and goto's. Over time, the kernel's XDR
     implementation has been greatly improved, but these C macros have
     remained and become fallow. A series of patches in this pull
     request completely replaces those macros with the use of current
     kernel XDR infrastructure. Benefits include:

       - More robust input sanitization in NFSD's NFSv4 XDR decoders.

       - Make it easier to use common kernel library functions that use
         XDR stream APIs (for example, GSS-API).

       - Align the structure of the source code with the RFCs so it is
         easier to learn, verify, and maintain our XDR implementation.

       - Removal of more than a hundred hidden dprintk() call sites.

       - Removal of some explicit manipulation of pages to help make the
         eventual transition to xdr->bvec smoother.

   - On top of several related fixes in 5.10-rc, there are a few more
     fixes to get the Linux NFSD implementation of NFSv4.2 inter-server
     copy up to speed.

  And as usual, there is a pinch of seasoning in the form of a
  collection of unrelated minor bug fixes and clean-ups.

  Many thanks to all who contributed this time around!"

* tag 'nfsd-5.11' of git://git.linux-nfs.org/projects/cel/cel-2.6: (131 commits)
  nfsd: Record NFSv4 pre/post-op attributes as non-atomic
  nfsd: Set PF_LOCAL_THROTTLE on local filesystems only
  nfsd: Fix up nfsd to ensure that timeout errors don't result in ESTALE
  exportfs: Add a function to return the raw output from fh_to_dentry()
  nfsd: close cached files prior to a REMOVE or RENAME that would replace target
  nfsd: allow filesystems to opt out of subtree checking
  nfsd: add a new EXPORT_OP_NOWCC flag to struct export_operations
  Revert "nfsd4: support change_attr_type attribute"
  nfsd4: don't query change attribute in v2/v3 case
  nfsd: minor nfsd4_change_attribute cleanup
  nfsd: simplify nfsd4_change_info
  nfsd: only call inode_query_iversion in the I_VERSION case
  nfs_common: need lock during iterate through the list
  NFSD: Fix 5 seconds delay when doing inter server copy
  NFSD: Fix sparse warning in nfs4proc.c
  SUNRPC: Remove XDRBUF_SPARSE_PAGES flag in gss_proxy upcall
  sunrpc: clean-up cache downcall
  nfsd: Fix message level for normal termination
  NFSD: Remove macros that are no longer used
  NFSD: Replace READ* macros in nfsd4_decode_compound()
  ...
2020-12-15 18:52:30 -08:00
Trond Myklebust
edffb84cc8 NFSoRDmA Client updates for Linux 5.11
Cleanups and improvements:
   - Remove use of raw kernel memory addresses in tracepoints
   - Replace dprintk() call sites in ERR_CHUNK path
   - Trace unmap sync calls
   - Optimize MR DMA-unmapping
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl/SlTcACgkQ18tUv7Cl
 QOsc+xAA0qmDLbdZShFAiip/jgFvHkoIJzVcil5++xiRY77xOjR2rgkgBSv4hFYm
 XWBidP/eQm3n5r0S+LG0zudcHnaKBonZ0UV2j32PelMEnvn3H9qlSJYneEm9xv2O
 K1koNcbunJH8JRsLxbStdMnOnNwCmB+HzIjnWr87OgXkYucqDBBxt3RMyZVD46QU
 GypnItAtLns4Oacsw0TyPAPAxstjZ71YRlTMhtfyIYEJfvxUVRYRKvs7nPei9eCa
 0uCzoKb28F7CWy+S9so7wSjfdDu4G+FmAfMvJbQF4irhgZBvzyzj4jq4hCnv377S
 XF6Eygm0Z4BSWNoAnRjgLx3Bo4ps4qZi7iAj8cUAksEKQ3QI2BWTWR3e07YwLfnm
 iwhWougP76zbtC3lNA+4D2yaYwA1LtN4IYp19KmS/0LyRe5EBry39ccE/eE508JA
 BGnCohawI9ya7WT3xTne9lyxNGhjm/jNDKt0ax0ze4hwIqSGekqhUzxQ0bl4suQz
 ZHP2gUEad07jZyy8gCcpEvvxA3fFW241vaG/hN34Dy47eHpAQDQIMk7611YRQXH/
 jCBHoym1dHXc8eNa+gTmrjXdhVdgHo4zI0ppeaQvA2Ss0nPp8N8H2nJV/as536Dp
 on0b1zAF4o5uzGlEzHu+FuFGqUPE9My9/dNDqw3o0Cncu7adffk=
 =R45y
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-for-5.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfs into linux-next

NFSoRDmA Client updates for Linux 5.11

Cleanups and improvements:
  - Remove use of raw kernel memory addresses in tracepoints
  - Replace dprintk() call sites in ERR_CHUNK path
  - Trace unmap sync calls
  - Optimize MR DMA-unmapping

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-15 20:08:41 -05:00
Linus Torvalds
d635a69dd4 Networking updates for 5.11
Core:
 
  - support "prefer busy polling" NAPI operation mode, where we defer softirq
    for some time expecting applications to periodically busy poll
 
  - AF_XDP: improve efficiency by more batching and hindering
            the adjacency cache prefetcher
 
  - af_packet: make packet_fanout.arr size configurable up to 64K
 
  - tcp: optimize TCP zero copy receive in presence of partial or unaligned
         reads making zero copy a performance win for much smaller messages
 
  - XDP: add bulk APIs for returning / freeing frames
 
  - sched: support fragmenting IP packets as they come out of conntrack
 
  - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs
 
 BPF:
 
  - BPF switch from crude rlimit-based to memcg-based memory accounting
 
  - BPF type format information for kernel modules and related tracing
    enhancements
 
  - BPF implement task local storage for BPF LSM
 
  - allow the FENTRY/FEXIT/RAW_TP tracing programs to use bpf_sk_storage
 
 Protocols:
 
  - mptcp: improve multiple xmit streams support, memory accounting and
           many smaller improvements
 
  - TLS: support CHACHA20-POLY1305 cipher
 
  - seg6: add support for SRv6 End.DT4/DT6 behavior
 
  - sctp: Implement RFC 6951: UDP Encapsulation of SCTP
 
  - ppp_generic: add ability to bridge channels directly
 
  - bridge: Connectivity Fault Management (CFM) support as is defined in
            IEEE 802.1Q section 12.14.
 
 Drivers:
 
  - mlx5: make use of the new auxiliary bus to organize the driver internals
 
  - mlx5: more accurate port TX timestamping support
 
  - mlxsw:
    - improve the efficiency of offloaded next hop updates by using
      the new nexthop object API
    - support blackhole nexthops
    - support IEEE 802.1ad (Q-in-Q) bridging
 
  - rtw88: major bluetooth co-existance improvements
 
  - iwlwifi: support new 6 GHz frequency band
 
  - ath11k: Fast Initial Link Setup (FILS)
 
  - mt7915: dual band concurrent (DBDC) support
 
  - net: ipa: add basic support for IPA v4.5
 
 Refactor:
 
  - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej Siewior
 
  - phy: add support for shared interrupts; get rid of multiple driver
         APIs and have the drivers write a full IRQ handler, slight growth
 	of driver code should be compensated by the simpler API which
 	also allows shared IRQs
 
  - add common code for handling netdev per-cpu counters
 
  - move TX packet re-allocation from Ethernet switch tag drivers to
    a central place
 
  - improve efficiency and rename nla_strlcpy
 
  - number of W=1 warning cleanups as we now catch those in a patchwork
    build bot
 
 Old code removal:
 
  - wan: delete the DLCI / SDLA drivers
 
  - wimax: move to staging
 
  - wifi: remove old WDS wifi bridging support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl/YXmUACgkQMUZtbf5S
 IrvSQBAAgOrt4EFopEvVqlTHZbqI45IEqgtXS+YWmlgnjZCgshyMj8q1yK1zzane
 qYxr/NNJ9kV3FdtaynmmHPgEEEfR5kJ/D3B2BsxYDkaDDrD0vbNsBGw+L+/Gbhxl
 N/5l/9FjLyLY1D+EErknuwR5XGuQ6BSDVaKQMhYOiK2hgdnAAI4hszo8Chf6wdD0
 XDBslQ7vpD/05r+eMj0IkS5dSAoGOIFXUxhJ5dqrDbRHiKsIyWqA3PLbYemfAhxI
 s2XckjfmSgGE3FKL8PSFu+EcfHbJQQjLcULJUnqgVcdwEEtRuE9ggEi52nZRXMWM
 4e8sQJAR9Fx7pZy0G1xfS149j6iPU5LjRlU9TNSpVABz14Vvvo3gEL6gyIdsz+xh
 hMN7UBdp0FEaP028CXoIYpaBesvQqj0BSndmee8qsYAtN6j+QKcM2AOSr7JN1uMH
 C/86EDoGAATiEQIVWJvnX5MPmlAoblyLA+RuVhmxkIBx2InGXkFmWqRkXT5l4jtk
 LVl8/TArR4alSQqLXictXCjYlCm9j5N4zFFtEVasSYi7/ZoPfgRNWT+lJ2R8Y+Zv
 +htzGaFuyj6RJTVeFQMrkl3whAtBamo2a0kwg45NnxmmXcspN6kJX1WOIy82+MhD
 Yht7uplSs7MGKA78q/CDU0XBeGjpABUvmplUQBIfrR/jKLW2730=
 =GXs1
 -----END PGP SIGNATURE-----

Merge tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - support "prefer busy polling" NAPI operation mode, where we defer
     softirq for some time expecting applications to periodically busy
     poll

   - AF_XDP: improve efficiency by more batching and hindering the
     adjacency cache prefetcher

   - af_packet: make packet_fanout.arr size configurable up to 64K

   - tcp: optimize TCP zero copy receive in presence of partial or
     unaligned reads making zero copy a performance win for much smaller
     messages

   - XDP: add bulk APIs for returning / freeing frames

   - sched: support fragmenting IP packets as they come out of conntrack

   - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs

  BPF:

   - BPF switch from crude rlimit-based to memcg-based memory accounting

   - BPF type format information for kernel modules and related tracing
     enhancements

   - BPF implement task local storage for BPF LSM

   - allow the FENTRY/FEXIT/RAW_TP tracing programs to use
     bpf_sk_storage

  Protocols:

   - mptcp: improve multiple xmit streams support, memory accounting and
     many smaller improvements

   - TLS: support CHACHA20-POLY1305 cipher

   - seg6: add support for SRv6 End.DT4/DT6 behavior

   - sctp: Implement RFC 6951: UDP Encapsulation of SCTP

   - ppp_generic: add ability to bridge channels directly

   - bridge: Connectivity Fault Management (CFM) support as is defined
     in IEEE 802.1Q section 12.14.

  Drivers:

   - mlx5: make use of the new auxiliary bus to organize the driver
     internals

   - mlx5: more accurate port TX timestamping support

   - mlxsw:
      - improve the efficiency of offloaded next hop updates by using
        the new nexthop object API
      - support blackhole nexthops
      - support IEEE 802.1ad (Q-in-Q) bridging

   - rtw88: major bluetooth co-existance improvements

   - iwlwifi: support new 6 GHz frequency band

   - ath11k: Fast Initial Link Setup (FILS)

   - mt7915: dual band concurrent (DBDC) support

   - net: ipa: add basic support for IPA v4.5

  Refactor:

   - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej
     Siewior

   - phy: add support for shared interrupts; get rid of multiple driver
     APIs and have the drivers write a full IRQ handler, slight growth
     of driver code should be compensated by the simpler API which also
     allows shared IRQs

   - add common code for handling netdev per-cpu counters

   - move TX packet re-allocation from Ethernet switch tag drivers to a
     central place

   - improve efficiency and rename nla_strlcpy

   - number of W=1 warning cleanups as we now catch those in a patchwork
     build bot

  Old code removal:

   - wan: delete the DLCI / SDLA drivers

   - wimax: move to staging

   - wifi: remove old WDS wifi bridging support"

* tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits)
  net: hns3: fix expression that is currently always true
  net: fix proc_fs init handling in af_packet and tls
  nfc: pn533: convert comma to semicolon
  af_vsock: Assign the vsock transport considering the vsock address flags
  af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path
  vsock_addr: Check for supported flag values
  vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag
  vm_sockets: Add flags field in the vsock address data structure
  net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
  tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
  net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context
  nfc: s3fwrn5: Release the nfc firmware
  net: vxget: clean up sparse warnings
  mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router
  mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3
  mlxsw: spectrum_router_xm: Introduce basic XM cache flushing
  mlxsw: reg: Add Router LPM Cache Enable Register
  mlxsw: reg: Add Router LPM Cache ML Delete Register
  mlxsw: spectrum_router_xm: Implement L-value tracking for M-index
  mlxsw: reg: Add XM Router M Table Register
  ...
2020-12-15 13:22:29 -08:00
Yonatan Linik
a268e0f245 net: fix proc_fs init handling in af_packet and tls
proc_fs was used, in af_packet, without a surrounding #ifdef,
although there is no hard dependency on proc_fs.
That caused the initialization of the af_packet module to fail
when CONFIG_PROC_FS=n.

Specifically, proc_create_net() was used in af_packet.c,
and when it fails, packet_net_init() returns -ENOMEM.
It will always fail when the kernel is compiled without proc_fs,
because, proc_create_net() for example always returns NULL.

The calling order that starts in af_packet.c is as follows:
packet_init()
register_pernet_subsys()
register_pernet_operations()
__register_pernet_operations()
ops_init()
ops->init() (packet_net_ops.init=packet_net_init())
proc_create_net()

It worked in the past because register_pernet_subsys()'s return value
wasn't checked before this Commit 36096f2f4f ("packet: Fix error path in
packet_init.").
It always returned an error, but was not checked before, so everything
was working even when CONFIG_PROC_FS=n.

The fix here is simply to add the necessary #ifdef.

This also fixes a similar error in tls_proc.c, that was found by Jakub
Kicinski.

Fixes: d26b698dd3 ("net/tls: add skeleton of MIB statistics")
Fixes: 36096f2f4f ("packet: Fix error path in packet_init")
Signed-off-by: Yonatan Linik <yonatanlinik@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 19:39:30 -08:00
Andra Paraschiv
7f816984f4 af_vsock: Assign the vsock transport considering the vsock address flags
The vsock flags field can be set in the connect path (user space app)
and the (listen) receive path (kernel space logic).

When the vsock transport is assigned, the remote CID is used to
distinguish between types of connection.

Use the vsock flags value (in addition to the CID) from the remote
address to decide which vsock transport to assign. For the sibling VMs
use case, all the vsock packets need to be forwarded to the host, so
always assign the guest->host transport if the VMADDR_FLAG_TO_HOST flag
is set. For the other use cases, the vsock transport assignment logic is
not changed.

Changelog

v3 -> v4

* Update the "remote_flags" local variable type to reflect the change of
  the "svm_flags" field to be 1 byte in size.

v2 -> v3

* Update bitwise check logic to not compare result to the flag value.

v1 -> v2

* Use bitwise operator to check the vsock flag.
* Use the updated "VMADDR_FLAG_TO_HOST" flag naming.
* Merge the checks for the g2h transport assignment in one "if" block.

Signed-off-by: Andra Paraschiv <andraprs@amazon.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 19:33:39 -08:00
Andra Paraschiv
1b5f2ab98e af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path
The vsock flags can be set during the connect() setup logic, when
initializing the vsock address data structure variable. Then the vsock
transport is assigned, also considering this flags field.

The vsock transport is also assigned on the (listen) receive path. The
flags field needs to be set considering the use case.

Set the value of the vsock flags of the remote address to the one
targeted for packets forwarding to the host, if the following conditions
are met:

* The source CID of the packet is higher than VMADDR_CID_HOST.
* The destination CID of the packet is higher than VMADDR_CID_HOST.

Changelog

v3 -> v4

* No changes.

v2 -> v3

* No changes.

v1 -> v2

* Set the vsock flag on the receive path in the vsock transport
  assignment logic.
* Use bitwise operator for the vsock flag setup.
* Use the updated "VMADDR_FLAG_TO_HOST" flag naming.

Signed-off-by: Andra Paraschiv <andraprs@amazon.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 19:33:39 -08:00
Andra Paraschiv
cada7ccd9d vsock_addr: Check for supported flag values
Check if the provided flags value from the vsock address data structure
includes the supported flags in the corresponding kernel version.

The first byte of the "svm_zero" field is used as "svm_flags", so add
the flags check instead.

Changelog

v3 -> v4

* New patch in v4.

Signed-off-by: Andra Paraschiv <andraprs@amazon.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 19:33:39 -08:00
Tariq Toukan
ae0b04b238 net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
With NETIF_F_HW_TLS_TX packets are encrypted in HW. This cannot be
logically done when HW_CSUM offload is off.

Fixes: 2342a8512a ("net: Add TLS TX offload features")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Link: https://lore.kernel.org/r/20201213143929.26253-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 19:31:36 -08:00
Alexander Duyck
c31b70c996 tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
There are cases where a fastopen SYN may trigger either a ICMP_TOOBIG
message in the case of IPv6 or a fragmentation request in the case of
IPv4. This results in the socket stalling for a second or more as it does
not respond to the message by retransmitting the SYN frame.

Normally a SYN frame should not be able to trigger a ICMP_TOOBIG or
ICMP_FRAG_NEEDED however in the case of fastopen we can have a frame that
makes use of the entire MSS. In the case of fastopen it does, and an
additional complication is that the retransmit queue doesn't contain the
original frames. As a result when tcp_simple_retransmit is called and
walks the list of frames in the queue it may not mark the frames as lost
because both the SYN and the data packet each individually are smaller than
the MSS size after the adjustment. This results in the socket being stalled
until the retransmit timer kicks in and forces the SYN frame out again
without the data attached.

In order to resolve this we can reduce the MSS the packets are compared
to in tcp_simple_retransmit to -1 for cases where we are still in the
TCP_SYN_SENT state for a fastopen socket. Doing this we will mark all of
the packets related to the fastopen SYN as lost.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Link: https://lore.kernel.org/r/160780498125.3272.15437756269539236825.stgit@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 19:29:55 -08:00
Vasily Averin
54970a2fbb net: drop bogus skb with CHECKSUM_PARTIAL and offset beyond end of trimmed packet
syzbot reproduces BUG_ON in skb_checksum_help():
tun creates (bogus) skb with huge partial-checksummed area and
small ip packet inside. Then ip_rcv trims the skb based on size
of internal ip packet, after that csum offset points beyond of
trimmed skb. Then checksum_tg() called via netfilter hook
triggers BUG_ON:

        offset = skb_checksum_start_offset(skb);
        BUG_ON(offset >= skb_headlen(skb));

To work around the problem this patch forces pskb_trim_rcsum_slow()
to return -EINVAL in described scenario. It allows its callers to
drop such kind of packets.

Link: https://syzkaller.appspot.com/bug?id=b419a5ca95062664fe1a60b764621eb4526e2cd0
Reported-by: syzbot+7010af67ced6105e5ab6@syzkaller.appspotmail.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/1b2494af-2c56-8ee2-7bc0-923fcad1cdf8@virtuozzo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 18:41:01 -08:00
Linus Torvalds
adb35e8dc9 Scheduler updates:
- migrate_disable/enable() support which originates from the RT tree and
    is now a prerequisite for the new preemptible kmap_local() API which aims
    to replace kmap_atomic().
 
  - A fair amount of topology and NUMA related improvements
 
  - Improvements for the frequency invariant calculations
 
  - Enhanced robustness for the global CPU priority tracking and decision
    making
 
  - The usual small fixes and enhancements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XwK4THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoX28D/9cVrvziSQGfBfuQWnUiw8iOIq1QBa2
 Me+Tvenhfrlt7xU6rbP9ciFu7eTN+fS06m5uQPGI+t22WuJmHzbmw1bJVXfkvYfI
 /QoU+Hg7DkDAn1p7ZKXh0dRkV0nI9ixxSHl0E+Zf1ATBxCUMV2SO85flg6z/4qJq
 3VWUye0dmR7/bhtkIjv5rwce9v2JB2g1AbgYXYTW9lHVoUdGoMSdiZAF4tGyHLnx
 sJ6DMqQ+k+dmPyYO0z5MTzjW/fXit4n9w2e3z9TvRH/uBu58WSW1RBmQYX6aHBAg
 dhT9F4lvTs6lJY23x5RSFWDOv6xAvKF5a0xfb8UZcyH5EoLYrPRvm42a0BbjdeRa
 u0z7LbwIlKA+RFdZzFZWz8UvvO0ljyMjmiuqZnZ5dY9Cd80LSBuxrWeQYG0qg6lR
 Y2povhhCepEG+q8AXIe2YjHKWKKC1s/l/VY3CNnCzcd21JPQjQ4Z5eWGmHif5IED
 CntaeFFhZadR3w02tkX35zFmY3w4soKKrbI4EKWrQwd+cIEQlOSY7dEPI/b5BbYj
 MWAb3P4EG9N77AWTNmbhK4nN0brEYb+rBbCA+5dtNBVhHTxAC7OTWElJOC2O66FI
 e06dREjvwYtOkRUkUguWwErbIai2gJ2MH0VILV3hHoh64oRk7jjM8PZYnjQkdptQ
 Gsq0rJW5iiu/OQ==
 =Oz1V
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Thomas Gleixner:

 - migrate_disable/enable() support which originates from the RT tree
   and is now a prerequisite for the new preemptible kmap_local() API
   which aims to replace kmap_atomic().

 - A fair amount of topology and NUMA related improvements

 - Improvements for the frequency invariant calculations

 - Enhanced robustness for the global CPU priority tracking and decision
   making

 - The usual small fixes and enhancements all over the place

* tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
  sched/fair: Trivial correction of the newidle_balance() comment
  sched/fair: Clear SMT siblings after determining the core is not idle
  sched: Fix kernel-doc markup
  x86: Print ratio freq_max/freq_base used in frequency invariance calculations
  x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
  x86, sched: Calculate frequency invariance for AMD systems
  irq_work: Optimize irq_work_single()
  smp: Cleanup smp_call_function*()
  irq_work: Cleanup
  sched: Limit the amount of NUMA imbalance that can exist at fork time
  sched/numa: Allow a floating imbalance between NUMA nodes
  sched: Avoid unnecessary calculation of load imbalance at clone time
  sched/numa: Rename nr_running and break out the magic number
  sched: Make migrate_disable/enable() independent of RT
  sched/topology: Condition EAS enablement on FIE support
  arm64: Rebuild sched domains on invariance status changes
  sched/topology,schedutil: Wrap sched domains rebuild
  sched/uclamp: Allow to reset a task uclamp constraint value
  sched/core: Fix typos in comments
  Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug
  ...
2020-12-14 18:29:11 -08:00
Wang Hai
989a1db06e net: bridge: Fix a warning when del bridge sysfs
I got a warining report:

br_sysfs_addbr: can't create group bridge4/bridge
------------[ cut here ]------------
sysfs group 'bridge' not found for kobject 'bridge4'
WARNING: CPU: 2 PID: 9004 at fs/sysfs/group.c:279 sysfs_remove_group fs/sysfs/group.c:279 [inline]
WARNING: CPU: 2 PID: 9004 at fs/sysfs/group.c:279 sysfs_remove_group+0x153/0x1b0 fs/sysfs/group.c:270
Modules linked in: iptable_nat
...
Call Trace:
  br_dev_delete+0x112/0x190 net/bridge/br_if.c:384
  br_dev_newlink net/bridge/br_netlink.c:1381 [inline]
  br_dev_newlink+0xdb/0x100 net/bridge/br_netlink.c:1362
  __rtnl_newlink+0xe11/0x13f0 net/core/rtnetlink.c:3441
  rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3500
  rtnetlink_rcv_msg+0x385/0x980 net/core/rtnetlink.c:5562
  netlink_rcv_skb+0x134/0x3d0 net/netlink/af_netlink.c:2494
  netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
  netlink_unicast+0x4a0/0x6a0 net/netlink/af_netlink.c:1330
  netlink_sendmsg+0x793/0xc80 net/netlink/af_netlink.c:1919
  sock_sendmsg_nosec net/socket.c:651 [inline]
  sock_sendmsg+0x139/0x170 net/socket.c:671
  ____sys_sendmsg+0x658/0x7d0 net/socket.c:2353
  ___sys_sendmsg+0xf8/0x170 net/socket.c:2407
  __sys_sendmsg+0xd3/0x190 net/socket.c:2440
  do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

In br_device_event(), if the bridge sysfs fails to be added,
br_device_event() should return error. This can prevent warining
when removing bridge sysfs that do not exist.

Fixes: bb900b27a2 ("bridge: allow creating bridge devices with netlink")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Tested-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20201211122921.40386-1-wanghai38@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 18:27:49 -08:00
Paolo Abeni
15e6ca974b mptcp: let MPTCP create max size skbs
Currently the xmit path of the MPTCP protocol creates smaller-
than-max-size skbs, which is suboptimal for the performances.

There are a few things to improve:
- when coalescing to an existing skb, must clear the PUSH flag
- tcp_build_frag() expect the available space as an argument.
  When coalescing is enable MPTCP already subtracted the
  to-be-coalesced skb len. We must increment said argument
  accordingly.

Before:
./use_mptcp.sh netperf -H 127.0.0.1 -t TCP_STREAM
[...]
131072  16384  16384    30.00    24414.86

After:
./use_mptcp.sh netperf -H 127.0.0.1 -t TCP_STREAM
[...]
131072  16384  16384    30.05    28357.69

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 17:30:06 -08:00
Paolo Abeni
1bc7327b5f mptcp: pm: simplify select_local_address()
There is no need to unconditionally acquire the join list
lock, we can simply splice the join list into the subflow
list and traverse only the latter.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 17:30:06 -08:00
Florian Westphal
50c504a20a mptcp: parse and act on incoming FASTCLOSE option
parse the MPTCP FASTCLOSE subtype.

If provided key matches the local one, schedule the work queue to close
(with tcp reset) all subflows.

The MPTCP socket moves to closed state immediately.

Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 17:30:06 -08:00
Florian Westphal
049fe386d3 tcp: parse mptcp options contained in reset packets
Because TCP-level resets only affect the subflow, there is a MPTCP
option to indicate that the MPTCP-level connection should be closed
immediately without a mptcp-level fin exchange.

This is the 'MPTCP fast close option'.  It can be carried on ack
segments or TCP resets.  In the latter case, its needed to parse mptcp
options also for reset packets so that MPTCP can act accordingly.

Next patch will add receive side fastclose support in MPTCP.

Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 17:30:06 -08:00
Florian Westphal
ab82e996a1 mptcp: hold mptcp socket before calling tcp_done
When processing options from tcp reset path its possible that
tcp_done(ssk) drops the last reference on the mptcp socket which
results in use-after-free.

Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 17:30:06 -08:00
Geliang Tang
ba34c3de71 mptcp: use MPTCPOPT_HMAC_LEN macro
Use the macro MPTCPOPT_HMAC_LEN instead of a constant in struct
mptcp_options_received.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 17:30:06 -08:00
Geliang Tang
141694df65 mptcp: remove address when netlink flushes addrs
When the PM netlink flushes the addresses, invoke the remove address
function mptcp_nl_remove_subflow_and_signal_addr to remove the addresses
and the subflows. Since this function should not be invoked under lock,
move __flush_addrs out of the pernet->lock.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 17:30:06 -08:00
Nicolas Rybowski
3764b0c565 mptcp: attach subflow socket to parent cgroup
It has been observed that the kernel sockets created for the subflows
(except the first one) are not in the same cgroup as their parents.
That's because the additional subflows are created by kernel workers.

This is a problem with eBPF programs attached to the parent's
cgroup won't be executed for the children. But also with any other features
of CGroup linked to a sk.

This patch fixes this behaviour.

As the subflow sockets are created by the kernel, we can't use
'mem_cgroup_sk_alloc' because of the current context being the one of the
kworker. This is why we have to do low level memcg manipulation, if
required.

Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Nicolas Rybowski <nicolas.rybowski@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 17:30:06 -08:00
Eelco Chaudron
09d6217254 net: openvswitch: fix TTL decrement exception action execution
Currently, the exception actions are not processed correctly as the wrong
dataset is passed. This change fixes this, including the misleading
comment.

In addition, a check was added to make sure we work on an IPv4 packet,
and not just assume if it's not IPv6 it's IPv4.

This was all tested using OVS with patch,
https://patchwork.ozlabs.org/project/openvswitch/list/?series=21639,
applied and sending packets with a TTL of 1 (and 0), both with IPv4
and IPv6.

Fixes: 69929d4c49 ("net: openvswitch: fix TTL decrement action netlink message format")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/160733569860.3007.12938188180387116741.stgit@wsfd-netdev64.ntdv.lab.eng.bos.redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 17:18:25 -08:00
Linus Torvalds
f9b4240b07 fixes-v5.11
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCX9daOgAKCRCRxhvAZXjc
 ohPkAQChXUB2BAjtIzXlCkZoDBbzHHblm5DZ37oy/4xYFmAcEwEA5sw6dQqyGHnF
 GEP9def51HvXLpBV2BzNUGggo1SoGgQ=
 =w/cO
 -----END PGP SIGNATURE-----

Merge tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull misc fixes from Christian Brauner:
 "This contains several fixes which felt worth being combined into a
  single branch:

   - Use put_nsproxy() instead of open-coding it switch_task_namespaces()

   - Kirill's work to unify lifecycle management for all namespaces. The
     lifetime counters are used identically for all namespaces types.
     Namespaces may of course have additional unrelated counters and
     these are not altered. This work allows us to unify the type of the
     counters and reduces maintenance cost by moving the counter in one
     place and indicating that basic lifetime management is identical
     for all namespaces.

   - Peilin's fix adding three byte padding to Dmitry's
     PTRACE_GET_SYSCALL_INFO uapi struct to prevent an info leak.

   - Two smal patches to convert from the /* fall through */ comment
     annotation to the fallthrough keyword annotation which I had taken
     into my branch and into -next before df561f6688 ("treewide: Use
     fallthrough pseudo-keyword") made it upstream which fixed this
     tree-wide.

     Since I didn't want to invalidate all testing for other commits I
     didn't rebase and kept them"

* tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  nsproxy: use put_nsproxy() in switch_task_namespaces()
  sys: Convert to the new fallthrough notation
  signal: Convert to the new fallthrough notation
  time: Use generic ns_common::count
  cgroup: Use generic ns_common::count
  mnt: Use generic ns_common::count
  user: Use generic ns_common::count
  pid: Use generic ns_common::count
  ipc: Use generic ns_common::count
  uts: Use generic ns_common::count
  net: Use generic ns_common::count
  ns: Add a common refcount into ns_common
  ptrace: Prevent kernel-infoleak in ptrace_get_syscall_info()
2020-12-14 16:40:27 -08:00
Jakub Kicinski
7bca5021a4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

1) Missing dependencies in NFT_BRIDGE_REJECT, from Randy Dunlap.

2) Use atomic_inc_return() instead of atomic_add_return() in IPVS,
   from Yejune Deng.

3) Simplify check for overquota in xt_nfacct, from Kaixu Xia.

4) Move nfnl_acct_list away from struct net, from Miao Wang.

5) Pass actual sk in reject actions, from Jan Engelhardt.

6) Add timeout and protoinfo to ctnetlink destroy events,
   from Florian Westphal.

7) Four patches to generalize set infrastructure to support
   for multiple expressions per set element.

* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next:
  netfilter: nftables: netlink support for several set element expressions
  netfilter: nftables: generalize set extension to support for several expressions
  netfilter: nftables: move nft_expr before nft_set
  netfilter: nftables: generalize set expressions support
  netfilter: ctnetlink: add timeout and protoinfo to destroy events
  netfilter: use actual socket sk for REJECT action
  netfilter: nfnl_acct: remove data from struct net
  netfilter: Remove unnecessary conversion to bool
  ipvs: replace atomic_add_return()
  netfilter: nft_reject_bridge: fix build errors due to code movement
====================

Link: https://lore.kernel.org/r/20201212230513.3465-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 15:43:21 -08:00
Jakub Kicinski
a6b5e026e6 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-12-14

1) Expose bpf_sk_storage_*() helpers to iterator programs, from Florent Revest.

2) Add AF_XDP selftests based on veth devs to BPF selftests, from Weqaar Janjua.

3) Support for finding BTF based kernel attach targets through libbpf's
   bpf_program__set_attach_target() API, from Andrii Nakryiko.

4) Permit pointers on stack for helper calls in the verifier, from Yonghong Song.

5) Fix overflows in hash map elem size after rlimit removal, from Eric Dumazet.

6) Get rid of direct invocation of llc in BPF selftests, from Andrew Delgadillo.

7) Fix xsk_recvmsg() to reorder socket state check before access, from Björn Töpel.

8) Add new libbpf API helper to retrieve ring buffer epoll fd, from Brendan Jackman.

9) Batch of minor BPF selftest improvements all over the place, from Florian Lehner,
   KP Singh, Jiri Olsa and various others.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits)
  selftests/bpf: Add a test for ptr_to_map_value on stack for helper access
  bpf: Permits pointers on stack for helper calls
  libbpf: Expose libbpf ring_buffer epoll_fd
  selftests/bpf: Add set_attach_target() API selftest for module target
  libbpf: Support modules in bpf_program__set_attach_target() API
  selftests/bpf: Silence ima_setup.sh when not running in verbose mode.
  selftests/bpf: Drop the need for LLVM's llc
  selftests/bpf: fix bpf_testmod.ko recompilation logic
  samples/bpf: Fix possible hang in xdpsock with multiple threads
  selftests/bpf: Make selftest compilation work on clang 11
  selftests/bpf: Xsk selftests - adding xdpxceiver to .gitignore
  selftests/bpf: Drop tcp-{client,server}.py from Makefile
  selftests/bpf: Xsk selftests - Bi-directional Sockets - SKB, DRV
  selftests/bpf: Xsk selftests - Socket Teardown - SKB, DRV
  selftests/bpf: Xsk selftests - DRV POLL, NOPOLL
  selftests/bpf: Xsk selftests - SKB POLL, NOPOLL
  selftests/bpf: Xsk selftests framework
  bpf: Only provide bpf_sock_from_file with CONFIG_NET
  bpf: Return -ENOTSUPP when attaching to non-kernel BTF
  xsk: Validate socket state in xsk_recvmsg, prior touching socket members
  ...
====================

Link: https://lore.kernel.org/r/20201214214316.20642-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 15:34:36 -08:00
Ilya Dryomov
2f0df6cfa3 libceph: drop ceph_auth_{create,update}_authorizer()
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
ce287162d9 libceph, ceph: make use of __ceph_auth_get_authorizer() in msgr1
This shouldn't cause any functional changes.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
cd1a677cad libceph, ceph: implement msgr2.1 protocol (crc and secure modes)
Implement msgr2.1 wire protocol, available since nautilus 14.2.11
and octopus 15.2.5.  msgr2.0 wire protocol is not implemented -- it
has several security, integrity and robustness issues and therefore
considered deprecated.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
00498b9941 libceph: introduce connection modes and ms_mode option
msgr2 supports two connection modes: crc (plain) and secure (on-wire
encryption).  Connection mode is picked by server based on input from
client.

Introduce ms_mode option:

  ms_mode=legacy        - msgr1 (default)
  ms_mode=crc           - crc mode, if denied fail
  ms_mode=secure        - secure mode, if denied fail
  ms_mode=prefer-crc    - crc mode, if denied agree to secure mode
  ms_mode=prefer-secure - secure mode, if denied agree to crc mode

ms_mode affects all connections, we don't separate connections to mons
like it's done in userspace with ms_client_mode vs ms_mon_client_mode.

For now the default is legacy, to be flipped to prefer-crc after some
time.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
313771e80f libceph, rbd: ignore addr->type while comparing in some cases
For libceph, this ensures that libceph instance sharing (share option)
continues to work.  For rbd, this avoids blocklisting alive lock owners
(locker addr is always LEGACY, while watcher addr is ANY in nautilus).

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
a5cbd5fc22 libceph, ceph: get and handle cluster maps with addrvecs
In preparation for msgr2, make the cluster send us maps with addrvecs
including both LEGACY and MSGR2 addrs instead of a single LEGACY addr.
This means advertising support for SERVER_NAUTILUS and also some older
features: SERVER_MIMIC, MONENC and MONNAMES.

MONNAMES and MONENC are actually pre-argonaut, we just never updated
ceph_monmap_decode() for them.  Decoding is unconditional, see commit
23c625ce30 ("libceph: assume argonaut on the server side").

SERVER_MIMIC doesn't bear any meaning for the kernel client.

Since ceph_decode_entity_addrvec() is guarded by encoding version
checks (and in msgr2 case it is guarded implicitly by the fact that
server is speaking msgr2), we assume MSG_ADDR2 for it.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
8921f25116 libceph: factor out finish_auth()
In preparation for msgr2, factor out finish_auth() so it is suitable
for both existing MAuth message based authentication and upcoming msgr2
authentication exchange.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
c1c0ce78f4 libceph: drop ac->ops->name field
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
59711f9ec2 libceph: amend cephx init_protocol() and build_request()
In msgr2, initial authentication happens with an exchange of msgr2
control frames -- MAuth message and struct ceph_mon_request_header
aren't used.  Make that optional.

Stop reporting cephx protocol as "x".  Use "cephx" instead.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
285ea34fc8 libceph, ceph: incorporate nautilus cephx changes
- request service tickets together with auth ticket.  Currently we get
  auth ticket via CEPHX_GET_AUTH_SESSION_KEY op and then request service
  tickets via CEPHX_GET_PRINCIPAL_SESSION_KEY op in a separate message.
  Since nautilus, desired service tickets are shared togther with auth
  ticket in CEPHX_GET_AUTH_SESSION_KEY reply.

- propagate session key and connection secret, if any.  In preparation
  for msgr2, update handle_reply() and verify_authorizer_reply() auth
  ops to propagate session key and connection secret.  Since nautilus,
  if secure mode is negotiated, connection secret is shared either in
  CEPHX_GET_AUTH_SESSION_KEY reply (for mons) or in a final authorizer
  reply (for osds and mdses).

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
6610fff278 libceph: safer en/decoding of cephx requests and replies
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
f79e25b087 libceph: more insight into ticket expiry and invalidation
Make it clear that "need" is a union of "missing" and "have, but up
for renewal" and dout when the ticket goes missing due to expiry or
invalidation by client.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
a56dd9bf47 libceph: move msgr1 protocol specific fields to its own struct
A couple whitespace fixups, no functional changes.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
2f713615dd libceph: move msgr1 protocol implementation to its own file
A pure move, no other changes.

Note that ceph_tcp_recv{msg,page}() and ceph_tcp_send{msg,page}()
helpers are also moved.  msgr2 will bring its own, more efficient,
variants based on iov_iter.  Switching msgr1 to them was considered
but decided against to avoid subtle regressions.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:50 +01:00
Ilya Dryomov
566050e17e libceph: separate msgr1 protocol implementation
In preparation for msgr2, define internal messenger <-> protocol
interface (as opposed to external messenger <-> client interface, which
is struct ceph_connection_operations) consisting of try_read(),
try_write(), revoke(), revoke_incoming(), opened(), reset_session() and
reset_protocol() ops.  The semantics are exactly the same as they are
now.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
6503e0b69c libceph: export remaining protocol independent infrastructure
In preparation for msgr2, make all protocol independent functions
in messenger.c global.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
699921d9e6 libceph: export zero_page
In preparation for msgr2, make zero_page global.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
3fefd43e74 libceph: rename and export con->flags bits
In preparation for msgr2, move the defines to the header file.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
6d7f62bfb5 libceph: rename and export con->state states
In preparation for msgr2, rename msgr1 specific states and move the
defines to the header file.

Also drop state transition comments.  They don't cover all possible
transitions (e.g. NEGOTIATING -> STANDBY, etc) and currently do more
harm than good.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
30be780a87 libceph: make con->state an int
unsigned long is a leftover from when con->state used to be a set of
bits managed with set_bit(), clear_bit(), etc.  Save a bit of memory.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
2f68738037 libceph: don't export ceph_messenger_{init_fini}() to modules
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
fd1a154cad libceph: make sure our addr->port is zero and addr->nonce is non-zero
Our messenger instance addr->port is normally zero -- anything else is
nonsensical because as a client we connect to multiple servers and don't
listen on any port.  However, a user can supply an arbitrary addr:port
via ip option and the port is currently preserved.  Zero it.

Conversely, make sure our addr->nonce is non-zero.  A zero nonce is
special: in combination with a zero port, it is used to blocklist the
entire ip.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
771294fe07 libceph: factor out ceph_con_get_out_msg()
Move the logic of grabbing the next message from the queue into its own
function.  Like ceph_con_in_msg_alloc(), this is protocol independent.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
fc4c128e15 libceph: change ceph_con_in_msg_alloc() to take hdr
ceph_con_in_msg_alloc() is protocol independent, but con->in_hdr (and
struct ceph_msg_header in general) is msgr1 specific.  While the struct
is deeply ingrained inside and outside the messenger, con->in_hdr field
can be separated.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
8ee8abf797 libceph: change ceph_msg_data_cursor_init() to take cursor
Make it possible to have local cursors and embed them outside struct
ceph_msg.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
0247192809 libceph: handle discarding acked and requeued messages separately
Make it easier to follow and remove dependency on msgr1 specific
CEPH_MSGR_TAG_SEQ.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
5cd8da3a1c libceph: drop msg->ack_stamp field
It is set in process_ack() but never used.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
d3c1248cac libceph: remove redundant session reset log message
Stick with pr_info message because session reset isn't an error most of
the time.  When it is (i.e. if the server denies the reconnect attempt),
we get a bunch of other pr_err messages.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:49 +01:00
Ilya Dryomov
a3da057bbd libceph: clear con->peer_global_seq on RESETSESSION
con->peer_global_seq is part of session state.  Clear it when
the server tells us to reset, not just in ceph_con_close().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Ilya Dryomov
5963c3d01c libceph: rename reset_connection() to ceph_con_reset_session()
With just session reset bits left, rename appropriately.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Ilya Dryomov
3596f4c124 libceph: split protocol reset bits out of reset_connection()
Move protocol reset bits into ceph_con_reset_protocol(), leaving
just session reset bits.

Note that con->out_skip is now reset on faults.  This fixes a crash
in the case of a stateful session getting a fault while in the middle
of revoking a message.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Ilya Dryomov
90b6561a05 libceph: don't call reset_connection() on version/feature mismatches
A fault due to a version mismatch or a feature set mismatch used to be
treated differently from other faults: the connection would get closed
without trying to reconnect and there was a ->bad_proto() connection op
for notifying about that.

This changed a long time ago, see commits 6384bb8b8e ("libceph: kill
bad_proto ceph connection op") and 0fa6ebc600 ("libceph: fix protocol
feature mismatch failure path").  Nowadays these aren't any different
from other faults (i.e. we try to reconnect even though the mismatch
won't resolve until the server is replaced).  reset_connection() calls
there are rather confusing because reset_connection() resets a session
together an individual instance of the protocol.  This is cleaned up
in the next patch.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Ilya Dryomov
418af5b3bf libceph: lower exponential backoff delay
The current setting allows the backoff to climb up to 5 minutes.  This
is too high -- it becomes hard to tell whether the client is stuck on
something or just in backoff.

In userspace, ms_max_backoff is defaulted to 15 seconds.  Let's do the
same.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Ilya Dryomov
b77f8f0e4f libceph: include middle_len in process_message() dout
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-12-14 23:21:48 +01:00
Linus Torvalds
9e4b0d55d8 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Add speed testing on 1420-byte blocks for networking

  Algorithms:
   - Improve performance of chacha on ARM for network packets
   - Improve performance of aegis128 on ARM for network packets

  Drivers:
   - Add support for Keem Bay OCS AES/SM4
   - Add support for QAT 4xxx devices
   - Enable crypto-engine retry mechanism in caam
   - Enable support for crypto engine on sdm845 in qce
   - Add HiSilicon PRNG driver support"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (161 commits)
  crypto: qat - add capability detection logic in qat_4xxx
  crypto: qat - add AES-XTS support for QAT GEN4 devices
  crypto: qat - add AES-CTR support for QAT GEN4 devices
  crypto: atmel-i2c - select CONFIG_BITREVERSE
  crypto: hisilicon/trng - replace atomic_add_return()
  crypto: keembay - Add support for Keem Bay OCS AES/SM4
  dt-bindings: Add Keem Bay OCS AES bindings
  crypto: aegis128 - avoid spurious references crypto_aegis128_update_simd
  crypto: seed - remove trailing semicolon in macro definition
  crypto: x86/poly1305 - Use TEST %reg,%reg instead of CMP $0,%reg
  crypto: x86/sha512 - Use TEST %reg,%reg instead of CMP $0,%reg
  crypto: aesni - Use TEST %reg,%reg instead of CMP $0,%reg
  crypto: cpt - Fix sparse warnings in cptpf
  hwrng: ks-sa - Add dependency on IOMEM and OF
  crypto: lib/blake2s - Move selftest prototype into header file
  crypto: arm/aes-ce - work around Cortex-A57/A72 silion errata
  crypto: ecdh - avoid unaligned accesses in ecdh_set_secret()
  crypto: ccree - rework cache parameters handling
  crypto: cavium - Use dma_set_mask_and_coherent to simplify code
  crypto: marvell/octeontx - Use dma_set_mask_and_coherent to simplify code
  ...
2020-12-14 12:18:19 -08:00
Trond Myklebust
5802f7c2a6 SUNRPC: When expanding the buffer, we may need grow the sparse pages
If we're shifting the page data to the right, and this happens to be a
sparse page array, then we may need to allocate new pages in order to
receive the data.

Reported-by: "Mkrtchyan, Tigran" <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:07 -05:00
Trond Myklebust
f8d0e60f10 SUNRPC: Cleanup - constify a number of xdr_buf helpers
There are a number of xdr helpers for struct xdr_buf that do not change
the structure itself. Mark those as taking const pointers for
documentation purposes.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:07 -05:00
Trond Myklebust
5a5f1c2c2c SUNRPC: Clean up open coded setting of the xdr_stream 'nwords' field
Move the setting of the xdr_stream 'nwords' field into the helpers that
reset the xdr_stream cursor.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:07 -05:00
Trond Myklebust
e43ac22b83 SUNRPC: _copy_to/from_pages() now check for zero length
Clean up callers of _copy_to/from_pages() that still check for a zero
length.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:07 -05:00
Trond Myklebust
6707fbd7d3 SUNRPC: Cleanup xdr_shrink_bufhead()
Clean up xdr_shrink_bufhead() to use the new helpers instead of doing
its own thing.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:07 -05:00
Trond Myklebust
c4f2f591f0 SUNRPC: Fix xdr_expand_hole()
We do want to try to grow the buffer if possible, but if that attempt
fails, we still want to move the data and truncate the XDR message.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:07 -05:00
Trond Myklebust
9a20f6f4e6 SUNRPC: Fixes for xdr_align_data()
The main use case right now for xdr_align_data() is to shift the page
data to the left, and in practice shrink the total XDR data buffer.
This patch ensures that we fix up the accounting for the buffer length
as we shift that data around.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:07 -05:00
Trond Myklebust
c54e959b36 SUNRPC: _shift_data_left/right_pages should check the shift length
Exit early if the shift is zero.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:07 -05:00
Chuck Lever
15261b9126 xprtrdma: Fix XDRBUF_SPARSE_PAGES support
Olga K. observed that rpcrdma_marsh_req() allocates sparse pages
only when it has determined that a Reply chunk is necessary. There
are plenty of cases where no Reply chunk is needed, but the
XDRBUF_SPARSE_PAGES flag is set. The result would be a crash in
rpcrdma_inline_fixup() when it tries to copy parts of the received
Reply into a missing page.

To avoid crashing, handle sparse page allocation up front.

Until XATTR support was added, this issue did not appear often
because the only SPARSE_PAGES consumer always expected a reply large
enough to always require a Reply chunk.

Reported-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:51:07 -05:00
Dan Aloni
ac9645c873 sunrpc: fix xs_read_xdr_buf for partial pages receive
When receiving pages data, return value 'ret' when positive includes
`buf->page_base`, so we should subtract that before it is used for
changing `offset` and comparing against `want`.

This was discovered on the very rare cases where the server returned a
chunk of bytes that when added to the already received amount of bytes
for the pages happened to match the current `recv.len`, for example
on this case:

     buf->page_base : 258356
     actually received from socket: 1740
     ret : 260096
     want : 260096

In this case neither of the two 'if ... goto out' trigger, and we
continue to tail parsing.

Worth to mention that the ensuing EMSGSIZE from the continued execution of
`xs_read_xdr_buf` may be observed by an application due to 4 superfluous
bytes being added to the pages data.

Fixes: 277e4ab7d5 ("SUNRPC: Simplify TCP receive code by switching to using iterators")
Signed-off-by: Dan Aloni <dan@kernelim.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-14 06:46:55 -05:00
Xie He
13458ffe0a net: x25: Remove unimplemented X.25-over-LLC code stubs
According to the X.25 documentation, there was a plan to implement
X.25-over-802.2-LLC. It never finished but left various code stubs in the
X.25 code. At this time it is unlikely that it would ever finish so it
may be better to remove those code stubs.

Also change the documentation to make it clear that this is not a ongoing
plan anymore. Change words like "will" to "could", "would", etc.

Cc: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Link: https://lore.kernel.org/r/20201209033346.83742-1-xie.he.0141@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-12 17:15:33 -08:00
SeongJae Park
0b9b241406 inet: frags: batch fqdir destroy works
On a few of our systems, I found frequent 'unshare(CLONE_NEWNET)' calls
make the number of active slab objects including 'sock_inode_cache' type
rapidly and continuously increase.  As a result, memory pressure occurs.

In more detail, I made an artificial reproducer that resembles the
workload that we found the problem and reproduce the problem faster.  It
merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop.  It takes
about 2 minutes.  On 40 CPU cores / 70GB DRAM machine, the available
memory continuously reduced in a fast speed (about 120MB per second,
15GB in total within the 2 minutes).  Note that the issue don't
reproduce on every machine.  On my 6 CPU cores machine, the problem
didn't reproduce.

'cleanup_net()' and 'fqdir_work_fn()' are functions that deallocate the
relevant memory objects.  They are asynchronously invoked by the work
queues and internally use 'rcu_barrier()' to ensure safe destructions.
'cleanup_net()' works in a batched maneer in a single thread worker,
while 'fqdir_work_fn()' works for each 'fqdir_exit()' call in the
'system_wq'.  Therefore, 'fqdir_work_fn()' called frequently under the
workload and made the contention for 'rcu_barrier()' high.  In more
detail, the global mutex, 'rcu_state.barrier_mutex' became the
bottleneck.

This commit avoids such contention by doing the 'rcu_barrier()' and
subsequent lightweight works in a batched manner, as similar to that of
'cleanup_net()'.  The fqdir hashtable destruction, which is done before
the 'rcu_barrier()', is still allowed to run in parallel for fast
processing, but this commit makes it to use a dedicated work queue
instead of the 'system_wq', to make sure that the number of threads is
bounded.

Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201211112405.31158-1-sjpark@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-12 15:08:54 -08:00
Jakub Kicinski
e2437ac2f5 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2020-12-12

Just one patch this time:

1) Redact the SA keys with kernel lockdown confidentiality.
   If enabled, no secret keys are sent to uuserspace.
   From Antony Antony.

* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
  xfrm: redact SA secret with lockdown confidentiality
====================

Link: https://lore.kernel.org/r/20201212085737.2101294-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-12 12:28:42 -08:00
Pablo Neira Ayuso
48b0ae046e netfilter: nftables: netlink support for several set element expressions
This patch adds three new netlink attributes to encapsulate a list of
expressions per set elements:

- NFTA_SET_EXPRESSIONS: this attribute provides the set definition in
  terms of expressions. New set elements get attached the list of
  expressions that is specified by this new netlink attribute.
- NFTA_SET_ELEM_EXPRESSIONS: this attribute allows users to restore (or
  initialize) the stateful information of set elements when adding an
  element to the set.
- NFTA_DYNSET_EXPRESSIONS: this attribute specifies the list of
  expressions that the set element gets when it is inserted from the
  packet path.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-12 19:20:52 +01:00
Pablo Neira Ayuso
563125a73a netfilter: nftables: generalize set extension to support for several expressions
This patch replaces NFT_SET_EXPR by NFT_SET_EXT_EXPRESSIONS. This new
extension allows to attach several expressions to one set element (not
only one single expression as NFT_SET_EXPR provides). This patch
prepares for support for several expressions per set element in the
netlink userspace API.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-12 19:20:24 +01:00
Jakub Kicinski
00f7763a26 A new set of wireless changes:
* validate key indices for key deletion
  * more preamble support in mac80211
  * various 6 GHz scan fixes/improvements
  * a common SAR power limitations API
  * various small fixes & code improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl/TfYAACgkQB8qZga/f
 l8ShjQ/9Hd6KjvA7keATtdjR7rDHo7H2nBKV/LukpuHsiTRrXTVOAfkcUTOb2hfR
 7SJMzsUXdJGivbwm4lkx5TrIgiJm1hfW3zG0PFOs/bIuXs/KICrb+kLgQWiRIUfa
 RIinf8BGPH3GgcCHDcWFUrnfnBVchrPUx2wIHoCQbCzLHIhB6q6x8jEJA67+smpv
 57tDfUhm6pf6OYOqVN8HYlo0uRAIn1ImneplDelCmCI1dzlneEkMhqZuBXqWpD/I
 C5vU+MjoOsJiW1XkmYOMe6VKQ/Bve06GUWs830S7aROOEfByv+ptlR9IjqvHvPIm
 UI9NivfXQiZr6S7yD1m2xV7a14UMCIzYarwaM/I/NHAWF/Y4vzHFVzQjLfVKqMCV
 dxxsWN+Yg7Gx3T5Fj3NNiQgnPF9ASVqgMrlC59ga+4If0y60V7dOSFuo9HF7AWgP
 NIWKVI3He7Mb5TciM+BX5YQWkJiCSZXs427WLO6p0bp3kAgS6N6BThUraGCogXVF
 1BT/y5G3QzZwg02vL3lxgWglXoH/e63UYCPt0r+i5c83Z+n4YnFSUZyRSViy9Elj
 DkCgdxmP0OtM+FaHxLdYm+FL4GXaGWQVNORIDP0ViSrstPgSxhWIgVj/pKNcKC7g
 bJI/IXm7eQkW5SXOafmhVV0TmvDOt/mM46E0CeWcPTqIhetk3lM=
 =+qd6
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-net-next-2020-12-11' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
A new set of wireless changes:
 * validate key indices for key deletion
 * more preamble support in mac80211
 * various 6 GHz scan fixes/improvements
 * a common SAR power limitations API
 * various small fixes & code improvements

* tag 'mac80211-next-for-net-next-2020-12-11' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next: (35 commits)
  mac80211: add ieee80211_set_sar_specs
  nl80211: add common API to configure SAR power limitations
  mac80211: fix a mistake check for rx_stats update
  mac80211: mlme: save ssid info to ieee80211_bss_conf while assoc
  mac80211: Update rate control on channel change
  mac80211: don't filter out beacons once we start CSA
  mac80211: Fix calculation of minimal channel width
  mac80211: ignore country element TX power on 6 GHz
  mac80211: use bitfield helpers for BA session action frames
  mac80211: support Rx timestamp calculation for all preamble types
  mac80211: don't set set TDLS STA bandwidth wider than possible
  mac80211: support driver-based disconnect with reconnect hint
  cfg80211: support immediate reconnect request hint
  mac80211: use struct assignment for he_obss_pd
  cfg80211: remove struct ieee80211_he_bss_color
  nl80211: validate key indexes for cfg80211_registered_device
  cfg80211: include block-tx flag in channel switch started event
  mac80211: disallow band-switch during CSA
  ieee80211: update reduced neighbor report TBTT info length
  cfg80211: Save the regulatory domain when setting custom regulatory
  ...
====================

Link: https://lore.kernel.org/r/20201211142552.209018-1-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-12 10:07:56 -08:00
Pablo Neira Ayuso
8cfd9b0f85 netfilter: nftables: generalize set expressions support
Currently, the set infrastucture allows for one single expressions per
element. This patch extends the existing infrastructure to allow for up
to two expressions. This is not updating the netlink API yet, this is
coming as an initial preparation patch.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-12 11:44:42 +01:00
Florian Westphal
86d21fc747 netfilter: ctnetlink: add timeout and protoinfo to destroy events
DESTROY events do not include the remaining timeout.

Add the timeout if the entry was removed explicitly. This can happen
when a conntrack gets deleted prematurely, e.g. due to a tcp reset,
module removal, netdev notifier (nat/masquerade device went down),
ctnetlink and so on.

Add the protocol state too for the destroy message to check for abnormal
state on connection termination.

Joint work with Pablo.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-12 11:44:42 +01:00
Jakub Kicinski
46d5e62dd3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
xdp_return_frame_bulk() needs to pass a xdp_buff
to __xdp_return().

strlcpy got converted to strscpy but here it makes no
functional difference, so just keep the right code.

Conflicts:
	net/netfilter/nf_tables_api.c

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-11 22:29:38 -08:00
Carl Huang
c534e093d8 mac80211: add ieee80211_set_sar_specs
This change registers ieee80211_set_sar_specs to
mac80211_config_ops, so cfg80211 can call it.

Signed-off-by: Carl Huang <cjhuang@codeaurora.org>
Reviewed-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Abhishek Kumar <kuabhs@chromium.org>
Link: https://lore.kernel.org/r/20201203103728.3034-3-cjhuang@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 13:39:59 +01:00
Carl Huang
6bdb68cef7 nl80211: add common API to configure SAR power limitations
NL80211_CMD_SET_SAR_SPECS is added to configure SAR from
user space. NL80211_ATTR_SAR_SPEC is used to pass the SAR
power specification when used with NL80211_CMD_SET_SAR_SPECS.

Wireless driver needs to register SAR type, supported frequency
ranges to wiphy, so user space can query it. The index in
frequency range is used to specify which sub band the power
limitation applies to. The SAR type is for compatibility, so later
other SAR mechanism can be implemented without breaking the user
space SAR applications.

Normal process is user space queries the SAR capability, and
gets the index of supported frequency ranges and associates the
power limitation with this index and sends to kernel.

Here is an example of message send to kernel:
8c 00 00 00 08 00 01 00 00 00 00 00 38 00 2b 81
08 00 01 00 00 00 00 00 2c 00 02 80 14 00 00 80
08 00 02 00 00 00 00 00 08 00 01 00 38 00 00 00
14 00 01 80 08 00 02 00 01 00 00 00 08 00 01 00
48 00 00 00

NL80211_CMD_SET_SAR_SPECS:  0x8c
NL80211_ATTR_WIPHY:     0x01(phy idx is 0)
NL80211_ATTR_SAR_SPEC:  0x812b (NLA_NESTED)
NL80211_SAR_ATTR_TYPE:  0x00 (NL80211_SAR_TYPE_POWER)
NL80211_SAR_ATTR_SPECS: 0x8002 (NLA_NESTED)
freq range 0 power: 0x38 in 0.25dbm unit (14dbm)
freq range 1 power: 0x48 in 0.25dbm unit (18dbm)

Signed-off-by: Carl Huang <cjhuang@codeaurora.org>
Reviewed-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Abhishek Kumar <kuabhs@chromium.org>
Link: https://lore.kernel.org/r/20201203103728.3034-2-cjhuang@codeaurora.org
[minor edits, NLA parse cleanups]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 13:38:54 +01:00
Wen Gong
f879ac8ed6 mac80211: fix a mistake check for rx_stats update
It should be !is_multicast_ether_addr() in ieee80211_rx_h_sta_process()
for the rx_stats update, below commit remove the !, this patch is to
change it back.

It lead the rx rate "iw wlan0 station dump" become invalid for some
scenario when IEEE80211_HW_USES_RSS is set.

Fixes: 09a740ce35 ("mac80211: receive and process S1G beacons")
Signed-off-by: Wen Gong <wgong@codeaurora.org>
Link: https://lore.kernel.org/r/1607483189-3891-1-git-send-email-wgong@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 13:20:06 +01:00
Wen Gong
b0140fda62 mac80211: mlme: save ssid info to ieee80211_bss_conf while assoc
The ssid info of ieee80211_bss_conf is filled in ieee80211_start_ap()
for AP mode. For STATION mode, it is empty, save the info from struct
ieee80211_mgd_assoc_data, the struct ieee80211_mgd_assoc_data will be
freed after assoc, so the ssid info of ieee80211_mgd_assoc_data can not
access after assoc, save ssid info to ieee80211_bss_conf, then ssid info
can be still access after assoc.

Signed-off-by: Wen Gong <wgong@codeaurora.org>
Link: https://lore.kernel.org/r/1607312195-3583-2-git-send-email-wgong@codeaurora.org
[reset on disassoc]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 13:20:05 +01:00
Ilan Peer
44b72ca816 mac80211: Update rate control on channel change
A channel change or a channel bandwidth change can impact the
rate control logic. However, the rate control logic was not updated
before/after such a change, which might result in unexpected
behavior.

Fix this by updating the stations rate control logic when the
corresponding channel context changes.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201206145305.600d967fe3c9.I48305f25cfcc9c032c77c51396e9e9b882748a86@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 13:20:05 +01:00
Emmanuel Grumbach
189a164d0f mac80211: don't filter out beacons once we start CSA
I hit a bug in which we started a CSA with an action frame,
but the AP changed its mind and didn't change the beacon.
The CSA wasn't cancelled and we lost the connection.

The beacons were ignored because they never changed: they
never contained any CSA IE. Because they never changed, the
CRC of the beacon didn't change either which made us ignore
the beacons instead of processing them.

Now what happens is:
1) beacon has CRC X and it is valid. No CSA IE in the beacon
2) as long as beacon's CRC X, don't process their IEs
3) rx action frame with CSA
4) invalidate the beacon's CRC
5) rx beacon, CRC is still X, but now it is invalid
6) process the beacon, detect there is no CSA IE
7) abort CSA

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201206145305.83470b8407e6.I739b907598001362744692744be15335436b8351@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 13:20:05 +01:00
Ilan Peer
bbf31e88df mac80211: Fix calculation of minimal channel width
When calculating the minimal channel width for channel context,
the current operation Rx channel width of a station was used and not
the overall channel width capability of the station, i.e., both for
Tx and Rx.

Fix ieee80211_get_sta_bw() to use the maximal channel width the
station is capable. While at it make the function static.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201206145305.4387040b99a0.I74bcf19238f75a5960c4098b10e355123d933281@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 13:20:05 +01:00
Johannes Berg
2dedfe1dbd mac80211: ignore country element TX power on 6 GHz
Updates to the 802.11ax draft are coming that deprecate the
country element in favour of the transmit power envelope
element, and make the maximum transmit power level field in
the triplets reserved, so if we parse them we'd use 0 dBm
transmit power.

Follow suit and completely ignore the element on 6 GHz for
purposes of determining TX power.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201206145305.9abf9f6b4f88.Icb6e52af586edcc74f1f0360e8f6fc9ef2bfe8f5@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 13:20:05 +01:00
Johannes Berg
db8ebd06cc mac80211: use bitfield helpers for BA session action frames
Use the appropriate bitfield helpers for encoding and decoding
the capability field in the BA session action frames instead of
open-coding the shifts/masks.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201206145305.0c46e5097cc0.I06e75706770c40b9ba1cabd1f8a78ab7a05c5b73@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 13:20:05 +01:00
Avraham Stern
da3882331a mac80211: support Rx timestamp calculation for all preamble types
Add support for calculating the Rx timestamp for HE frames.
Since now all frame types are supported, allow setting the Rx
timestamp regardless of the frame type.

Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201206145305.4786559af475.Ia54486bb0a12e5351f9d5c60ef6fcda7c9e7141c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 13:20:05 +01:00
Johannes Berg
f65607cdbc mac80211: don't set set TDLS STA bandwidth wider than possible
When we set up a TDLS station, we set sta->sta.bandwidth solely based
on the capabilities, because the "what's the current bandwidth" check
is bypassed and only applied for other types of stations.

This leads to the unfortunate scenario that the sta->sta.bandwidth is
160 MHz if both stations support it, but we never actually configure
this bandwidth unless the AP is already using 160 MHz; even for wider
bandwidth support we only go up to 80 MHz (at least right now.)

For iwlwifi, this can also lead to firmware asserts, telling us that
we've configured the TX rates for a higher bandwidth than is actually
available due to the PHY configuration.

For non-TDLS, we check against the interface's requested bandwidth,
but we explicitly skip this check for TDLS to cope with the wider BW
case. Change this to
 (a) still limit to the TDLS peer's own chandef, which gets factored
     into the overall PHY configuration we request from the driver,
     and
 (b) limit it to when the TDLS peer is authorized, because it's only
     factored into the channel context in this case.

Fixes: 504871e602 ("mac80211: fix bandwidth computation for TDLS peers")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201206145305.fcc7d29c4590.I11f77e9e25ddf871a3c8d5604650c763e2c5887a@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 13:20:05 +01:00
Johannes Berg
3f8a39ff28 mac80211: support driver-based disconnect with reconnect hint
Support the driver indicating that a disconnection needs
to be performed, and pass through the reconnect hint in
this case.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201206145305.5c8dab7a22a0.I58459fdf6968b16c90cab9c574f0f04ca22b0c79@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 13:20:05 +01:00
Johannes Berg
3bb02143ff cfg80211: support immediate reconnect request hint
There are cases where it's necessary to disconnect, but an
immediate reconnection is desired. Support a hint to userspace
that this is the case, by including a new attribute in the
deauth or disassoc event.

Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201206145305.58d33941fb9d.I0e7168c205c7949529c8e3b86f3c9b12c01a7017@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 13:20:05 +01:00
Johannes Berg
a5a55032ea mac80211: use struct assignment for he_obss_pd
Use a struct assignment here, which is clearer than the
memcpy() and type-safe as well.

Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201206145305.2ab3aad7d5fc.Iaca4ee6db651b7de17e4351f4be7973ff8600186@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 13:20:05 +01:00
Johannes Berg
539a36ba2f cfg80211: remove struct ieee80211_he_bss_color
We don't really use this struct, we're now using
struct cfg80211_he_bss_color instead.

Change the one place in mac80211 that's using the old
name to use struct assignment instead of memcpy() and
thus remove the wrong sizeof while at it.

Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201206145305.f6698d97ae4e.Iba2dffcb79c4ab80bde7407609806010b55edfdf@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 13:20:04 +01:00
Anant Thazhemadam
2d9463083c nl80211: validate key indexes for cfg80211_registered_device
syzbot discovered a bug in which an OOB access was being made because
an unsuitable key_idx value was wrongly considered to be acceptable
while deleting a key in nl80211_del_key().

Since we don't know the cipher at the time of deletion, if
cfg80211_validate_key_settings() were to be called directly in
nl80211_del_key(), even valid keys would be wrongly determined invalid,
and deletion wouldn't occur correctly.
For this reason, a new function - cfg80211_valid_key_idx(), has been
created, to determine if the key_idx value provided is valid or not.
cfg80211_valid_key_idx() is directly called in 2 places -
nl80211_del_key(), and cfg80211_validate_key_settings().

Reported-by: syzbot+49d4cab497c2142ee170@syzkaller.appspotmail.com
Tested-by: syzbot+49d4cab497c2142ee170@syzkaller.appspotmail.com
Suggested-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Anant Thazhemadam <anant.thazhemadam@gmail.com>
Link: https://lore.kernel.org/r/20201204215825.129879-1-anant.thazhemadam@gmail.com
Cc: stable@vger.kernel.org
[also disallow IGTK key IDs if no IGTK cipher is supported]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 13:20:04 +01:00
Johannes Berg
669b84134a cfg80211: include block-tx flag in channel switch started event
In the NL80211_CMD_CH_SWITCH_STARTED_NOTIFY event, include the
NL80211_ATTR_CH_SWITCH_BLOCK_TX flag attribute if block-tx was
requested by the AP.

Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201129172929.8953ef22cc64.Ifee9cab337a4369938545920ba5590559e91327a@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 12:59:37 +01:00
Johannes Berg
3660944a37 mac80211: disallow band-switch during CSA
If the AP advertises a band switch during CSA, we will not have
the right information to continue working with it, since it will
likely (have to) change its capabilities and we don't track any
capability changes at all. Additionally, we store e.g. supported
rates per band, and that information would become invalid.

Since this is a fringe scenario, just disconnect explicitly.

Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201129172929.0e2327107c06.I461adb07704e056b054a4a7c29b80c95a9f56637@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 12:59:10 +01:00
Ilan Peer
beee246951 cfg80211: Save the regulatory domain when setting custom regulatory
When custom regulatory was set, only the channels setting was updated, but
the regulatory domain was not saved. Fix it by saving it.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201129172929.290fa5c5568a.Ic5732aa64de6ee97ae3578bd5779fc723ba489d1@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 12:57:24 +01:00
Avraham Stern
c837cbad40 nl80211: always accept scan request with the duration set
Accept a scan request with the duration set even if the driver
does not support setting the scan dwell. The duration can be used
as a hint to the driver, but the driver may use its internal logic
for setting the scan dwell.

Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201129172929.9491a12f9226.Ia9c5b24fcefc5ce5592537507243391633a27e5f@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 12:57:11 +01:00
Ilan Peer
b45a19dd7e cfg80211: Update TSF and TSF BSSID for multi BSS
When a new BSS entry is created based on multi BSS IE, the
TSF and the TSF BSSID were not updated. Fix it.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201129172929.8377d5063827.I6f2011b6017c2ad507c61a3f1ca03b7177a46e32@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 12:57:02 +01:00
Ayala Beker
d590a125ee cfg80211: scan PSC channels in case of scan with wildcard SSID
In case of scan request with wildcard SSID, or in case of more
than one SSID in scan request, need to scan PSC channels even though
all the co-located APs found during the legacy bands scan indicated
that all the APs in their ESS are co-located, as we might find different
networks on the PSC channels.

Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201129172929.736415a9ca5d.If5b3578ae85e11a707a5da07e66ba85928ba702c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 12:55:16 +01:00
Ilan Peer
3598ae87fe mac80211: Skip entries with SAE H2E only membership selector
When parsing supported rates IE.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201129172929.8228e2be791e.I626c93241fef66bc71aa0cb9719aba1b11232cf1@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 12:54:58 +01:00
Ilan Peer
d6587602c5 cfg80211: Parse SAE H2E only membership selector
This extends the support for drivers that rebuild IEs in the
FW (same as with HT/VHT/HE).

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201129172929.4012647275f3.I1a93ae71c57ef0b6f58f99d47fce919d19d65ff0@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 12:54:41 +01:00
Johannes Berg
4271d4bde0 mac80211: support MIC error/replay detected counters driver update
Support the driver incrementing MIC error and replay detected
counters when having detected a bad frame, if it drops it directly
instead of relying on mac80211 to do the checks.

These are then exposed to userspace, though currently only in some
cases and in debugfs.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201129172929.fb59be9c6de8.Ife2260887366f585afadd78c983ebea93d2bb54b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 12:53:48 +01:00
Shaul Triebitz
081e1e7ece mac80211: he: remove non-bss-conf fields from bss_conf
ack_enabled and multi_sta_back_32bit are station capabilities
and should not be in the bss_conf structure.

Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201129172929.69a7f7753444.I405c4b5245145e24577512c477f19131d4036489@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 12:52:24 +01:00
Tom Rix
84674ef4d6 mac80211: remove trailing semicolon in macro definitions
The macro uses should have (and already have) the semicolon.

Signed-off-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/20201127193842.2876355-1-trix@redhat.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 12:51:55 +01:00
Gustavo A. R. Silva
d7832c7187 nl80211: Fix fall-through warnings for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of letting the code fall
through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/fe5afd456a1244751177e53359d3dd149a63a873.1605896060.git.gustavoars@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 12:51:33 +01:00
Gustavo A. R. Silva
aaaee2d68a mac80211: Fix fall-through warnings for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple
warnings by explicitly adding multiple break statements instead of
letting the code fall through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/1a9c4e8248e76e1361edbe2471a68773d87f0b67.1605896060.git.gustavoars@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 12:51:04 +01:00
Gustavo A. R. Silva
01c9c0ab35 cfg80211: Fix fall-through warnings for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of letting the code fall
through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/ed94a115106fa9c6df94d09b2a6c5791c618c4f2.1605896059.git.gustavoars@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 12:50:52 +01:00
Sami Tolvanen
32fc4a9ad5 cfg80211: fix callback type mismatches in wext-compat
Instead of casting callback functions to type iw_handler, which trips
indirect call checking with Clang's Control-Flow Integrity (CFI), add
stub functions with the correct function type for the callbacks.

Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201117205902.405316-1-samitolvanen@google.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 12:50:27 +01:00
Colin Ian King
c7ed0e683d net: wireless: make a const array static, makes object smaller
Don't populate the const array bws on the stack but instead it
static. Makes the object code smaller by 80 bytes:

Before:
   text	   data	    bss	    dec	    hex	filename
  85694	  16865	   1216	 103775	  1955f	./net/wireless/reg.o

After:
   text	   data	    bss	    dec	    hex	filename
  85518	  16961	   1216	 103695	  1950f	./net/wireless/reg.o

(gcc version 10.2.0)

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/20201116181636.362729-1-colin.king@canonical.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 12:50:02 +01:00
Lev Stipakov
36ec144f04 net: mac80211: use core API for updating TX/RX stats
Commits

  d3fd65484c ("net: core: add dev_sw_netstats_tx_add")
  451b05f413 ("net: netdevice.h: sw_netstats_rx_add helper)

have added API to update net device per-cpu TX/RX stats.

Use core API instead of ieee80211_tx/rx_stats().

Signed-off-by: Lev Stipakov <lev@openvpn.net>
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/20201113214623.144663-1-lev@openvpn.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 12:47:26 +01:00
Emmanuel Grumbach
14486c8261 rfkill: add a reason to the HW rfkill state
The WLAN device may exist yet not be usable. This can happen
when the WLAN device is controllable by both the host and
some platform internal component.
We need some arbritration that is vendor specific, but when
the device is not available for the host, we need to reflect
this state towards the user space.

Add a reason field to the rfkill object (and event) so that
userspace can know why the device is in rfkill: because some
other platform component currently owns the device, or
because the actual hw rfkill signal is asserted.

Capable userspace can now determine the reason for the rfkill
and possibly do some negotiation on a side band channel using
a proprietary protocol to gain ownership on the device in case
the device is owned by some other component. When the host
gains ownership on the device, the kernel can remove the
RFKILL_HARD_BLOCK_NOT_OWNER reason and the hw rfkill state
will be off. Then, the userspace can bring the device up and
start normal operation.

The rfkill_event structure is enlarged to include the additional
byte, it is now 9 bytes long. Old user space will ask to read
only 8 bytes so that the kernel can know not to feed them with
more data. When the user space writes 8 bytes, new kernels will
just read what is present in the file descriptor. This new byte
is read only from the userspace standpoint anyway.

If a new user space uses an old kernel, it'll ask to read 9 bytes
but will get only 8, and it'll know that it didn't get the new
state. When it'll write 9 bytes, the kernel will again ignore
this new byte which is read only from the userspace standpoint.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Link: https://lore.kernel.org/r/20201104134641.28816-1-emmanuel.grumbach@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-11 12:47:17 +01:00
David S. Miller
d9838b1d39 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2020-12-10

The following pull-request contains BPF updates for your *net* tree.

We've added 21 non-merge commits during the last 12 day(s) which contain
a total of 21 files changed, 163 insertions(+), 88 deletions(-).

The main changes are:

1) Fix propagation of 32-bit signed bounds from 64-bit bounds, from Alexei.

2) Fix ring_buffer__poll() return value, from Andrii.

3) Fix race in lwt_bpf, from Cong.

4) Fix test_offload, from Toke.

5) Various xsk fixes.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!

Also thanks to reporters, reviewers and testers of commits in this pull-request:

Cong Wang, Hulk Robot, Jakub Kicinski, Jean-Philippe Brucker, John
Fastabend, Magnus Karlsson, Maxim Mikityanskiy, Yonghong Song
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-10 14:29:30 -08:00
Jakub Kicinski
51e13685bd rtnetlink: RCU-annotate both dimensions of rtnl_msg_handlers
We use rcu_assign_pointer to assign both the table and the entries,
but the entries are not marked as __rcu. This generates sparse
warnings.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-10 13:35:59 -08:00
Arjun Roy
e0fecb289a tcp: correctly handle increased zerocopy args struct size
A prior patch increased the size of struct tcp_zerocopy_receive
but did not update do_tcp_getsockopt() handling to properly account
for this.

This patch simply reintroduces content erroneously cut from the
referenced prior patch that handles the new struct size.

Fixes: 18fb76ed53 ("net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.")
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-10 13:10:32 -08:00
Oliver Hartkopp
921ca574cd can: isotp: add SF_BROADCAST support for functional addressing
When CAN_ISOTP_SF_BROADCAST is set in the CAN_ISOTP_OPTS flags the CAN_ISOTP
socket is switched into functional addressing mode, where only single frame
(SF) protocol data units can be send on the specified CAN interface and the
given tp.tx_id after bind().

In opposite to normal and extended addressing this socket does not register a
CAN-ID for reception which would be needed for a 1-to-1 ISOTP connection with a
segmented bi-directional data transfer.

Sending SFs on this socket is therefore a TX-only 'broadcast' operation.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Thomas Wagner <thwa1@web.de>
Link: https://lore.kernel.org/r/20201206144731.4609-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-12-10 09:31:40 +01:00
Guillaume Nault
7fdd375e38 net: sched: Fix dump of MPLS_OPT_LSE_LABEL attribute in cls_flower
TCA_FLOWER_KEY_MPLS_OPT_LSE_LABEL is a u32 attribute (MPLS label is
20 bits long).

Fixes the following bug:

 $ tc filter add dev ethX ingress protocol mpls_uc \
     flower mpls lse depth 2 label 256             \
     action drop

 $ tc filter show dev ethX ingress
   filter protocol mpls_uc pref 49152 flower chain 0
   filter protocol mpls_uc pref 49152 flower chain 0 handle 0x1
     eth_type 8847
     mpls
       lse depth 2 label 0  <-- invalid label 0, should be 256
   ...

Fixes: 61aec25a6d ("cls_flower: Support filtering on multiple MPLS Label Stack Entries")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 20:39:38 -08:00
Xie He
6b21c0bb3a net: x25: Fix handling of Restart Request and Restart Confirmation
1. When the x25 module gets loaded, layer 2 may already be running and
connected. In this case, although we are in X25_LINK_STATE_0, we still
need to handle the Restart Request received, rather than ignore it.

2. When we are in X25_LINK_STATE_2, we have already sent a Restart Request
and is waiting for the Restart Confirmation with t20timer. t20timer will
restart itself repeatedly forever so it will always be there, as long as we
are in State 2. So we don't need to check x25_t20timer_pending again.

Fixes: d023b2b9cc ("net/x25: fix restart request/confirm handling")
Cc: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Acked-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 19:34:25 -08:00
Paolo Abeni
d7b1bfd083 mptcp: be careful on subflows shutdown
When the workqueue disposes of the msk, the subflows can still
receive some data from the peer after __mptcp_close_ssk()
completes.

The above could trigger a race between the msk receive path and the
msk destruction. Acquiring the mptcp_data_lock() in __mptcp_destroy_sock()
will not save the day: the rx path could be reached even after msk
destruction completes.

Instead use the subflow 'disposable' flag to prevent entering
the msk receive path after __mptcp_close_ssk().

Fixes: e16163b6e2 ("mptcp: refactor shutdown and close")
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 19:31:58 -08:00
Paolo Abeni
0597d0f8e0 mptcp: plug subflow context memory leak
When a MPTCP listener socket is closed with unaccepted
children pending, the ULP release callback will be invoked,
but nobody will call into __mptcp_close_ssk() on the
corresponding subflow.

As a consequence, at ULP release time, the 'disposable' flag
will be cleared and the subflow context memory will be leaked.

This change addresses the issue always freeing the context if
the subflow is still in the accept queue at ULP release time.

Additionally, this fixes an incorrect code reference in the
related comment.

Note: this fix leverages the changes introduced by the previous
commit.

Fixes: e16163b6e2 ("mptcp: refactor shutdown and close")
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 19:31:58 -08:00
Paolo Abeni
5b950ff433 mptcp: link MPC subflow into msk only after accept
Christoph reported the following splat:

WARNING: CPU: 0 PID: 4615 at net/ipv4/inet_connection_sock.c:1031 inet_csk_listen_stop+0x8e8/0xad0 net/ipv4/inet_connection_sock.c:1031
Modules linked in:
CPU: 0 PID: 4615 Comm: syz-executor.4 Not tainted 5.9.0 #37
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:inet_csk_listen_stop+0x8e8/0xad0 net/ipv4/inet_connection_sock.c:1031
Code: 03 00 00 00 e8 79 b2 3d ff e9 ad f9 ff ff e8 1f 76 ba fe be 02 00 00 00 4c 89 f7 e8 62 b2 3d ff e9 14 f9 ff ff e8 08 76 ba fe <0f> 0b e9 97 f8 ff ff e8 fc 75 ba fe be 03 00 00 00 4c 89 f7 e8 3f
RSP: 0018:ffffc900037f7948 EFLAGS: 00010293
RAX: ffff88810a349c80 RBX: ffff888114ee1b00 RCX: ffffffff827b14cd
RDX: 0000000000000000 RSI: ffffffff827b1c38 RDI: 0000000000000005
RBP: ffff88810a2a8000 R08: ffff88810a349c80 R09: fffff520006fef1f
R10: 0000000000000003 R11: fffff520006fef1e R12: ffff888114ee2d00
R13: dffffc0000000000 R14: 0000000000000001 R15: ffff888114ee1d68
FS:  00007f2ac1945700(0000) GS:ffff88811b400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd44798bc0 CR3: 0000000109810002 CR4: 0000000000170ef0
Call Trace:
 __tcp_close+0xd86/0x1110 net/ipv4/tcp.c:2433
 __mptcp_close_ssk+0x256/0x430 net/mptcp/protocol.c:1761
 __mptcp_destroy_sock+0x49b/0x770 net/mptcp/protocol.c:2127
 mptcp_close+0x62d/0x910 net/mptcp/protocol.c:2184
 inet_release+0xe9/0x1f0 net/ipv4/af_inet.c:434
 __sock_release+0xd2/0x280 net/socket.c:596
 sock_close+0x15/0x20 net/socket.c:1277
 __fput+0x276/0x960 fs/file_table.c:281
 task_work_run+0x109/0x1d0 kernel/task_work.c:151
 get_signal+0xe8f/0x1d40 kernel/signal.c:2561
 arch_do_signal+0x88/0x1b60 arch/x86/kernel/signal.c:811
 exit_to_user_mode_loop kernel/entry/common.c:161 [inline]
 exit_to_user_mode_prepare+0x9b/0xf0 kernel/entry/common.c:191
 syscall_exit_to_user_mode+0x22/0x150 kernel/entry/common.c:266
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f2ac1254469
Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ff 49 2b 00 f7 d8 64 89 01 48
RSP: 002b:00007f2ac1944dc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffbf RBX: 000000000069bf00 RCX: 00007f2ac1254469
RDX: 0000000000000000 RSI: 0000000000008982 RDI: 0000000000000003
RBP: 000000000069bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000069bf0c
R13: 00007ffeb53f178f R14: 00000000004668b0 R15: 0000000000000003

After commit 0397c6d85f ("mptcp: keep unaccepted MPC subflow into
join list"), the msk's workqueue and/or PM can touch the MPC
subflow - and acquire its socket lock - even if it's still unaccepted.

If the above event races with the relevant listener socket close, we
can end-up with the above splat.

This change addresses the issue delaying the MPC socket insertion
in conn_list at accept time - that is, partially reverting the
blamed commit.

We must additionally ensure that mptcp_pm_fully_established()
happens after accept() time, or the PM will not be able to
handle properly such event - conn_list could be empty otherwise.

In the receive path, we check the subflow list node to ensure
it is out of the listener queue. Be sure client subflows do
not match transiently such condition moving them into the join
list earlier at creation time.

Since we now have multiple mptcp_pm_fully_established() call sites
from different code-paths, said helper can now race with itself.
Use an additional PM status bit to avoid multiple notifications.

Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/103
Fixes: 0397c6d85f ("mptcp: keep unaccepted MPC subflow into join list"),
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 19:31:58 -08:00
Geliang Tang
432d9e74d8 mptcp: use the variable sk instead of open-coding
Since the local variable sk has been defined, use it instead of
open-coding.

Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 19:02:16 -08:00
Geliang Tang
13ad9f01a2 mptcp: rename add_addr_signal and mptcp_add_addr_status
Since the RM_ADDR signal had been reused with add_addr_signal, it's not
suitable to call it add_addr_signal or mptcp_add_addr_status. So this
patch renamed add_addr_signal to addr_signal, and renamed
mptcp_add_addr_status to mptcp_addr_signal_status.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 19:02:16 -08:00
Geliang Tang
42842a425a mptcp: drop rm_addr_signal flag
This patch reused add_addr_signal for the RM_ADDR announcing signal, by
defining a new ADD_ADDR status named MPTCP_RM_ADDR_SIGNAL. Then the flag
rm_addr_signal in PM could be dropped.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 19:02:15 -08:00
Geliang Tang
90a4aea8b6 mptcp: print out port and ahmac when receiving ADD_ADDR
This patch printed out more debugging information for the ADD_ADDR
suboption parsing on the incoming path.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 19:02:15 -08:00
Geliang Tang
0f5c9e3f07 mptcp: add port parameter for mptcp_pm_announce_addr
This patch added a new parameter 'port' for mptcp_pm_announce_addr. If
this parameter is true, we set the MPTCP_ADD_ADDR_PORT bit of the
add_addr_signal. That means the announced address is added with a port
number.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 19:02:15 -08:00
Geliang Tang
fbe0f87ac7 mptcp: send out dedicated packet for ADD_ADDR using port
The process is similar to that of the ADD_ADDR IPv6, this patch also sent
out a pure ack for the ADD_ADDR using port.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 19:02:15 -08:00
Geliang Tang
4a2777a834 mptcp: add the outgoing ADD_ADDR port support
This patch added a new add_addr_signal type named MPTCP_ADD_ADDR_PORT,
to identify it is an address with port to be added.

It also added a new parameter 'port' for both mptcp_add_addr_len and
mptcp_pm_add_addr_signal.

In mptcp_established_options_add_addr, we check whether the announced
address is added with port. If it is, we put this port number to
mptcp_out_options's port field.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 19:02:15 -08:00
Geliang Tang
2ec72faec8 mptcp: use adding up size to get ADD_ADDR length
This patch uses adding up size to get the ADD_ADDR suboption length rather
than returning the ADD_ADDR size constants.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 19:02:15 -08:00
Geliang Tang
22fb85ffae mptcp: add port support for ADD_ADDR suboption writing
In rfc8684, the length of ADD_ADDR suboption with IPv4 address and port
is 18 octets, but mptcp_write_options is 32-bit aligned, so we need to
pad it to 20 octets. All the other port related option lengths need to
be added up 2 octets similarly.

This patch added a new field 'port' in mptcp_out_options. When this
field is set with a port number, we need to add up 4 octets for the
ADD_ADDR suboption, and put the port number into the suboption.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 19:02:15 -08:00
Geliang Tang
e1ef683222 mptcp: unify ADD_ADDR and ADD_ADDR6 suboptions writing
The length of ADD_ADDR6 is 12 octets longer than ADD_ADDR. That's the
only difference between them.

This patch dropped the duplicate code between ADD_ADDR and ADD_ADDR6
suboptions writing, and unify them into one.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 19:02:15 -08:00
Geliang Tang
6eb3d1e350 mptcp: unify ADD_ADDR and echo suboptions writing
There are two differences between ADD_ADDR suboption and ADD_ADDR echo
suboption: The length of the former is 8 octets longer than the length
of the latter. The former's echo-flag is 0, and latter's echo-flag is 1.

This patch added two local variables, len and echo, to unify ADD_ADDR
and ADD_ADDR echo suboptions writing.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 19:02:15 -08:00
David S. Miller
b7e4ba9a91 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Switch to RCU in x_tables to fix possible NULL pointer dereference,
   from Subash Abhinov Kasiviswanathan.

2) Fix netlink dump of dynset timeouts later than 23 days.

3) Add comment for the indirect serialization of the nft commit mutex
   with rtnl_mutex.

4) Remove bogus check for confirmed conntrack when matching on the
   conntrack ID, from Brett Mastbergen.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 18:55:46 -08:00
Zheng Yongjun
a319aedde4 net: rxrpc: convert comma to semicolon
Replace a comma between expression statements by a semicolon.

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 16:23:07 -08:00
Neal Cardwell
299bcb55ec tcp: fix cwnd-limited bug for TSO deferral where we send nothing
When cwnd is not a multiple of the TSO skb size of N*MSS, we can get
into persistent scenarios where we have the following sequence:

(1) ACK for full-sized skb of N*MSS arrives
  -> tcp_write_xmit() transmit full-sized skb with N*MSS
  -> move pacing release time forward
  -> exit tcp_write_xmit() because pacing time is in the future

(2) TSQ callback or TCP internal pacing timer fires
  -> try to transmit next skb, but TSO deferral finds remainder of
     available cwnd is not big enough to trigger an immediate send
     now, so we defer sending until the next ACK.

(3) repeat...

So we can get into a case where we never mark ourselves as
cwnd-limited for many seconds at a time, even with
bulk/infinite-backlog senders, because:

o In case (1) above, every time in tcp_write_xmit() we have enough
cwnd to send a full-sized skb, we are not fully using the cwnd
(because cwnd is not a multiple of the TSO skb size). So every time we
send data, we are not cwnd limited, and so in the cwnd-limited
tracking code in tcp_cwnd_validate() we mark ourselves as not
cwnd-limited.

o In case (2) above, every time in tcp_write_xmit() that we try to
transmit the "remainder" of the cwnd but defer, we set the local
variable is_cwnd_limited to true, but we do not send any packets, so
sent_pkts is zero, so we don't call the cwnd-limited logic to update
tp->is_cwnd_limited.

Fixes: ca8a226343 ("tcp: make cwnd-limited checks measurement-based, and gentler")
Reported-by: Ingemar Johansson <ingemar.s.johansson@ericsson.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201209035759.1225145-1-ncardwell.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-09 16:15:54 -08:00
Chris Mi
5137d30365 net: flow_offload: Fix memory leak for indirect flow block
The offending commit introduces a cleanup callback that is invoked
when the driver module is removed to clean up the tunnel device
flow block. But it returns on the first iteration of the for loop.
The remaining indirect flow blocks will never be freed.

Fixes: 1fac52da59 ("net: flow_offload: consolidate indirect flow_block infrastructure")
CC: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
2020-12-09 16:08:33 -08:00
Wei Wang
8ef44b6fe4 tcp: Retain ECT bits for tos reflection
For DCTCP, we have to retain the ECT bits set by the congestion control
algorithm on the socket when reflecting syn TOS in syn-ack, in order to
make ECN work properly.

Fixes: ac8f1710c1 ("tcp: reflect tos value received in SYN to the socket")
Reported-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-09 16:08:23 -08:00
Michal Kubecek
a770bf5156 ethtool: fix stack overflow in ethnl_parse_bitset()
Syzbot reported a stack overflow in bitmap_from_arr32() called from
ethnl_parse_bitset() when bitset from netlink message is longer than
target bitmap length. While ethnl_compact_sanity_checks() makes sure that
trailing part is all zeros (i.e. the request does not try to touch bits
kernel does not recognize), we also need to cap change_bits to nbits so
that we don't try to write past the prepared bitmaps.

Fixes: 88db6d1e4f ("ethtool: add ethnl_parse_bitset() helper")
Reported-by: syzbot+9d39fa49d4df294aab93@syzkaller.appspotmail.com
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Link: https://lore.kernel.org/r/3487ee3a98e14cd526f55b6caaa959d2dcbcad9f.1607465316.git.mkubecek@suse.cz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-09 15:50:38 -08:00
Pablo Neira Ayuso
102e2c0723 net: sched: incorrect Kconfig dependencies on Netfilter modules
- NET_ACT_CONNMARK and NET_ACT_CTINFO only require conntrack support.
- NET_ACT_IPT only requires NETFILTER_XTABLES symbols, not
  IP_NF_IPTABLES. After this patch, NET_ACT_IPT becomes consistent
  with NET_EMATCH_IPT. NET_ACT_IPT dependency on IP_NF_IPTABLES predates
  Linux-2.6.12-rc2 (initial git repository build).

Fixes: 22a5dc0e5e ("net: sched: Introduce connmark action")
Fixes: 24ec483cec ("net: sched: Introduce act_ctinfo action")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Link: https://lore.kernel.org/r/20201208204707.11268-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-09 15:49:29 -08:00
Oliver Hartkopp
323a391a22 can: isotp: isotp_setsockopt(): block setsockopt on bound sockets
The isotp socket can be widely configured in its behaviour regarding addressing
types, fill-ups, receive pattern tests and link layer length. Usually all
these settings need to be fixed before bind() and can not be changed
afterwards.

This patch adds a check to enforce the common usage pattern.

Fixes: e057dd3fc2 ("can: add ISO 15765-2:2016 transport protocol")
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Tested-by: Thomas Wagner <thwa1@web.de>
Link: https://lore.kernel.org/r/20201203140604.25488-2-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/r/20201204133508.742120-3-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-09 08:44:15 -08:00
Toke Høiland-Jørgensen
998f172962 xdp: Remove the xdp_attachment_flags_ok() callback
Since commit 7f0a838254 ("bpf, xdp: Maintain info on attached XDP BPF
programs in net_device"), the XDP program attachment info is now maintained
in the core code. This interacts badly with the xdp_attachment_flags_ok()
check that prevents unloading an XDP program with different load flags than
it was loaded with. In practice, two kinds of failures are seen:

- An XDP program loaded without specifying a mode (and which then ends up
  in driver mode) cannot be unloaded if the program mode is specified on
  unload.

- The dev_xdp_uninstall() hook always calls the driver callback with the
  mode set to the type of the program but an empty flags argument, which
  means the flags_ok() check prevents the program from being removed,
  leading to bpf prog reference leaks.

The original reason this check was added was to avoid ambiguity when
multiple programs were loaded. With the way the checks are done in the core
now, this is quite simple to enforce in the core code, so let's add a check
there and get rid of the xdp_attachment_flags_ok() callback entirely.

Fixes: 7f0a838254 ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/160752225751.110217.10267659521308669050.stgit@toke.dk
2020-12-09 16:27:42 +01:00
Chuck Lever
5e54dafbe0 SUNRPC: Remove XDRBUF_SPARSE_PAGES flag in gss_proxy upcall
There's no need to defer allocation of pages for the receive buffer.

- This upcall is quite infrequent
- gssp_alloc_receive_pages() can allocate the pages with GFP_KERNEL,
  unlike the transport
- gssp_alloc_receive_pages() knows exactly how many pages are needed

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Olga Kornievskaia <kolga@netapp.com>
2020-12-09 09:38:34 -05:00
Roberto Bergantinos Corpas
4b5cff7ed8 sunrpc: clean-up cache downcall
We can simplify code around cache_downcall unifying memory
allocations using kvmalloc. This has the benefit of getting rid of
cache_slow_downcall (and queue_io_mutex), and also matches userland
allocation size and limits.

Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-12-09 09:38:34 -05:00
Brett Mastbergen
2d94b20b95 netfilter: nft_ct: Remove confirmation check for NFT_CT_ID
Since commit 656c8e9cc1 ("netfilter: conntrack: Use consistent ct id
hash calculation") the ct id will not change from initialization to
confirmation.  Removing the confirmation check allows for things like
adding an element to a 'typeof ct id' set in prerouting upon reception
of the first packet of a new connection, and then being able to
reference that set consistently both before and after the connection
is confirmed.

Fixes: 656c8e9cc1 ("netfilter: conntrack: Use consistent ct id hash calculation")
Signed-off-by: Brett Mastbergen <brett.mastbergen@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-09 10:31:58 +01:00
Florent Revest
b60da4955f bpf: Only provide bpf_sock_from_file with CONFIG_NET
This moves the bpf_sock_from_file definition into net/core/filter.c
which only gets compiled with CONFIG_NET and also moves the helper proto
usage next to other tracing helpers that are conditional on CONFIG_NET.

This avoids
  ld: kernel/trace/bpf_trace.o: in function `bpf_sock_from_file':
  bpf_trace.c:(.text+0xe23): undefined reference to `sock_from_file'
When compiling a kernel with BPF and without NET.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/bpf/20201208173623.1136863-1-revest@chromium.org
2020-12-08 18:23:36 -08:00
Eric Dumazet
72d05c00d7 tcp: select sane initial rcvq_space.space for big MSS
Before commit a337531b94 ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
small tcp_rmem[1] values were overridden by tcp_fixup_rcvbuf() to accommodate various MSS.

This is no longer the case, and Hazem Mohamed Abuelfotoh reported
that DRS would not work for MTU 9000 endpoints receiving regular (1500 bytes) frames.

Root cause is that tcp_init_buffer_space() uses tp->rcv_wnd for upper limit
of rcvq_space.space computation, while it can select later a smaller
value for tp->rcv_ssthresh and tp->window_clamp.

ss -temoi on receiver would show :

skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) rcv_space:62496 rcv_ssthresh:56596

This means that TCP can not increase its window in tcp_grow_window(),
and that DRS can never kick.

Fix this by making sure that rcvq_space.space is not bigger than number of bytes
that can be held in TCP receive queue.

People unable/unwilling to change their kernel can work around this issue by
selecting a bigger tcp_rmem[1] value as in :

echo "4096 196608 6291456" >/proc/sys/net/ipv4/tcp_rmem

Based on an initial report and patch from Hazem Mohamed Abuelfotoh
 https://lore.kernel.org/netdev/20201204180622.14285-1-abuehaze@amazon.com/

Fixes: a337531b94 ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
Fixes: 041a14d267 ("tcp: start receiver buffer autotuning sooner")
Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-08 16:27:48 -08:00
Zheng Yongjun
5e359044c1 net: openvswitch: conntrack: simplify the return expression of ovs_ct_limit_get_default_limit()
Simplify the return expression.

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Reviewed-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-08 16:22:54 -08:00
Zheng Yongjun
8daa76a52d net: core: devlink: simplify the return expression of devlink_nl_cmd_trap_set_doit()
Simplify the return expression.

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-08 16:22:54 -08:00
Zheng Yongjun
9faad250ce net: ipv6: rpl_iptunnel: simplify the return expression of rpl_do_srh()
Simplify the return expression.

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-08 16:22:54 -08:00
Zheng Yongjun
57b0637d00 net/sched: cls_u32: simplify the return expression of u32_reoffload_knode()
Simplify the return expression.

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-08 16:22:53 -08:00
Colin Ian King
8354bcbebd net: sched: fix spelling mistake in Kconfig "trys" -> "tries"
There is a spelling mistake in the Kconfig help text. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-08 16:01:56 -08:00
David S. Miller
e1be4b5990 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2020-12-07

Here's the main bluetooth-next pull request for the 5.11 kernel.

 - Updated Bluetooth entries in MAINTAINERS to include Luiz von Dentz
 - Added support for Realtek 8822CE and 8852A devices
 - Added support for MediaTek MT7615E device
 - Improved workarounds for fake CSR devices
 - Fix Bluetooth qualification test case L2CAP/COS/CFD/BV-14-C
 - Fixes for LL Privacy support
 - Enforce 16 byte encryption key size for FIPS security level
 - Added new mgmt commands for extended advertising support
 - Multiple other smaller fixes & improvements

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-08 15:58:49 -08:00
Julian Wiedmann
97f8841e04 net/af_iucv: use DECLARE_SOCKADDR to cast from sockaddr
This gets us compile-time size checking.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-08 15:56:53 -08:00
Cengiz Can
0398ba9e5a net: tipc: prevent possible null deref of link
`tipc_node_apply_property` does a null check on a `tipc_link_entry`
pointer but also accesses the same pointer out of the null check block.

This triggers a warning on Coverity Static Analyzer because we're
implying that `e->link` can BE null.

Move "Update MTU for node link entry" line into if block to make sure
that we're not in a state that `e->link` is null.

Signed-off-by: Cengiz Can <cengiz@kernel.wtf>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-08 15:53:41 -08:00
Pablo Neira Ayuso
42f1c27120 netfilter: nftables: comment indirect serialization of commit_mutex with rtnl_mutex
Add an explicit comment in the code to describe the indirect
serialization of the holders of the commit_mutex with the rtnl_mutex.
Commit 90d2723c6d ("netfilter: nf_tables: do not hold reference on
netdevice from preparation phase") already describes this, but a comment
in this case is better for reference.

Reported-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-08 21:53:48 +01:00
Pablo Neira Ayuso
917d80d376 netfilter: nft_dynset: fix timeouts later than 23 days
Use nf_msecs_to_jiffies64 and nf_jiffies64_to_msecs as provided by
8e1102d5a1 ("netfilter: nf_tables: support timeouts larger than 23
days"), otherwise ruleset listing breaks.

Fixes: a8b1e36d0d ("netfilter: nft_dynset: fix element timeout for HZ != 1000")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-08 20:42:11 +01:00
Rasmus Villemoes
bdc40a3f4b net: dsa: print the MTU value that could not be set
These warnings become somewhat more informative when they include the
MTU value that could not be set and not just the errno.

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Link: https://lore.kernel.org/r/20201205133944.10182-1-rasmus.villemoes@prevas.dk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-08 11:24:07 -08:00
Björn Töpel
3546b9b8ec xsk: Validate socket state in xsk_recvmsg, prior touching socket members
In AF_XDP the socket state needs to be checked, prior touching the
members of the socket. This was not the case for the recvmsg
implementation. Fix that by moving the xsk_is_bound() call.

Fixes: 45a8668184 ("xsk: Add support for recvmsg()")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20201207082008.132263-1-bjorn.topel@gmail.com
2020-12-08 17:11:58 +01:00
Subash Abhinov Kasiviswanathan
cc00bcaa58 netfilter: x_tables: Switch synchronization to RCU
When running concurrent iptables rules replacement with data, the per CPU
sequence count is checked after the assignment of the new information.
The sequence count is used to synchronize with the packet path without the
use of any explicit locking. If there are any packets in the packet path using
the table information, the sequence count is incremented to an odd value and
is incremented to an even after the packet process completion.

The new table value assignment is followed by a write memory barrier so every
CPU should see the latest value. If the packet path has started with the old
table information, the sequence counter will be odd and the iptables
replacement will wait till the sequence count is even prior to freeing the
old table info.

However, this assumes that the new table information assignment and the memory
barrier is actually executed prior to the counter check in the replacement
thread. If CPU decides to execute the assignment later as there is no user of
the table information prior to the sequence check, the packet path in another
CPU may use the old table information. The replacement thread would then free
the table information under it leading to a use after free in the packet
processing context-

Unable to handle kernel NULL pointer dereference at virtual
address 000000000000008e
pc : ip6t_do_table+0x5d0/0x89c
lr : ip6t_do_table+0x5b8/0x89c
ip6t_do_table+0x5d0/0x89c
ip6table_filter_hook+0x24/0x30
nf_hook_slow+0x84/0x120
ip6_input+0x74/0xe0
ip6_rcv_finish+0x7c/0x128
ipv6_rcv+0xac/0xe4
__netif_receive_skb+0x84/0x17c
process_backlog+0x15c/0x1b8
napi_poll+0x88/0x284
net_rx_action+0xbc/0x23c
__do_softirq+0x20c/0x48c

This could be fixed by forcing instruction order after the new table
information assignment or by switching to RCU for the synchronization.

Fixes: 80055dab5d ("netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore")
Reported-by: Sean Tranchetti <stranche@codeaurora.org>
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-08 12:57:39 +01:00
Jakub Kicinski
819f56bad1 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2020-12-07

1) Sysbot reported fixes for the new 64/32 bit compat layer.
   From Dmitry Safonov.

2) Fix a memory leak in xfrm_user_policy that was introduced
   by adding the 64/32 bit compat layer. From Yu Kuai.

* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  net: xfrm: fix memory leak in xfrm_user_policy()
  xfrm/compat: Don't allocate memory with __GFP_ZERO
  xfrm/compat: memset(0) 64-bit padding at right place
  xfrm/compat: Translate by copying XFRMA_UNSPEC attribute
====================

Link: https://lore.kernel.org/r/20201207093937.2874932-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-07 18:29:54 -08:00
Jianguo Wu
f55628b3e7 mptcp: print new line in mptcp_seq_show() if mptcp isn't in use
When do cat /proc/net/netstat, the output isn't append with a new line, it looks like this:
[root@localhost ~]# cat /proc/net/netstat
...
MPTcpExt: 0 0 0 0 0 0 0 0 0 0 0 0 0[root@localhost ~]#

This is because in mptcp_seq_show(), if mptcp isn't in use, net->mib.mptcp_statistics is NULL,
so it just puts all 0 after "MPTcpExt:", and return, forgot the '\n'.

After this patch:

[root@localhost ~]# cat /proc/net/netstat
...
MPTcpExt: 0 0 0 0 0 0 0 0 0 0 0 0 0
[root@localhost ~]#

Fixes: fc518953bc ("mptcp: add and use MIB counter infrastructure")
Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
Acked-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/142e2fd9-58d9-bb13-fb75-951cccc2331e@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-07 17:45:29 -08:00
Joseph Huang
851d0a73c9 bridge: Fix a deadlock when enabling multicast snooping
When enabling multicast snooping, bridge module deadlocks on multicast_lock
if 1) IPv6 is enabled, and 2) there is an existing querier on the same L2
network.

The deadlock was caused by the following sequence: While holding the lock,
br_multicast_open calls br_multicast_join_snoopers, which eventually causes
IP stack to (attempt to) send out a Listener Report (in igmp6_join_group).
Since the destination Ethernet address is a multicast address, br_dev_xmit
feeds the packet back to the bridge via br_multicast_rcv, which in turn
calls br_multicast_add_group, which then deadlocks on multicast_lock.

The fix is to move the call br_multicast_join_snoopers outside of the
critical section. This works since br_multicast_join_snoopers only deals
with IP and does not modify any multicast data structures of the bridge,
so there's no need to hold the lock.

Steps to reproduce:
1. sysctl net.ipv6.conf.all.force_mld_version=1
2. have another querier
3. ip link set dev bridge type bridge mcast_snooping 0 && \
   ip link set dev bridge type bridge mcast_snooping 1 < deadlock >

A typical call trace looks like the following:

[  936.251495]  _raw_spin_lock+0x5c/0x68
[  936.255221]  br_multicast_add_group+0x40/0x170 [bridge]
[  936.260491]  br_multicast_rcv+0x7ac/0xe30 [bridge]
[  936.265322]  br_dev_xmit+0x140/0x368 [bridge]
[  936.269689]  dev_hard_start_xmit+0x94/0x158
[  936.273876]  __dev_queue_xmit+0x5ac/0x7f8
[  936.277890]  dev_queue_xmit+0x10/0x18
[  936.281563]  neigh_resolve_output+0xec/0x198
[  936.285845]  ip6_finish_output2+0x240/0x710
[  936.290039]  __ip6_finish_output+0x130/0x170
[  936.294318]  ip6_output+0x6c/0x1c8
[  936.297731]  NF_HOOK.constprop.0+0xd8/0xe8
[  936.301834]  igmp6_send+0x358/0x558
[  936.305326]  igmp6_join_group.part.0+0x30/0xf0
[  936.309774]  igmp6_group_added+0xfc/0x110
[  936.313787]  __ipv6_dev_mc_inc+0x1a4/0x290
[  936.317885]  ipv6_dev_mc_inc+0x10/0x18
[  936.321677]  br_multicast_open+0xbc/0x110 [bridge]
[  936.326506]  br_multicast_toggle+0xec/0x140 [bridge]

Fixes: 4effd28c12 ("bridge: join all-snoopers multicast address")
Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/20201204235628.50653-1-Joseph.Huang@garmin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-07 17:14:43 -08:00
Cong Wang
e3366884b3 lwt_bpf: Replace preempt_disable() with migrate_disable()
migrate_disable() is just a wrapper for preempt_disable() in
non-RT kernel. It is safe to replace it, and RT kernel will
benefit.

Note that it is introduced since Feb 2020.

Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201205075946.497763-2-xiyou.wangcong@gmail.com
2020-12-07 11:53:40 -08:00
Dongdong Wang
d9054a1ff5 lwt: Disable BH too in run_lwt_bpf()
The per-cpu bpf_redirect_info is shared among all skb_do_redirect()
and BPF redirect helpers. Callers on RX path are all in BH context,
disabling preemption is not sufficient to prevent BH interruption.

In production, we observed strange packet drops because of the race
condition between LWT xmit and TC ingress, and we verified this issue
is fixed after we disable BH.

Although this bug was technically introduced from the beginning, that
is commit 3a0af8fd61 ("bpf: BPF for lightweight tunnel infrastructure"),
at that time call_rcu() had to be call_rcu_bh() to match the RCU context.
So this patch may not work well before RCU flavor consolidation has been
completed around v5.0.

Update the comments above the code too, as call_rcu() is now BH friendly.

Signed-off-by: Dongdong Wang <wangdongdong.6@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/bpf/20201205075946.497763-1-xiyou.wangcong@gmail.com
2020-12-07 11:53:39 -08:00
Marcel Holtmann
e6ed8b78ea Bluetooth: Increment management interface revision
Increment the mgmt revision due to the recently added new commands.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-12-07 17:02:00 +02:00
Abhishek Pandit-Subedi
dce0a4be80 Bluetooth: Set missing suspend task bits
When suspending, mark SUSPEND_SCAN_ENABLE and SUSPEND_SCAN_DISABLE tasks
correctly when either classic or le scanning is modified.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Howard Chung <howardchung@google.com>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-12-07 17:01:46 +02:00
Daniel Winkler
4d9b952857 Bluetooth: Change MGMT security info CMD to be more generic
For advertising, we wish to know the LE tx power capabilities of the
controller in userspace, so this patch edits the Security Info MGMT
command to be more generic, such that other various controller
capabilities can be included in the EIR data. This change also includes
the LE min and max tx power into this newly-named command.

The change was tested by manually verifying that the MGMT command
returns the tx power range as expected in userspace.

Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Daniel Winkler <danielwinkler@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-12-07 17:01:42 +02:00
Daniel Winkler
7c395ea521 Bluetooth: Query LE tx power on startup
Queries tx power via HCI_LE_Read_Transmit_Power command when the hci
device is initialized, and stores resulting min/max LE power in hdev
struct. If command isn't available (< BT5 support), min/max values
both default to HCI_TX_POWER_INVALID.

This patch is manually verified by ensuring BT5 devices correctly query
and receive controller tx power range.

Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Daniel Winkler <danielwinkler@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-12-07 17:01:38 +02:00
Daniel Winkler
9bf9f4b630 Bluetooth: Use intervals and tx power from mgmt cmds
This patch takes the min/max intervals and tx power optionally provided
in mgmt interface, stores them in the advertisement struct, and uses
them when configuring the hci requests. While tx power is not used if
extended advertising is unavailable, software rotation will use the min
and max advertising intervals specified by the client.

This change is validated manually by ensuring the min/max intervals are
propagated to the controller on both hatch (extended advertising) and
kukui (no extended advertising) chromebooks, and that tx power is
propagated correctly on hatch. These tests are performed with multiple
advertisements simultaneously.

Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Daniel Winkler <danielwinkler@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-12-07 17:01:33 +02:00
Daniel Winkler
1241057283 Bluetooth: Break add adv into two mgmt commands
This patch adds support for the new advertising add interface, with the
first command setting advertising parameters and the second to set
advertising data. The set parameters command allows the caller to leave
some fields "unset", with a params bitfield defining which params were
purposefully set. Unset parameters will be given defaults when calling
hci_add_adv_instance. The data passed to the param mgmt command is
allowed to be flexible, so in the future if bluetoothd passes a larger
structure with new params, the mgmt command will ignore the unknown
members at the end.

This change has been validated on both hatch (extended advertising) and
kukui (no extended advertising) chromebooks running bluetoothd that
support this new interface. I ran the following manual tests:
- Set several (3) advertisements using modified test_advertisement.py
- For each, validate correct data and parameters in btmon trace
- Verified both for software rotation and extended adv

Automatic test suite also run, testing many (25) scenarios of single and
multi-advertising for data/parameter correctness.

Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Daniel Winkler <danielwinkler@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-12-07 17:01:28 +02:00
Daniel Winkler
31aab5c22e Bluetooth: Add helper to set adv data
We wish to handle advertising data separately from advertising
parameters in our new MGMT requests. This change adds a helper that
allows the advertising data and scan response to be updated for an
existing advertising instance.

Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Daniel Winkler <danielwinkler@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-12-07 17:01:25 +02:00
Howard Chung
80af16a3e4 Bluetooth: Add toggle to switch off interleave scan
This patch add a configurable parameter to switch off the interleave
scan feature.

Signed-off-by: Howard Chung <howardchung@google.com>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-12-07 17:01:00 +02:00
Howard Chung
3bc615fa93 Bluetooth: Refactor read default sys config for various types
Refactor read default system configuration function so that it's capable
of returning different types than u16

Signed-off-by: Howard Chung <howardchung@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-12-07 17:00:56 +02:00
Howard Chung
422bb17f8a Bluetooth: Handle active scan case
This patch adds code to handle the active scan during interleave
scan. The interleave scan will be canceled when users start active scan,
and it will be restarted after active scan stopped.

Signed-off-by: Howard Chung <howardchung@google.com>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Reviewed-by: Manish Mandlik <mmandlik@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-12-07 17:00:52 +02:00
Howard Chung
36afe87ac1 Bluetooth: Handle system suspend resume case
This patch adds code to handle the system suspension during interleave
scan. The interleave scan will be canceled when the system is going to
sleep, and will be restarted after waking up.

Signed-off-by: Howard Chung <howardchung@google.com>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Reviewed-by: Manish Mandlik <mmandlik@chromium.org>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-12-07 17:00:47 +02:00
Howard Chung
c4f1f40816 Bluetooth: Interleave with allowlist scan
This patch implements the interleaving between allowlist scan and
no-filter scan. It'll be used to save power when at least one monitor is
registered and at least one pending connection or one device to be
scanned for.

The durations of the allowlist scan and the no-filter scan are
controlled by MGMT command: Set Default System Configuration. The
default values are set randomly for now.

Signed-off-by: Howard Chung <howardchung@google.com>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Reviewed-by: Manish Mandlik <mmandlik@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-12-07 17:00:43 +02:00
Edward Vear
a31489d2a3 Bluetooth: Fix attempting to set RPA timeout when unsupported
During controller initialization, an LE Set RPA Timeout command is sent
to the controller if supported. However, the value checked to determine
if the command is supported is incorrect. Page 1921 of the Bluetooth
Core Spec v5.2 shows that bit 2 of octet 35 of the Supported_Commands
field corresponds to the LE Set RPA Timeout command, but currently
bit 6 of octet 35 is checked. This patch checks the correct value
instead.

This issue led to the error seen in the following btmon output during
initialization of an adapter (rtl8761b) and prevented initialization
from completing.

< HCI Command: LE Set Resolvable Private Address Timeout (0x08|0x002e) plen 2
        Timeout: 900 seconds
> HCI Event: Command Complete (0x0e) plen 4
      LE Set Resolvable Private Address Timeout (0x08|0x002e) ncmd 2
        Status: Unsupported Remote Feature / Unsupported LMP Feature (0x1a)
= Close Index: 00:E0:4C:6B:E5:03

The error did not appear when running with this patch.

Signed-off-by: Edward Vear <edwardvear@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-12-07 17:00:38 +02:00
Luiz Augusto von Dentz
aeeae47d34 Bluetooth: Rename get_adv_instance_scan_rsp
This renames get_adv_instance_scan_rsp to adv_instance_is_scannable and
make it return a bool since it was not actually properly return the size
of the scan response as one could expect.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-12-07 17:00:33 +02:00
Luiz Augusto von Dentz
a76a0d3650 Bluetooth: Fix not sending Set Extended Scan Response
Current code is actually failing on the following tests of mgmt-tester
because get_adv_instance_scan_rsp_len did not account for flags that
cause scan response data to be included resulting in non-scannable
instance when in fact it should be scannable.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-12-07 17:00:29 +02:00
Jimmy Wahlberg
5b8ec15d02 Bluetooth: Fix for Bluetooth SIG test L2CAP/COS/CFD/BV-14-C
This test case is meant to verify that multiple
unknown options is included in the response.

BLUETOOTH CORE SPECIFICATION Version 5.2 | Vol 3, Part A
page 1057

'On an unknown option failure (Result=0x0003),
the option(s) that contain anoption type field that is not
understood by the recipient of the L2CAP_CONFIGURATION_REQ
packet shall be included in the L2CAP_CONFIGURATION_RSP
packet unless they are hints.'

Before this patch:

> ACL Data RX: Handle 11 flags 0x02 dlen 24
      L2CAP: Configure Request (0x04) ident 18 len 16
        Destination CID: 64
        Flags: 0x0000
        Option: Unknown (0x10) [mandatory]
        10 00 11 02 11 00 12 02 12 00
< ACL Data TX: Handle 11 flags 0x00 dlen 17
      L2CAP: Configure Response (0x05) ident 18 len 9
        Source CID: 64
        Flags: 0x0000
        Result: Failure - unknown options (0x0003)
        Option: Unknown (0x10) [mandatory]
        12

After this patch:

> ACL Data RX: Handle 11 flags 0x02 dlen 24
      L2CAP: Configure Request (0x04) ident 5 len 16
        Destination CID: 64
        Flags: 0x0000
        Option: Unknown (0x10) [mandatory]
        10 00 11 02 11 00 12 02 12 00
< ACL Data TX: Handle 11 flags 0x00 dlen 23
      L2CAP: Configure Response (0x05) ident 5 len 15
        Source CID: 64
        Flags: 0x0000
        Result: Failure - unknown options (0x0003)
        Option: Unknown (0x10) [mandatory]
        10 11 01 11 12 01 12

Signed-off-by: Jimmy Wahlberg <jimmywa@spotify.com>
Reviewed-by: Luiz Augusto Von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-12-07 17:00:23 +02:00
Wei Yongjun
f6b8c6b554 Bluetooth: sco: Fix crash when using BT_SNDMTU/BT_RCVMTU option
This commit add the invalid check for connected socket, without it will
causes the following crash due to sco_pi(sk)->conn being NULL:

KASAN: null-ptr-deref in range [0x0000000000000050-0x0000000000000057]
CPU: 3 PID: 4284 Comm: test_sco Not tainted 5.10.0-rc3+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
RIP: 0010:sco_sock_getsockopt+0x45d/0x8e0
Code: 48 c1 ea 03 80 3c 02 00 0f 85 ca 03 00 00 49 8b 9d f8 04 00 00 48 b8 00
      00 00 00 00 fc ff df 48 8d 7b 50 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84
      c0 74 08 3c 03 0f 8e b5 03 00 00 8b 43 50 48 8b 0c
RSP: 0018:ffff88801bb17d88 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff83a4ecdf
RDX: 000000000000000a RSI: ffffc90002fce000 RDI: 0000000000000050
RBP: 1ffff11003762fb4 R08: 0000000000000001 R09: ffff88810e1008c0
R10: ffffffffbd695dcf R11: fffffbfff7ad2bb9 R12: 0000000000000000
R13: ffff888018ff1000 R14: dffffc0000000000 R15: 000000000000000d
FS:  00007fb4f76c1700(0000) GS:ffff88811af80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005555e3b7a938 CR3: 00000001117be001 CR4: 0000000000770ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 ? sco_skb_put_cmsg+0x80/0x80
 ? sco_skb_put_cmsg+0x80/0x80
 __sys_getsockopt+0x12a/0x220
 ? __ia32_sys_setsockopt+0x150/0x150
 ? syscall_enter_from_user_mode+0x18/0x50
 ? rcu_read_lock_bh_held+0xb0/0xb0
 __x64_sys_getsockopt+0xba/0x150
 ? syscall_enter_from_user_mode+0x1d/0x50
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 0fc1a726f8 ("Bluetooth: sco: new getsockopt options BT_SNDMTU/BT_RCVMTU")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Luiz Augusto Von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-12-07 17:00:15 +02:00
Reo Shiseki
353021588c Bluetooth: fix typo in struct name
Signed-off-by: Reo Shiseki <reoshiseki@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-12-07 16:51:22 +02:00
Xin Long
10c678bd0a udp: fix the proto value passed to ip_protocol_deliver_rcu for the segments
Guillaume noticed that: for segments udp_queue_rcv_one_skb() returns the
proto, and it should pass "ret" unmodified to ip_protocol_deliver_rcu().
Otherwize, with a negtive value passed, it will underflow inet_protos.

This can be reproduced with IPIP FOU:

  # ip fou add port 5555 ipproto 4
  # ethtool -K eth1 rx-gro-list on

Fixes: cf329aa42b ("udp: cope with UDP GRO packet misdirection")
Reported-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-07 00:32:11 -08:00
Jakub Kicinski
78d6bb584d This cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
 
  - update include for min/max helpers, by Sven Eckelmann
 
  - add infrastructure and netlink functions for routing algo selection,
    by Sven Eckelmann (2 patches)
 
  - drop deprecated debugfs and sysfs support and obsoleted
    functionality, by Sven Eckelmann (3 patches)
 
  - drop unused include in fragmentation.c, by Simon Wunderlich
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl/KWIkWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoTrTEACUImdOCWv4+NnEQfChQv6Y3i18
 gJABXoOkWLfFpGBUlw/uYzFKpMEWZ0orHig9gucC+rmjNc8veWwAOugJoTPTKQJZ
 /4yndhM0x39vWex03rdDmyqzCEh1V1Q9VcdEuD6XbJDaK5F4jDu3NQVneOijIkN+
 5PzhlvtUlfe8csykOCOoC9Y5wy82fEhcEvuSq+Z6dU3Cb3EGHtEUtZ4orDkpnnml
 7XEcn5C5+OFGlz/ikiszKumTtNK+dmGluOxoyfAzEjQHK7PoTorcXFS2YUoSWeqQ
 gmYZ56RBqEHjo4eqcaEgcqq5v8cTPCEMCB8UQjAffxrhloRKHRhQOysG1+OnzGA8
 IQ2ARHLQCVPVraXF2ixE0D3BvjKmtMmcvZOCXwhCHDajn9jFKAh0+hnInDyv6Fp1
 7eUfpHACL9EQDxKWXeQg37X2mk3hHJ+4zgZOYidahVeKbiiexe2heaHTYAbr9rIf
 8hvtlgMg4AnwL3IxadrKwsbJ5t7TEPLTInf47hPvpRg7SmthDTcgso4VDwmWgB8W
 Tlug8/NoXXDCmDhXUpvyi9+idHe0J8xvHl/2xGC7aSsPAbuhqOuefMKr36YhXJTY
 vBA5Ih5ppylJ8Dzwa0TbonvQbOAinA8YTa6izKxY4e+xTPB2jz/WT2vciEZv2+ig
 vNIPFrLZ7OFMohCDfQ==
 =tNKF
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-pullrequest-20201204' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This cleanup patchset includes the following patches:

 - bump version strings, by Simon Wunderlich

 - update include for min/max helpers, by Sven Eckelmann

 - add infrastructure and netlink functions for routing algo selection,
   by Sven Eckelmann (2 patches)

 - drop deprecated debugfs and sysfs support and obsoleted
   functionality, by Sven Eckelmann (3 patches)

 - drop unused include in fragmentation.c, by Simon Wunderlich

* tag 'batadv-next-pullrequest-20201204' of git://git.open-mesh.org/linux-merge:
  batman-adv: Drop unused soft-interface.h include in fragmentation.c
  batman-adv: Drop legacy code for auto deleting mesh interfaces
  batman-adv: Drop deprecated debugfs support
  batman-adv: Drop deprecated sysfs support
  batman-adv: Allow selection of routing algorithm over rtnetlink
  batman-adv: Prepare infrastructure for newlink settings
  batman-adv: Add new include for min/max helpers
  batman-adv: Start new development cycle
====================

Link: https://lore.kernel.org/r/20201204154631.21063-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-05 15:08:06 -08:00
Bongsu Jeon
bcd684aace net/nfc/nci: Support NCI 2.x initial sequence
implement the NCI 2.x initial sequence to support NCI 2.x NFCC.
Since NCI 2.0, CORE_RESET and CORE_INIT sequence have been changed.
If NFCEE supports NCI 2.x, then NCI 2.x initial sequence will work.

In NCI 1.0, Initial sequence and payloads are as below:
(DH)                     (NFCC)
 |  -- CORE_RESET_CMD --> |
 |  <-- CORE_RESET_RSP -- |
 |  -- CORE_INIT_CMD -->  |
 |  <-- CORE_INIT_RSP --  |
 CORE_RESET_RSP payloads are Status, NCI version, Configuration Status.
 CORE_INIT_CMD payloads are empty.
 CORE_INIT_RSP payloads are Status, NFCC Features,
    Number of Supported RF Interfaces, Supported RF Interface,
    Max Logical Connections, Max Routing table Size,
    Max Control Packet Payload Size, Max Size for Large Parameters,
    Manufacturer ID, Manufacturer Specific Information.

In NCI 2.0, Initial Sequence and Parameters are as below:
(DH)                     (NFCC)
 |  -- CORE_RESET_CMD --> |
 |  <-- CORE_RESET_RSP -- |
 |  <-- CORE_RESET_NTF -- |
 |  -- CORE_INIT_CMD -->  |
 |  <-- CORE_INIT_RSP --  |
 CORE_RESET_RSP payloads are Status.
 CORE_RESET_NTF payloads are Reset Trigger,
    Configuration Status, NCI Version, Manufacturer ID,
    Manufacturer Specific Information Length,
    Manufacturer Specific Information.
 CORE_INIT_CMD payloads are Feature1, Feature2.
 CORE_INIT_RSP payloads are Status, NFCC Features,
    Max Logical Connections, Max Routing Table Size,
    Max Control Packet Payload Size,
    Max Data Packet Payload Size of the Static HCI Connection,
    Number of Credits of the Static HCI Connection,
    Max NFC-V RF Frame Size, Number of Supported RF Interfaces,
    Supported RF Interfaces.

Signed-off-by: Bongsu Jeon <bongsu.jeon@samsung.com>
Link: https://lore.kernel.org/r/20201202223147.3472-1-bongsu.jeon@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 17:47:35 -08:00
Hoang Le
43fcd906d9 tipc: support 128bit node identity for peer removing
We add the support to remove a specific node down with 128bit
node identifier, as an alternative to legacy 32-bit node address.

example:
$tipc peer remove identiy <1001002|16777777>

Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Link: https://lore.kernel.org/r/20201203035045.4564-1-hoang.h.le@dektech.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 17:40:27 -08:00
Eric Dumazet
905b2032fa mac80211: mesh: fix mesh_pathtbl_init() error path
If tbl_mpp can not be allocated, we call mesh_table_free(tbl_path)
while tbl_path rhashtable has not yet been initialized, which causes
panics.

Simply factorize the rhashtable_init() call into mesh_table_alloc()

WARNING: CPU: 1 PID: 8474 at kernel/workqueue.c:3040 __flush_work kernel/workqueue.c:3040 [inline]
WARNING: CPU: 1 PID: 8474 at kernel/workqueue.c:3040 __cancel_work_timer+0x514/0x540 kernel/workqueue.c:3136
Modules linked in:
CPU: 1 PID: 8474 Comm: syz-executor663 Not tainted 5.10.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__flush_work kernel/workqueue.c:3040 [inline]
RIP: 0010:__cancel_work_timer+0x514/0x540 kernel/workqueue.c:3136
Code: 5d c3 e8 bf ae 29 00 0f 0b e9 f0 fd ff ff e8 b3 ae 29 00 0f 0b 43 80 3c 3e 00 0f 85 31 ff ff ff e9 34 ff ff ff e8 9c ae 29 00 <0f> 0b e9 dc fe ff ff 89 e9 80 e1 07 80 c1 03 38 c1 0f 8c 7d fd ff
RSP: 0018:ffffc9000165f5a0 EFLAGS: 00010293
RAX: ffffffff814b7064 RBX: 0000000000000001 RCX: ffff888021c80000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff888024039ca0 R08: dffffc0000000000 R09: fffffbfff1dd3e64
R10: fffffbfff1dd3e64 R11: 0000000000000000 R12: 1ffff920002cbebd
R13: ffff888024039c88 R14: 1ffff11004807391 R15: dffffc0000000000
FS:  0000000001347880(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000140 CR3: 000000002cc0a000 CR4: 00000000001506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 rhashtable_free_and_destroy+0x25/0x9c0 lib/rhashtable.c:1137
 mesh_table_free net/mac80211/mesh_pathtbl.c:69 [inline]
 mesh_pathtbl_init+0x287/0x2e0 net/mac80211/mesh_pathtbl.c:785
 ieee80211_mesh_init_sdata+0x2ee/0x530 net/mac80211/mesh.c:1591
 ieee80211_setup_sdata+0x733/0xc40 net/mac80211/iface.c:1569
 ieee80211_if_add+0xd5c/0x1cd0 net/mac80211/iface.c:1987
 ieee80211_add_iface+0x59/0x130 net/mac80211/cfg.c:125
 rdev_add_virtual_intf net/wireless/rdev-ops.h:45 [inline]
 nl80211_new_interface+0x563/0xb40 net/wireless/nl80211.c:3855
 genl_family_rcv_msg_doit net/netlink/genetlink.c:739 [inline]
 genl_family_rcv_msg net/netlink/genetlink.c:783 [inline]
 genl_rcv_msg+0xe4e/0x1280 net/netlink/genetlink.c:800
 netlink_rcv_skb+0x190/0x3a0 net/netlink/af_netlink.c:2494
 genl_rcv+0x24/0x40 net/netlink/genetlink.c:811
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x780/0x930 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0x9a8/0xd40 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:651 [inline]
 sock_sendmsg net/socket.c:671 [inline]
 ____sys_sendmsg+0x519/0x800 net/socket.c:2353
 ___sys_sendmsg net/socket.c:2407 [inline]
 __sys_sendmsg+0x2b1/0x360 net/socket.c:2440
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 60854fd945 ("mac80211: mesh: convert path table to rhashtable")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Link: https://lore.kernel.org/r/20201204162428.2583119-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 17:34:25 -08:00
Wang Hai
bb2da7651a openvswitch: fix error return code in validate_and_copy_dec_ttl()
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Changing 'return start' to 'return action_start' can fix this bug.

Fixes: 69929d4c49 ("net: openvswitch: fix TTL decrement action netlink message format")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Reviewed-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/20201204114314.1596-1-wanghai38@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 15:43:14 -08:00
Zhang Changzhong
ee4f52a8de net: bridge: vlan: fix error return code in __vlan_add()
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: f8ed289fab ("bridge: vlan: use br_vlan_(get|put)_master to deal with refcounts")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Link: https://lore.kernel.org/r/1607071737-33875-1-git-send-email-zhangchangzhong@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 15:41:06 -08:00
Zhang Changzhong
b410f04eb5 ipv4: fix error return code in rtm_to_fib_config()
Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: d15662682d ("ipv4: Allow ipv6 gateway with ipv4 routes")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/1607071695-33740-1-git-send-email-zhangchangzhong@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 15:38:16 -08:00
Davide Caratti
4eef8b1f36 net/sched: fq_pie: initialize timer earlier in fq_pie_init()
with the following tdc testcase:

 83be: (qdisc, fq_pie) Create FQ-PIE with invalid number of flows

as fq_pie_init() fails, fq_pie_destroy() is called to clean up. Since the
timer is not yet initialized, it's possible to observe a splat like this:

  INFO: trying to register non-static key.
  the code is fine but needs lockdep annotation.
  turning off the locking correctness validator.
  CPU: 0 PID: 975 Comm: tc Not tainted 5.10.0-rc4+ #298
  Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
  Call Trace:
   dump_stack+0x99/0xcb
   register_lock_class+0x12dd/0x1750
   __lock_acquire+0xfe/0x3970
   lock_acquire+0x1c8/0x7f0
   del_timer_sync+0x49/0xd0
   fq_pie_destroy+0x3f/0x80 [sch_fq_pie]
   qdisc_create+0x916/0x1160
   tc_modify_qdisc+0x3c4/0x1630
   rtnetlink_rcv_msg+0x346/0x8e0
   netlink_unicast+0x439/0x630
   netlink_sendmsg+0x719/0xbf0
   sock_sendmsg+0xe2/0x110
   ____sys_sendmsg+0x5ba/0x890
   ___sys_sendmsg+0xe9/0x160
   __sys_sendmsg+0xd3/0x170
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [...]
  ODEBUG: assert_init not available (active state 0) object type: timer_list hint: 0x0
  WARNING: CPU: 0 PID: 975 at lib/debugobjects.c:508 debug_print_object+0x162/0x210
  [...]
  Call Trace:
   debug_object_assert_init+0x268/0x380
   try_to_del_timer_sync+0x6a/0x100
   del_timer_sync+0x9e/0xd0
   fq_pie_destroy+0x3f/0x80 [sch_fq_pie]
   qdisc_create+0x916/0x1160
   tc_modify_qdisc+0x3c4/0x1630
   rtnetlink_rcv_msg+0x346/0x8e0
   netlink_rcv_skb+0x120/0x380
   netlink_unicast+0x439/0x630
   netlink_sendmsg+0x719/0xbf0
   sock_sendmsg+0xe2/0x110
   ____sys_sendmsg+0x5ba/0x890
   ___sys_sendmsg+0xe9/0x160
   __sys_sendmsg+0xd3/0x170
   do_syscall_64+0x33/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

fix it moving timer_setup() before any failure, like it was done on 'red'
with former commit 608b4adab1 ("net_sched: initialize timer earlier in
red_init()").

Fixes: ec97ecf1eb ("net: sched: add Flow Queue PIE packet scheduler")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/r/2e78e01c504c633ebdff18d041833cf2e079a3a4.1607020450.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 14:15:01 -08:00
Arjun Roy
94ab9eb9b2 net-zerocopy: Defer vm zap unless actually needed.
Zapping pages is required only if we are calling vm_insert_page into a
region where pages had previously been mapped. Receive zerocopy allows
reusing such regions, and hitherto called zap_page_range() before
calling vm_insert_page() in that range.

zap_page_range() can also be triggered from userspace with
madvise(MADV_DONTNEED). If userspace is configured to call this before
reusing a segment, or if there was nothing mapped at this virtual
address to begin with, we can avoid calling zap_page_range() under the
socket lock. That said, if userspace does not do that, then we are
still responsible for calling zap_page_range().

This patch adds a flag that the user can use to hint to the kernel
that a zap is not required. If the flag is not set, or if an older
user application does not have a flags field at all, then the kernel
calls zap_page_range as before. Also, if the flag is set but a zap is
still required, the kernel performs that zap as necessary. Thus
incorrectly indicating that a zap can be avoided does not change the
correctness of operation. It also increases the batchsize for
vm_insert_pages and prefetches the page struct for the batch since
we're about to bump the refcount.

An alternative mechanism could be to not have a flag, assume by
default a zap is not needed, and fall back to zapping if needed.
However, this would harm performance for older applications for which
a zap is necessary, and thus we implement it with an explicit flag
so newer applications can opt in.

When using RPC-style traffic with medium sized (tens of KB) RPCs, this
change yields an efficency improvement of about 30% for QPS/CPU usage.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 13:40:53 -08:00
Arjun Roy
0c3936d32f net-zerocopy: Set zerocopy hint when data is copied
Set zerocopy hint, event when falling back to copy, so that the
pending data can be efficiently received using zerocopy when
possible.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 13:40:53 -08:00
Arjun Roy
f21a3c4803 net-zerocopy: Introduce short-circuit small reads.
Sometimes, we may call tcp receive zerocopy when inq is 0,
or inq < PAGE_SIZE, or inq is generally small enough that
it is cheaper to copy rather than remap pages.

In these cases, we may want to either return early (inq=0) or
attempt to use the provided copy buffer to simply copy
the received data.

This allows us to save both system call overhead and
the latency of acquiring mmap_sem in read mode for cases where
it would be useless to do so.

This patchset enables this behaviour by:
1. Returning quickly if inq is 0.
2. Attempting to perform a regular copy if a hybrid copybuffer is
   provided and it is large enough to absorb all available bytes.
3. Return quickly if no such buffer was provided and there are less
   than PAGE_SIZE bytes available.

For small RPC ping-pong workloads, normally we would have
1 getsockopt(), 1 recvmsg() and 1 sendmsg() call per RPC. With this
change, we remove the recvmsg() call entirely, reducing the syscall
overhead by about 33%. In testing with small (hundreds of bytes)
RPC traffic, this yields a syscall reduction of about 33% and
an efficiency gain of about 3-5% when defined as QPS/CPU Util.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 13:40:53 -08:00
Arjun Roy
936ced4157 net-zerocopy: Fast return if inq < PAGE_SIZE
Sometimes, we may call tcp receive zerocopy when inq is 0,
or inq < PAGE_SIZE, in which case we cannot remap pages. In this case,
simply return the appropriate hint for regular copying without taking
mmap_sem.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 13:40:53 -08:00
Arjun Roy
98917cf0d6 net-zerocopy: Refactor frag-is-remappable test.
Refactor frag-is-remappable test for tcp receive zerocopy. This is
part of a patch set that introduces short-circuited hybrid copies
for small receive operations, which results in roughly 33% fewer
syscalls for small RPC scenarios.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 13:40:52 -08:00
Arjun Roy
7fba5309ef net-zerocopy: Refactor skb frag fast-forward op.
Refactor skb frag fast-forwarding for tcp receive zerocopy. This is
part of a patch set that introduces short-circuited hybrid copies
for small receive operations, which results in roughly 33% fewer
syscalls for small RPC scenarios.

skb_advance_to_frag(), given a skb and an offset into the skb,
iterates from the first frag for the skb until we're at the frag
specified by the offset. Assuming the offset provided refers to how
many bytes in the skb are already read, the returned frag points to
the next frag we may read from, while offset_frag is set to the number
of bytes from this frag that we have already read.

If frag is not null and offset_frag is equal to 0, then we may be able
to map this frag's page into the process address space with
vm_insert_page(). However, if offset_frag is not equal to 0, then we
cannot do so.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 13:40:52 -08:00
Arjun Roy
2cd8116184 net-tcp: Introduce tcp_recvmsg_locked().
Refactor tcp_recvmsg() by splitting it into locked and unlocked
portions. Callers already holding the socket lock and not using
ERRQUEUE/cmsg/busy polling can simply call tcp_recvmsg_locked().
This is in preparation for a short-circuit copy performed by
TCP receive zerocopy for small (< PAGE_SIZE, or otherwise requested
by the user) reads.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 13:40:52 -08:00
Arjun Roy
18fb76ed53 net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.
When TCP receive zerocopy does not successfully map the entire
requested space, it outputs a 'hint' that the caller should recvmsg().

Augment zerocopy to accept a user buffer that it tries to copy this
hint into - if it is possible to copy the entire hint, it will do so.
This elides a recvmsg() call for received traffic that isn't exactly
page-aligned in size.

This was tested with RPC-style traffic of arbitrary sizes. Normally,
each received message required at least one getsockopt() call, and one
recvmsg() call for the remaining unaligned data.

With this change, almost all of the recvmsg() calls are eliminated,
leading to a savings of about 25%-50% in number of system calls
for RPC-style workloads.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 13:40:52 -08:00
Florent Revest
a50a85e40c bpf: Expose bpf_sk_storage_* to iterator programs
Iterators are currently used to expose kernel information to userspace
over fast procfs-like files but iterators could also be used to
manipulate local storage. For example, the task_file iterator could be
used to initialize a socket local storage with associations between
processes and sockets or to selectively delete local storage values.

Signed-off-by: Florent Revest <revest@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20201204113609.1850150-3-revest@google.com
2020-12-04 22:32:40 +01:00
Florent Revest
dba4a9256b net: Remove the err argument from sock_from_file
Currently, the sock_from_file prototype takes an "err" pointer that is
either not set or set to -ENOTSOCK IFF the returned socket is NULL. This
makes the error redundant and it is ignored by a few callers.

This patch simplifies the API by letting callers deduce the error based
on whether the returned socket is NULL or not.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Florent Revest <revest@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: KP Singh <kpsingh@google.com>
Link: https://lore.kernel.org/bpf/20201204113609.1850150-1-revest@google.com
2020-12-04 22:32:40 +01:00
Andrea Mayer
20a081b798 seg6: add VRF support for SRv6 End.DT6 behavior
SRv6 End.DT6 is defined in the SRv6 Network Programming [1].

The Linux kernel already offers an implementation of the SRv6
End.DT6 behavior which permits IPv6 L3 VPNs over SRv6 networks. This
implementation is not particularly suitable in contexts where we need to
deploy IPv6 L3 VPNs among different tenants which share the same network
address schemes. The underlying problem lies in the fact that the
current version of DT6 (called legacy DT6 from now on) needs a complex
configuration to be applied on routers which requires ad-hoc routes and
routing policy rules to ensure the correct isolation of tenants.

Consequently, a new implementation of DT6 has been introduced with the
aim of simplifying the construction of IPv6 L3 VPN services in the
multi-tenant environment using SRv6 networks. To accomplish this task,
we reused the same VRF infrastructure and SRv6 core components already
exploited for implementing the SRv6 End.DT4 behavior.

Currently the two End.DT6 implementations coexist seamlessly and can be
used depending on the context and the user preferences. So, in order to
support both versions of DT6 a new attribute (vrftable) has been
introduced which allows us to differentiate the implementation of the
behavior to be used.

A SRv6 End.DT6 legacy behavior is still instantiated using a command
like the following one:

 $ ip -6 route add 2001:db8::1 encap seg6local action End.DT6 table 100 dev eth0

While to instantiate the SRv6 End.DT6 in VRF mode, the command is still
pretty straight forward:

 $ ip -6 route add 2001:db8::1 encap seg6local action End.DT6 vrftable 100 dev eth0.

Obviously as in the case of SRv6 End.DT4, the VRF strict_mode parameter
must be set (net.vrf.strict_mode=1) and the VRF associated with table
100 must exist.

Please note that the instances of SRv6 End.DT6 legacy and End.DT6 VRF
mode can coexist in the same system/configuration without problems.

[1] https://tools.ietf.org/html/draft-ietf-spring-srv6-network-programming

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 13:30:51 -08:00
Andrea Mayer
664d6f8686 seg6: add support for the SRv6 End.DT4 behavior
SRv6 End.DT4 is defined in the SRv6 Network Programming [1].

The SRv6 End.DT4 is used to implement IPv4 L3VPN use-cases in
multi-tenants environments. It decapsulates the received packets and it
performs IPv4 routing lookup in the routing table of the tenant.

The SRv6 End.DT4 Linux implementation leverages a VRF device in order to
force the routing lookup into the associated routing table.

To make the End.DT4 work properly, it must be guaranteed that the routing
table used for routing lookup operations is bound to one and only one
VRF during the tunnel creation. Such constraint has to be enforced by
enabling the VRF strict_mode sysctl parameter, i.e:
 $ sysctl -wq net.vrf.strict_mode=1.

At JANOG44, LINE corporation presented their multi-tenant DC architecture
using SRv6 [2]. In the slides, they reported that the Linux kernel is
missing the support of SRv6 End.DT4 behavior.

The SRv6 End.DT4 behavior can be instantiated using a command similar to
the following:

 $ ip route add 2001:db8::1 encap seg6local action End.DT4 vrftable 100 dev eth0

We introduce the "vrftable" extension in iproute2 in a following patch.

[1] https://tools.ietf.org/html/draft-ietf-spring-srv6-network-programming
[2] https://speakerdeck.com/line_developers/line-data-center-networking-with-srv6

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 13:30:50 -08:00
Andrea Mayer
cfdf64a034 seg6: add callbacks for customizing the creation/destruction of a behavior
We introduce two callbacks used for customizing the creation/destruction of
a SRv6 behavior. Such callbacks are defined in the new struct
seg6_local_lwtunnel_ops and hereafter we provide a brief description of
them:

 - build_state(...): used for calling the custom constructor of the
   behavior during its initialization phase and after all the attributes
   have been parsed successfully;

 - destroy_state(...): used for calling the custom destructor of the
   behavior before it is completely destroyed.

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 13:30:50 -08:00
Andrea Mayer
0a3021f1d4 seg6: add support for optional attributes in SRv6 behaviors
Before this patch, each SRv6 behavior specifies a set of required
attributes that must be provided by the userspace application when such
behavior is going to be instantiated. If at least one of the required
attributes is not provided, the creation of the behavior fails.

The SRv6 behavior framework lacks a way to manage optional attributes.
By definition, an optional attribute for a SRv6 behavior consists of an
attribute which may or may not be provided by the userspace. Therefore,
if an optional attribute is missing (and thus not supplied by the user)
the creation of the behavior goes ahead without any issue.

This patch explicitly differentiates the required attributes from the
optional attributes. In particular, each behavior can declare a set of
required attributes and a set of optional ones.

The semantic of the required attributes remains *totally* unaffected by
this patch. The introduction of the optional attributes does NOT impact
on the backward compatibility of the existing SRv6 behaviors.

It is essential to note that if an (optional or required) attribute is
supplied to a SRv6 behavior which does not expect it, the behavior
simply discards such attribute without generating any error or warning.
This operating mode remained unchanged both before and after the
introduction of the optional attributes extension.

The optional attributes are one of the key components used to implement
the SRv6 End.DT6 behavior based on the Virtual Routing and Forwarding
(VRF) framework. The optional attributes make possible the coexistence
of the already existing SRv6 End.DT6 implementation with the new SRv6
End.DT6 VRF-based implementation without breaking any backward
compatibility. Further details on the SRv6 End.DT6 behavior (VRF mode)
are reported in subsequent patches.

From the userspace point of view, the support for optional attributes DO
NOT require any changes to the userspace applications, i.e: iproute2
unless new attributes (required or optional) are needed.

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 13:30:50 -08:00
Andrea Mayer
964adce526 seg6: improve management of behavior attributes
Depending on the attribute (i.e.: SEG6_LOCAL_SRH, SEG6_LOCAL_TABLE, etc),
the parse() callback performs some validity checks on the provided input
and updates the tunnel state (slwt) with the result of the parsing
operation. However, an attribute may also need to reserve some additional
resources (i.e.: memory or setting up an eBPF program) in the parse()
callback to complete the parsing operation.

The parse() callbacks are invoked by the parse_nla_action() for each
attribute belonging to a specific behavior. Given a behavior with N
attributes, if the parsing of the i-th attribute fails, the
parse_nla_action() returns immediately with an error. Nonetheless, the
resources acquired during the parsing of the i-1 attributes are not freed
by the parse_nla_action().

Attributes which acquire resources must release them *in an explicit way*
in both the seg6_local_{build/destroy}_state(). However, adding a new
attribute of this type requires changes to
seg6_local_{build/destroy}_state() to release the resources correctly.

The seg6local infrastructure still lacks a simple and structured way to
release the resources acquired in the parse() operations.

We introduced a new callback in the struct seg6_action_param named
destroy(). This callback releases any resource which may have been acquired
in the parse() counterpart. Each attribute may or may not implement the
destroy() callback depending on whether it needs to free some acquired
resources.

The destroy() callback comes with several of advantages:

 1) we can have many attributes as we want for a given behavior with no
    need to explicitly free the taken resources;

 2) As in case of the seg6_local_build_state(), the
    seg6_local_destroy_state() does not need to handle the release of
    resources directly. Indeed, it calls the destroy_attrs() function which
    is in charge of calling the destroy() callback for every set attribute.
    We do not need to patch seg6_local_{build/destroy}_state() anymore as
    we add new attributes;

 3) the code is more readable and better structured. Indeed, all the
    information needed to handle a given attribute are contained in only
    one place;

 4) it facilitates the integration with new features introduced in further
    patches.

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 13:30:50 -08:00
Jakub Kicinski
846c3c9cfe wireless-drivers-next patches for v5.11
First set of patches for v5.11. rtw88 getting improvements to work
 better with Bluetooth and other driver also getting some new features.
 mhi-ath11k-immutable branch was pulled from mhi tree to avoid
 conflicts with mhi tree.
 
 Major changes:
 
 rtw88
 
 * major bluetooth co-existance improvements
 
 wilc1000
 
 * Wi-Fi Multimedia (WMM) support
 
 ath11k
 
 * Fast Initial Link Setup (FILS) discovery and unsolicited broadcast
   probe response support
 
 * qcom,ath11k-calibration-variant Device Tree setting
 
 * cold boot calibration support
 
 * new DFS region: JP
 
 wnc36xx
 
 * enable connection monitoring and keepalive in firmware
 
 ath10k
 
 * firmware IRAM recovery feature
 
 mhi
 
 * merge mhi-ath11k-immutable branch to make MHI API change go smoothly
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJfyTQyAAoJEG4XJFUm622bCdcIAIyVnqdW7pnoDmWIyQmAEnD9
 vGARkzghPHXnufpOzohyDdxT12X9klhrxSVIgzEgH1/pl3i1PpnF6KXyGFCC44Lw
 wrLXhQygPzmIW1IZtJJE3G72WExXoRjWx6LD1I7C7oEIduqFixXADmK2tKzFp795
 Jxum+sOeT6+Dk1OvO/fIroBHX73mRE9zAuiTIMpt2G1j8uXs9QVfcTbTrUshLASN
 0sX9J6JutltBuM4G7+bFpVzKnLnlQ7ebUaF6nvTCQsgHWZwkS7yAubSWX9sFohbR
 UXgQHNE83s/esOg7nBxAfqTKP8mbxsobmxZtxE5GR5vFY5FJDxqP9Zc2KzPp39w=
 =CbX/
 -----END PGP SIGNATURE-----

Merge tag 'wireless-drivers-next-2020-12-03' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next

Kalle Valo says:

====================
wireless-drivers-next patches for v5.11

First set of patches for v5.11. rtw88 getting improvements to work
better with Bluetooth and other driver also getting some new features.
mhi-ath11k-immutable branch was pulled from mhi tree to avoid
conflicts with mhi tree.

Major changes:

rtw88
 * major bluetooth co-existance improvements
wilc1000
 * Wi-Fi Multimedia (WMM) support
ath11k
 * Fast Initial Link Setup (FILS) discovery and unsolicited broadcast
   probe response support
 * qcom,ath11k-calibration-variant Device Tree setting
 * cold boot calibration support
 * new DFS region: JP
wnc36xx
 * enable connection monitoring and keepalive in firmware
ath10k
 * firmware IRAM recovery feature
mhi
 * merge mhi-ath11k-immutable branch to make MHI API change go smoothly

* tag 'wireless-drivers-next-2020-12-03' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next: (180 commits)
  wl1251: remove trailing semicolon in macro definition
  airo: remove trailing semicolon in macro definition
  wilc1000: added queue support for WMM
  wilc1000: call complete() for failure in wilc_wlan_txq_add_cfg_pkt()
  wilc1000: free resource in wilc_wlan_txq_add_mgmt_pkt() for failure path
  wilc1000: free resource in wilc_wlan_txq_add_net_pkt() for failure path
  wilc1000: added 'ndo_set_mac_address' callback support
  brcmfmac: expose firmware config files through modinfo
  wlcore: Switch to using the new API kobj_to_dev()
  rtw88: coex: add feature to enhance HID coexistence performance
  rtw88: coex: upgrade coexistence A2DP mechanism
  rtw88: coex: add action for coexistence in hardware initial
  rtw88: coex: add function to avoid cck lock
  rtw88: coex: change the coexistence mechanism for WLAN connected
  rtw88: coex: change the coexistence mechanism for HID
  rtw88: coex: update AFH information while in free-run mode
  rtw88: coex: update the mechanism for A2DP + PAN
  rtw88: coex: add debug message
  rtw88: coex: run coexistence when WLAN entering/leaving LPS
  Revert "rtl8xxxu: Add Buffalo WI-U3-866D to list of supported devices"
  ...
====================

Link: https://lore.kernel.org/r/20201203185732.9CFA5C433ED@smtp.codeaurora.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 10:56:37 -08:00
Zhang Changzhong
12c8a8ca11 xsk: Return error code if force_zc is set
If force_zc is set, we should exit out with an error, not fall back to
copy mode.

Fixes: 921b68692a ("xsk: Enable sharing of dma mappings")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/1607077277-41995-1-git-send-email-zhangchangzhong@huawei.com
2020-12-04 16:48:31 +01:00
Jakub Kicinski
a1dd1d8697 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-12-03

The main changes are:

1) Support BTF in kernel modules, from Andrii.

2) Introduce preferred busy-polling, from Björn.

3) bpf_ima_inode_hash() and bpf_bprm_opts_set() helpers, from KP Singh.

4) Memcg-based memory accounting for bpf objects, from Roman.

5) Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks, from Stanislav.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (118 commits)
  selftests/bpf: Fix invalid use of strncat in test_sockmap
  libbpf: Use memcpy instead of strncpy to please GCC
  selftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module
  selftests/bpf: Add tp_btf CO-RE reloc test for modules
  libbpf: Support attachment of BPF tracing programs to kernel modules
  libbpf: Factor out low-level BPF program loading helper
  bpf: Allow to specify kernel module BTFs when attaching BPF programs
  bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier
  selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF
  selftests/bpf: Add support for marking sub-tests as skipped
  selftests/bpf: Add bpf_testmod kernel module for testing
  libbpf: Add kernel module BTF support for CO-RE relocations
  libbpf: Refactor CO-RE relocs to not assume a single BTF object
  libbpf: Add internal helper to load BTF data by FD
  bpf: Keep module's btf_data_size intact after load
  bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address()
  selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP
  bpf: Adds support for setting window clamp
  samples/bpf: Fix spelling mistake "recieving" -> "receiving"
  bpf: Fix cold build of test_progs-no_alu32
  ...
====================

Link: https://lore.kernel.org/r/20201204021936.85653-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 07:48:12 -08:00
Borwankar, Antara
bdeca45a0c mac80211: set SDATA_STATE_RUNNING for monitor interfaces
During restarrt, mac80211 is supposed to reconfigure the driver.
When there's a monitor interface, the interface is added and the
channel context for it was created, but not assigned to it as it
was not considered running during the restart.

Fix this by setting SDATA_STATE_RUNNING while adding monitor
interfaces.

Signed-off-by: Borwankar, Antara <antara.borwankar@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201129172929.e1df99693a4c.I494579f28018c2d0b9d4083a664cf872c28405ae@changeid
[reword commit log]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-04 12:45:25 +01:00
Sara Sharon
f495acd885 cfg80211: initialize rekey_data
In case we have old supplicant, the akm field is uninitialized.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20201129172929.930f0ab7ebee.Ic546e384efab3f4a89f318eafddc3eb7d556aecb@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-04 12:35:58 +01:00
Wen Gong
8fca2b8706 mac80211: fix return value of ieee80211_chandef_he_6ghz_oper
ieee80211_chandef_he_6ghz_oper() needs to return true if it
determined a value 6 GHz chandef, fix that.

Fixes: 1d00ce807e ("mac80211: support S1G association")
Signed-off-by: Wen Gong <wgong@codeaurora.org>
Link: https://lore.kernel.org/r/1606121152-3452-1-git-send-email-wgong@codeaurora.org
[rewrite commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-12-04 12:35:58 +01:00
Simon Wunderlich
34a14c2e63 batman-adv: Drop unused soft-interface.h include in fragmentation.c
The commit 992b03b88e ("batman-adv: Don't always reallocate the
fragmentation skb head") removed the last user of functions from
soft-interface.h.

Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-12-04 08:41:16 +01:00
Sven Eckelmann
a962cb29bb batman-adv: Drop legacy code for auto deleting mesh interfaces
The only way to automatically drop batadv mesh interfaces when all soft
interfaces were removed was dropped with the sysfs support. It is no longer
needed to have them handled by kernel anymore.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-12-04 08:40:52 +01:00
Sven Eckelmann
aff6f5a68b batman-adv: Drop deprecated debugfs support
The debugfs support in batman-adv was marked as deprecated by the commit
00caf6a2b3 ("batman-adv: Mark debugfs functionality as deprecated") and
scheduled for removal in 2021.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-12-04 08:40:52 +01:00
Sven Eckelmann
76e9f27628 batman-adv: Drop deprecated sysfs support
The sysfs in batman-adv support was marked as deprecated by the commit
42cdd52148 ("batman-adv: ABI: Mark sysfs files as deprecated") and
scheduled for removal in 2021.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-12-04 08:40:52 +01:00
Sven Eckelmann
a5ad457eea batman-adv: Allow selection of routing algorithm over rtnetlink
A batadv net_device is associated to a B.A.T.M.A.N. routing algorithm. This
algorithm has to be selected before the interface is initialized and cannot
be changed after that. The only way to select this algorithm was a module
parameter which specifies the default algorithm used during the creation of
the net_device.

This module parameter is writeable over
/sys/module/batman_adv/parameters/routing_algo and thus allows switching of
the routing algorithm:

1. change routing_algo parameter
2. create new batadv net_device

But this is not race free because another process can be scheduled between
1 + 2 and in that time frame change the routing_algo parameter again.

It is much cleaner to directly provide this information inside the
rtnetlink's RTM_NEWLINK message. The two processes would be (in regards of
the creation parameter of their batadv interfaces) be isolated. This also
eases the integration of batadv devices inside tools like network-manager
or systemd-networkd which are not expecting to operate on /sys before a new
net_device is created.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-12-04 08:40:52 +01:00
Sven Eckelmann
128254ceea batman-adv: Prepare infrastructure for newlink settings
The batadv generic netlink family can be used to retrieve the current state
and set various configuration settings. But there are also settings which
must be set before the actual interface is created.

The rtnetlink already uses IFLA_INFO_DATA to allow net_device families to
transfer such configurations. The minimal required functionality for this
is now available for the batadv rtnl_link_ops. Also a new IFLA class of
attributes will be attached to it because rtnetlink only allows 51
different attributes but batadv_nl_attrs already contains 62 attributes.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-12-04 08:40:52 +01:00
Sven Eckelmann
fcd193e1df batman-adv: Add new include for min/max helpers
The commit b296a6d533 ("kernel.h: split out min()/max() et al. helpers")
moved the min/max helper functionality from kernel.h to minmax.h. Adjust
the kernel code accordingly to avoid fragile indirect includes.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-12-04 08:40:52 +01:00
Simon Wunderlich
fee3e9554a batman-adv: Start new development cycle
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-12-04 08:40:52 +01:00
Andrii Nakryiko
22dc4a0f5e bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier
Remove a permeating assumption thoughout BPF verifier of vmlinux BTF. Instead,
wherever BTF type IDs are involved, also track the instance of struct btf that
goes along with the type ID. This allows to gradually add support for kernel
module BTFs and using/tracking module types across BPF helper calls and
registers.

This patch also renames btf_id() function to btf_obj_id() to minimize naming
clash with using btf_id to denote BTF *type* ID, rather than BTF *object*'s ID.

Also, altough btf_vmlinux can't get destructed and thus doesn't need
refcounting, module BTFs need that, so apply BTF refcounting universally when
BPF program is using BTF-powered attachment (tp_btf, fentry/fexit, etc). This
makes for simpler clean up code.

Now that BTF type ID is not enough to uniquely identify a BTF type, extend BPF
trampoline key to include BTF object ID. To differentiate that from target
program BPF ID, set 31st bit of type ID. BTF type IDs (at least currently) are
not allowed to take full 32 bits, so there is no danger of confusing that bit
with a valid BTF type ID.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201203204634.1325171-10-andrii@kernel.org
2020-12-03 17:38:21 -08:00
Prankur gupta
cb81110997 bpf: Adds support for setting window clamp
Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_WINDOW_CLAMP,
which sets the maximum receiver window size. It will be useful for
limiting receiver window based on RTT.

Signed-off-by: Prankur gupta <prankgup@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201202213152.435886-2-prankgup@fb.com
2020-12-03 17:23:24 -08:00
Jakub Kicinski
55fd59b003 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:
	drivers/net/ethernet/ibm/ibmvnic.c

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03 15:44:09 -08:00
Florian Westphal
3ecfbe3e82 mptcp: emit tcp reset when a join request fails
RFC 8684 says:
 If the token is unknown or the host wants to refuse subflow establishment
 (for example, due to a limit on the number of subflows it will permit),
 the receiver will send back a reset (RST) signal, analogous to an unknown
 port in TCP, containing an MP_TCPRST option (Section 3.6) with an
 "MPTCP specific error" reason code.

mptcp-next doesn't support MP_TCPRST yet, this can be added in another
change.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03 12:56:03 -08:00
Florian Westphal
7ea851d19b tcp: merge 'init_req' and 'route_req' functions
The Multipath-TCP standard (RFC 8684) says that an MPTCP host should send
a TCP reset if the token in a MP_JOIN request is unknown.

At this time we don't do this, the 3whs completes and the 'new subflow'
is reset afterwards.  There are two ways to allow MPTCP to send the
reset.

1. override 'send_synack' callback and emit the rst from there.
   The drawback is that the request socket gets inserted into the
   listeners queue just to get removed again right away.

2. Send the reset from the 'route_req' function instead.
   This avoids the 'add&remove request socket', but route_req lacks the
   skb that is required to send the TCP reset.

Instead of just adding the skb to that function for MPTCP sake alone,
Paolo suggested to merge init_req and route_req functions.

This saves one indirection from syn processing path and provides the skb
to the merged function at the same time.

'send reset on unknown mptcp join token' is added in next patch.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03 12:56:03 -08:00
Davide Caratti
9608fa6530 net/sched: act_mpls: ensure LSE is pullable before reading it
when 'act_mpls' is used to mangle the LSE, the current value is read from
the packet dereferencing 4 bytes at mpls_hdr(): ensure that the label is
contained in the skb "linear" area.

Found by code inspection.

v2:
 - use MPLS_HLEN instead of sizeof(new_lse), thanks to Jakub Kicinski

Fixes: 2a2ea50870 ("net: sched: add mpls manipulation actions to TC")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Guillaume Nault <gnault@redhat.com>
Link: https://lore.kernel.org/r/3243506cba43d14858f3bd21ee0994160e44d64a.1606987058.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03 11:13:37 -08:00
Davide Caratti
43c13605ba net: openvswitch: ensure LSE is pullable before reading it
when openvswitch is configured to mangle the LSE, the current value is
read from the packet dereferencing 4 bytes at mpls_hdr(): ensure that
the label is contained in the skb "linear" area.

Found by code inspection.

Fixes: d27cf5c59a ("net: core: add MPLS update core helper and use in OvS")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/aa099f245d93218b84b5c056b67b6058ccf81a66.1606987185.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03 11:13:29 -08:00
Davide Caratti
13de4ed9e3 net: skbuff: ensure LSE is pullable before decrementing the MPLS ttl
skb_mpls_dec_ttl() reads the LSE without ensuring that it is contained in
the skb "linear" area. Fix this calling pskb_may_pull() before reading the
current ttl.

Found by code inspection.

Fixes: 2a2ea50870 ("net: sched: add mpls manipulation actions to TC")
Reported-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/53659f28be8bc336c113b5254dc637cc76bbae91.1606987074.git.dcaratti@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03 11:13:21 -08:00
Roman Gushchin
819a4f3235 bpf: Eliminate rlimit-based memory accounting for xskmap maps
Do not use rlimit-based memory accounting for xskmap maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-31-guro@fb.com
2020-12-02 18:32:47 -08:00
Roman Gushchin
0d2c4f9640 bpf: Eliminate rlimit-based memory accounting for sockmap and sockhash maps
Do not use rlimit-based memory accounting for sockmap and sockhash maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-29-guro@fb.com
2020-12-02 18:32:47 -08:00
Roman Gushchin
28e1dcdef0 bpf: Refine memcg-based memory accounting for xskmap maps
Extend xskmap memory accounting to include the memory taken by
the xsk_map_node structure.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-18-guro@fb.com
2020-12-02 18:32:46 -08:00
Roman Gushchin
7846dd9f83 bpf: Refine memcg-based memory accounting for sockmap and sockhash maps
Include internal metadata into the memcg-based memory accounting.
Also include the memory allocated on updating an element.

Signed-off-by: Roman Gushchin <guro@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201201215900.3569844-17-guro@fb.com
2020-12-02 18:32:46 -08:00
Dan Carpenter
6ee50c8e26 net/x25: prevent a couple of overflows
The .x25_addr[] address comes from the user and is not necessarily
NUL terminated.  This leads to a couple problems.  The first problem is
that the strlen() in x25_bind() can read beyond the end of the buffer.

The second problem is more subtle and could result in memory corruption.
The call tree is:
  x25_connect()
  --> x25_write_internal()
      --> x25_addr_aton()

The .x25_addr[] buffers are copied to the "addresses" buffer from
x25_write_internal() so it will lead to stack corruption.

Verify that the strings are NUL terminated and return -EINVAL if they
are not.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Fixes: a9288525d2 ("X25: Dont let x25_bind use addresses containing characters")
Reported-by: "kiyin(尹亮)" <kiyin@tencent.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Martin Schiller <ms@dev.tdt.de>
Link: https://lore.kernel.org/r/X8ZeAKm8FnFpN//B@mwanda
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-02 17:26:36 -08:00
Xuan Zhuo
3413f04141 xsk: Change the tx writeable condition
Modify the tx writeable condition from the queue is not full to the
number of present tx queues is less than the half of the total number
of queues. Because the tx queue not full is a very short time, this will
cause a large number of EPOLLOUT events, and cause a large number of
process wake up.

Fixes: 35fcde7f8d ("xsk: support for Tx")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/508fef55188d4e1160747ead64c6dcda36735880.1606555939.git.xuanzhuo@linux.alibaba.com
2020-12-03 01:14:38 +01:00
Xuan Zhuo
f5da54187e xsk: Replace datagram_poll by sock_poll_wait
datagram_poll will judge the current socket status (EPOLLIN, EPOLLOUT)
based on the traditional socket information (eg: sk_wmem_alloc), but
this does not apply to xsk. So this patch uses sock_poll_wait instead of
datagram_poll, and the mask is calculated by xsk_poll.

Fixes: c497176cb2 ("xsk: add Rx receive functions and poll support")
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/e82f4697438cd63edbf271ebe1918db8261b7c09.1606555939.git.xuanzhuo@linux.alibaba.com
2020-12-03 01:14:14 +01:00
Stanislav Fomichev
427167c0b0 bpf: Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks
I have to now lock/unlock socket for the bind hook execution.
That shouldn't cause any overhead because the socket is unbound
and shouldn't receive any traffic.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/20201202172516.3483656-3-sdf@google.com
2020-12-02 13:25:11 -08:00
Eric Dumazet
05e3ecea4a mptcp: avoid potential infinite loop in mptcp_recvmsg()
If a packet is ready in receive queue, and application isssues
a recvmsg()/recvfrom()/recvmmsg() request asking for zero bytes,
we hang in mptcp_recvmsg().

Fixes: ea4ca586b1 ("mptcp: refine MPTCP-level ack scheduling")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Link: https://lore.kernel.org/r/20201202171657.1185108-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-02 12:06:12 -08:00
Antoine Tenart
832ba59649 net: ip6_gre: set dev->hard_header_len when using header_ops
syzkaller managed to crash the kernel using an NBMA ip6gre interface. I
could reproduce it creating an NBMA ip6gre interface and forwarding
traffic to it:

  skbuff: skb_under_panic: text:ffffffff8250e927 len:148 put:44 head:ffff8c03c7a33
  ------------[ cut here ]------------
  kernel BUG at net/core/skbuff.c:109!
  Call Trace:
  skb_push+0x10/0x10
  ip6gre_header+0x47/0x1b0
  neigh_connected_output+0xae/0xf0

ip6gre tunnel provides its own header_ops->create, and sets it
conditionally when initializing the tunnel in NBMA mode. When
header_ops->create is used, dev->hard_header_len should reflect the
length of the header created. Otherwise, when not used,
dev->needed_headroom should be used.

Fixes: eb95f52fc7 ("net: ipv6_gre: Fix GRO to work on IPv6 over GRE tap")
Cc: Maria Pasechnik <mariap@mellanox.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Link: https://lore.kernel.org/r/20201130161911.464106-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-02 11:16:12 -08:00
Fedor Tokarev
35a6d39672 net: sunrpc: Fix 'snprintf' return value check in 'do_xprt_debugfs'
'snprintf' returns the number of characters which would have been written
if enough space had been available, excluding the terminating null byte.
Thus, the return value of 'sizeof(buf)' means that the last character
has been dropped.

Signed-off-by: Fedor Tokarev <ftokarev@gmail.com>
Fixes: 2f34b8bfae ("SUNRPC: add links for all client xprts to debugfs")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:54 -05:00
Trond Myklebust
eee1f54964 SUNRPC: Fix open coded xdr_stream_remaining()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:54 -05:00
Trond Myklebust
0279024f22 SUNRPC: Fix up xdr_set_page()
While we always want to align to the next page and/or the beginning of
the tail buffer when we call xdr_set_next_page(), the functions
xdr_align_data() and xdr_expand_hole() really want to align to the next
object in that next page or tail.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:54 -05:00
Trond Myklebust
9ed5af268e SUNRPC: Clean up the handling of page padding in rpc_prepare_reply_pages()
rpc_prepare_reply_pages() currently expects the 'hdrsize' argument to
contain the length of the data that we expect to want placed in the head
kvec plus a count of 1 word of padding that is placed after the page data.
This is very confusing when trying to read the code, and sometimes leads
to callers adding an arbitrary value of '1' just in order to satisfy the
requirement (whether or not the page data actually needs such padding).

This patch aims to clarify the code by changing the 'hdrsize' argument
to remove that 1 word of padding. This means we need to subtract the
padding from all the existing callers.

Fixes: 02ef04e432 ("NFS: Account for XDR pad of buf->pages")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
1d97316692 SUNRPC: Fix up xdr_read_pages() to take arbitrary object lengths
Fix up xdr_read_pages() so that it can handle object lengths that are
larger than the page length, by simply aligning to the next object in
the buffer tail.
The function will continue to return the length of the truncate object
data that actually fit into the pages.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
8d86e373b0 SUNRPC: Clean up helpers xdr_set_iov() and xdr_set_page_base()
Allow xdr_set_iov() to set a base so that we can use it to set the
cursor to a specific position in the kvec buffer.

If the new base overflows the kvec/pages buffer in either xdr_set_iov()
or xdr_set_page_base(), then truncate it so that we point to the end of
the buffer.

Finally, change both function to return the number of bytes remaining to
read in their buffers.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
2b1f83d108 SUNRPC: Fix up typo in xdr_init_decode()
We already know that the head buffer and page are empty, so if there is
any data, it is in the tail.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
4aceaaea5e SUNRPC: Fix up open coded kmemdup_nul()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
c87b056e58 SUNRPC: Remove unused function xprt_load_transport()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:53 -05:00
Trond Myklebust
1fc5f13186 SUNRPC: Add a helper to return the transport identifier given a netid
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
9bccd26461 SUNRPC: Close a race with transport setup and module put
After we've looked up the transport module, we need to ensure it can't
go away until we've finished running the transport setup code.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
d5aa6b22e2 SUNRPC: xprt_load_transport() needs to support the netid "rdma6"
According to RFC5666, the correct netid for an IPv6 addressed RDMA
transport is "rdma6", which we've supported as a mount option since
Linux-4.7. The problem is when we try to load the module "xprtrdma6",
that will fail, since there is no modulealias of that name.

Fixes: 181342c5eb ("xprtrdma: Add rdma6 option to support NFS/RDMA IPv6")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:52 -05:00
Trond Myklebust
e4c72201b6 SUNRPC: rpc_wake_up() should wake up tasks in the correct order
Currently, we wake up the tasks by priority queue ordering, which means
that we ignore the batching that is supposed to help with QoS issues.

Fixes: c049f8ea9a ("SUNRPC: Remove the bh-safe lock requirement on the rpc_wait_queue->lock")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-12-02 14:05:51 -05:00
Chuck Lever
0359af7ac3 SUNRPC: Remove XDRBUF_SPARSE_PAGES flag in gss_proxy upcall
There's no need to defer allocation of pages for the receive buffer.

- This upcall is quite infrequent
- gssp_alloc_receive_pages() can allocate the pages with GFP_KERNEL,
  unlike the transport
- gssp_alloc_receive_pages() knows exactly how many pages are needed

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Olga Kornievskaia <aglo@umich.edu>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-12-02 10:06:52 -05:00
Guvenc Gulce
a3db10efcc net/smc: Add support for obtaining SMCR device list
Deliver SMCR device information via netlink based
diagnostic interface.

Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 17:56:13 -08:00
Guvenc Gulce
aaf95523d5 net/smc: Add support for obtaining SMCD device list
Deliver SMCD device information via netlink based
diagnostic interface.

Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 17:56:13 -08:00
Guvenc Gulce
8f9dde4bf2 net/smc: Add SMC-D Linkgroup diagnostic support
Deliver SMCD Linkgroup information via netlink based
diagnostic interface.

Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 17:56:13 -08:00
Guvenc Gulce
5a7e09d58f net/smc: Introduce SMCR get link command
Introduce get link command which loops through
all available links of all available link groups. It
uses the SMC-R linkgroup list as entry point, not
the socket list, which makes linkgroup diagnosis
possible, in case linkgroup does not contain active
connections anymore.

Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 17:56:13 -08:00
Guvenc Gulce
e9b8c845cb net/smc: Introduce SMCR get linkgroup command
Introduce get linkgroup command which loops through
all available SMCR linkgroups. It uses the SMC-R linkgroup
list as entry point, not the socket list, which makes
linkgroup diagnosis possible, in case linkgroup does not
contain active connections anymore.

Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 17:56:13 -08:00
Guvenc Gulce
099b990bd1 net/smc: Add support for obtaining system information
Add new netlink command to obtain system information
of the smc module.

Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 17:56:13 -08:00
Guvenc Gulce
e8372d9d21 net/smc: Introduce generic netlink interface for diagnostic purposes
Introduce generic netlink interface infrastructure to expose
the diagnostic information regarding smc linkgroups, links and devices.

Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 17:56:12 -08:00
Guvenc Gulce
49407ae2bc net/smc: Refactor smc ism v2 capability handling
Encapsulate the smc ism v2 capability boolean value
in a function for better information hiding.

Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 17:56:12 -08:00
Guvenc Gulce
6443b2f60e net/smc: Add diagnostic information to link structure
During link creation add net-device ifindex and ib-device
name to link structure. This is needed for diagnostic purposes.

When diagnostic information is gathered, we need to traverse
device, linkgroup and link structures, to be able to do that
we need to hold a spinlock for the linkgroup list, without this
diagnostic information in link structure, another device list
mutex holding would be necessary to dereference the device
pointer in the link structure which would be impossible when
holding a spinlock already.

Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 17:56:12 -08:00
Guvenc Gulce
3d453f53c7 net/smc: Add diagnostic information to smc ib-device
During smc ib-device creation, add network device ifindex to smc
ib-device structure. Register for netdevice changes and update ib-device
accordingly. This is needed for diagnostic purposes.

Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 17:56:12 -08:00
Guvenc Gulce
ddc992866f net/smc: Add link counters for IB device ports
Add link counters to the structure of the smc ib device, one counter per
ib port. Increase/decrease the counters as needed in the corresponding
routines.

Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 17:56:12 -08:00
Guvenc Gulce
07d51580ff net/smc: Add connection counters for links
Add connection counters to the structure of the link.
Increase/decrease the counters as needed in the corresponding
routines.

Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 17:56:12 -08:00
Guvenc Gulce
8b2f0f44f0 net/smc: Use active link of the connection
Use active link of the connection directly and not
via linkgroup array structure when obtaining link
data of the connection.

Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 17:56:12 -08:00
Karsten Graul
8cf3f3e423 net/smc: use helper smc_conn_abort() in listen processing
The helper smc_connect_abort() can be used by the listen processing
functions, too. And rename this helper to smc_conn_abort() to make the
purpose clearer.
No functional change.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 17:56:12 -08:00
Rohit Maheshwari
d31c080075 net/tls: make sure tls offload sets salt_size
Recent changes made to remove AES constants started using protocol
aware salt_size. ctx->prot_info's salt_size is filled in tls sw case,
but not in tls offload mode, and was working so far because of the
hard coded value was used.

Fixes: 6942a284fb ("net/tls: make inline helpers protocol-aware")
Signed-off-by: Rohit Maheshwari <rohitm@chelsio.com>
Link: https://lore.kernel.org/r/20201201090752.27355-1-rohitm@chelsio.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 17:51:30 -08:00
Jason Gunthorpe
2b0a999ba0 Linux 5.10-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl/EM9oeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG/3kH/RNkFyTlHlUkZpJx
 8Ks2yWgUln7YhZcmOaG/IcIyWnhCgo3l35kiaH7XxM+rPMZzidp51MHUllaTAQDc
 u+5EFHMJsmTWUfE8ocHPb1cPdYEDSoVr6QUsixbL9+uADpRz+VZVtWMb89EiyMrC
 wvLIzpnqY5UNriWWBxD0hrmSsT4g9XCsauer4k2KB+zvebwg6vFOMCFLFc2qz7fb
 ABsrPFqLZOMp+16chGxyHP7LJ6ygI/Hwf7tPW8ppv4c+hes4HZg7yqJxXhV02QbJ
 s10s6BTcEWMqKg/T6L/VoScsMHWUcNdvrr3uuPQhgup240XdmB1XO8rOKddw27e7
 VIjrjNw=
 =4ZaP
 -----END PGP SIGNATURE-----

Merge tag 'v5.10-rc6' into rdma.git for-next

For dependencies in following patches

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-12-01 20:40:50 -04:00
Vladimir Oltean
c214550ff8 net: delete __dev_getfirstbyhwtype
The last user of the RTNL brother of dev_getfirstbyhwtype (the latter
being synchronized under RCU) has been deleted in commit b4db2b35fc
("afs: Use core kernel UUID generation").

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20201129200550.2433401-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 15:40:01 -08:00
Randy Dunlap
637b77fdca net/tipc: fix all function Return: notation
Fix Return: kernel-doc notation in all net/tipc/ source files.
Also keep ReST list notation intact for output formatting.
Fix a few typos in comments.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 15:38:17 -08:00
Randy Dunlap
f172f4b81a net/tipc: fix socket.c kernel-doc
Fix socket.c kernel-doc warnings in preparation for adding to the
networking docbook.

Also, for rcvbuf_limit(), use bullet notation so that the lines do
not run together.

../net/tipc/socket.c:130: warning: Function parameter or member 'cong_links' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'probe_unacked' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'snd_win' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'peer_caps' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'rcv_win' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'group' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'oneway' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'nagle_start' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'snd_backlog' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'msg_acc' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'pkt_cnt' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'expect_ack' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'nodelay' not described in 'tipc_sock'
../net/tipc/socket.c:130: warning: Function parameter or member 'group_is_open' not described in 'tipc_sock'
../net/tipc/socket.c:267: warning: Function parameter or member 'sk' not described in 'tsk_advance_rx_queue'
../net/tipc/socket.c:295: warning: Function parameter or member 'sk' not described in 'tsk_rej_rx_queue'
../net/tipc/socket.c:295: warning: Function parameter or member 'error' not described in 'tsk_rej_rx_queue'
../net/tipc/socket.c:894: warning: Function parameter or member 'tsk' not described in 'tipc_send_group_msg'
../net/tipc/socket.c:1187: warning: Function parameter or member 'net' not described in 'tipc_sk_mcast_rcv'
../net/tipc/socket.c:1323: warning: Function parameter or member 'inputq' not described in 'tipc_sk_conn_proto_rcv'
../net/tipc/socket.c:1323: warning: Function parameter or member 'xmitq' not described in 'tipc_sk_conn_proto_rcv'
../net/tipc/socket.c:1885: warning: Function parameter or member 'sock' not described in 'tipc_recvmsg'
../net/tipc/socket.c:1993: warning: Function parameter or member 'sock' not described in 'tipc_recvstream'
../net/tipc/socket.c:2313: warning: Function parameter or member 'xmitq' not described in 'tipc_sk_filter_rcv'
../net/tipc/socket.c:2404: warning: Function parameter or member 'xmitq' not described in 'tipc_sk_enqueue'
../net/tipc/socket.c:2456: warning: Function parameter or member 'net' not described in 'tipc_sk_rcv'
../net/tipc/socket.c:2693: warning: Function parameter or member 'kern' not described in 'tipc_accept'
../net/tipc/socket.c:3816: warning: Excess function parameter 'sysctl_tipc_sk_filter' description in 'tipc_sk_filtering'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 15:38:12 -08:00
Randy Dunlap
4476441e48 net/tipc: fix node.c kernel-doc
Fix node.c kernel-doc warnings in preparation for adding to the
networking docbook.

../net/tipc/node.c:141: warning: Function parameter or member 'kref' not described in 'tipc_node'
../net/tipc/node.c:141: warning: Function parameter or member 'bc_entry' not described in 'tipc_node'
../net/tipc/node.c:141: warning: Function parameter or member 'failover_sent' not described in 'tipc_node'
../net/tipc/node.c:141: warning: Function parameter or member 'peer_id' not described in 'tipc_node'
../net/tipc/node.c:141: warning: Function parameter or member 'peer_id_string' not described in 'tipc_node'
../net/tipc/node.c:141: warning: Function parameter or member 'conn_sks' not described in 'tipc_node'
../net/tipc/node.c:141: warning: Function parameter or member 'keepalive_intv' not described in 'tipc_node'
../net/tipc/node.c:141: warning: Function parameter or member 'timer' not described in 'tipc_node'
../net/tipc/node.c:141: warning: Function parameter or member 'peer_net' not described in 'tipc_node'
../net/tipc/node.c:141: warning: Function parameter or member 'peer_hash_mix' not described in 'tipc_node'
../net/tipc/node.c:273: warning: Function parameter or member '__n' not described in 'tipc_node_crypto_rx'
../net/tipc/node.c:822: warning: Function parameter or member 'n' not described in '__tipc_node_link_up'
../net/tipc/node.c:822: warning: Function parameter or member 'bearer_id' not described in '__tipc_node_link_up'
../net/tipc/node.c:822: warning: Function parameter or member 'xmitq' not described in '__tipc_node_link_up'
../net/tipc/node.c:888: warning: Function parameter or member 'n' not described in 'tipc_node_link_up'
../net/tipc/node.c:888: warning: Function parameter or member 'bearer_id' not described in 'tipc_node_link_up'
../net/tipc/node.c:888: warning: Function parameter or member 'xmitq' not described in 'tipc_node_link_up'
../net/tipc/node.c:948: warning: Function parameter or member 'n' not described in '__tipc_node_link_down'
../net/tipc/node.c:948: warning: Function parameter or member 'bearer_id' not described in '__tipc_node_link_down'
../net/tipc/node.c:948: warning: Function parameter or member 'xmitq' not described in '__tipc_node_link_down'
../net/tipc/node.c:948: warning: Function parameter or member 'maddr' not described in '__tipc_node_link_down'
../net/tipc/node.c:1537: warning: Function parameter or member 'net' not described in 'tipc_node_get_linkname'
../net/tipc/node.c:1537: warning: Function parameter or member 'len' not described in 'tipc_node_get_linkname'
../net/tipc/node.c:1891: warning: Function parameter or member 'n' not described in 'tipc_node_check_state'
../net/tipc/node.c:1891: warning: Function parameter or member 'xmitq' not described in 'tipc_node_check_state'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 15:38:09 -08:00
Randy Dunlap
5c5d6796d4 net/tipc: fix name_table.c kernel-doc
Fix name_table.c kernel-doc warnings in preparation for adding to the
networking docbook.

../net/tipc/name_table.c:115: warning: Function parameter or member 'start' not described in 'service_range_foreach_match'
../net/tipc/name_table.c:115: warning: Function parameter or member 'end' not described in 'service_range_foreach_match'
../net/tipc/name_table.c:127: warning: Function parameter or member 'start' not described in 'service_range_match_first'
../net/tipc/name_table.c:127: warning: Function parameter or member 'end' not described in 'service_range_match_first'
../net/tipc/name_table.c:176: warning: Function parameter or member 'start' not described in 'service_range_match_next'
../net/tipc/name_table.c:176: warning: Function parameter or member 'end' not described in 'service_range_match_next'
../net/tipc/name_table.c:225: warning: Function parameter or member 'type' not described in 'tipc_publ_create'
../net/tipc/name_table.c:225: warning: Function parameter or member 'lower' not described in 'tipc_publ_create'
../net/tipc/name_table.c:225: warning: Function parameter or member 'upper' not described in 'tipc_publ_create'
../net/tipc/name_table.c:225: warning: Function parameter or member 'scope' not described in 'tipc_publ_create'
../net/tipc/name_table.c:225: warning: Function parameter or member 'node' not described in 'tipc_publ_create'
../net/tipc/name_table.c:225: warning: Function parameter or member 'port' not described in 'tipc_publ_create'
../net/tipc/name_table.c:225: warning: Function parameter or member 'key' not described in 'tipc_publ_create'
../net/tipc/name_table.c:252: warning: Function parameter or member 'type' not described in 'tipc_service_create'
../net/tipc/name_table.c:252: warning: Function parameter or member 'hd' not described in 'tipc_service_create'
../net/tipc/name_table.c:367: warning: Function parameter or member 'sr' not described in 'tipc_service_remove_publ'
../net/tipc/name_table.c:367: warning: Function parameter or member 'node' not described in 'tipc_service_remove_publ'
../net/tipc/name_table.c:367: warning: Function parameter or member 'key' not described in 'tipc_service_remove_publ'
../net/tipc/name_table.c:383: warning: Function parameter or member 'pa' not described in 'publication_after'
../net/tipc/name_table.c:383: warning: Function parameter or member 'pb' not described in 'publication_after'
../net/tipc/name_table.c:401: warning: Function parameter or member 'service' not described in 'tipc_service_subscribe'
../net/tipc/name_table.c:401: warning: Function parameter or member 'sub' not described in 'tipc_service_subscribe'
../net/tipc/name_table.c:546: warning: Function parameter or member 'net' not described in 'tipc_nametbl_translate'
../net/tipc/name_table.c:546: warning: Function parameter or member 'type' not described in 'tipc_nametbl_translate'
../net/tipc/name_table.c:546: warning: Function parameter or member 'instance' not described in 'tipc_nametbl_translate'
../net/tipc/name_table.c:546: warning: Function parameter or member 'dnode' not described in 'tipc_nametbl_translate'
../net/tipc/name_table.c:762: warning: Function parameter or member 'net' not described in 'tipc_nametbl_withdraw'
../net/tipc/name_table.c:762: warning: Function parameter or member 'type' not described in 'tipc_nametbl_withdraw'
../net/tipc/name_table.c:762: warning: Function parameter or member 'lower' not described in 'tipc_nametbl_withdraw'
../net/tipc/name_table.c:762: warning: Function parameter or member 'upper' not described in 'tipc_nametbl_withdraw'
../net/tipc/name_table.c:762: warning: Function parameter or member 'key' not described in 'tipc_nametbl_withdraw'
../net/tipc/name_table.c:796: warning: Function parameter or member 'sub' not described in 'tipc_nametbl_subscribe'
../net/tipc/name_table.c:826: warning: Function parameter or member 'sub' not described in 'tipc_nametbl_unsubscribe'
../net/tipc/name_table.c:876: warning: Function parameter or member 'net' not described in 'tipc_service_delete'
../net/tipc/name_table.c:876: warning: Function parameter or member 'sc' not described in 'tipc_service_delete'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 15:38:05 -08:00
Randy Dunlap
cb67296e8c net/tipc: fix name_distr.c kernel-doc
Fix name_distr.c kernel-doc warnings in preparation for adding to the
networking docbook.

../net/tipc/name_distr.c:55: warning: Function parameter or member 'i' not described in 'publ_to_item'
../net/tipc/name_distr.c:55: warning: Function parameter or member 'p' not described in 'publ_to_item'
../net/tipc/name_distr.c:70: warning: Function parameter or member 'net' not described in 'named_prepare_buf'
../net/tipc/name_distr.c:70: warning: Function parameter or member 'type' not described in 'named_prepare_buf'
../net/tipc/name_distr.c:70: warning: Function parameter or member 'size' not described in 'named_prepare_buf'
../net/tipc/name_distr.c:70: warning: Function parameter or member 'dest' not described in 'named_prepare_buf'
../net/tipc/name_distr.c:88: warning: Function parameter or member 'net' not described in 'tipc_named_publish'
../net/tipc/name_distr.c:88: warning: Function parameter or member 'publ' not described in 'tipc_named_publish'
../net/tipc/name_distr.c:116: warning: Function parameter or member 'net' not described in 'tipc_named_withdraw'
../net/tipc/name_distr.c:116: warning: Function parameter or member 'publ' not described in 'tipc_named_withdraw'
../net/tipc/name_distr.c:147: warning: Function parameter or member 'net' not described in 'named_distribute'
../net/tipc/name_distr.c:147: warning: Function parameter or member 'seqno' not described in 'named_distribute'
../net/tipc/name_distr.c:199: warning: Function parameter or member 'net' not described in 'tipc_named_node_up'
../net/tipc/name_distr.c:199: warning: Function parameter or member 'dnode' not described in 'tipc_named_node_up'
../net/tipc/name_distr.c:199: warning: Function parameter or member 'capabilities' not described in 'tipc_named_node_up'
../net/tipc/name_distr.c:225: warning: Function parameter or member 'net' not described in 'tipc_publ_purge'
../net/tipc/name_distr.c:225: warning: Function parameter or member 'publ' not described in 'tipc_publ_purge'
../net/tipc/name_distr.c:225: warning: Function parameter or member 'addr' not described in 'tipc_publ_purge'
../net/tipc/name_distr.c:272: warning: Function parameter or member 'net' not described in 'tipc_update_nametbl'
../net/tipc/name_distr.c:272: warning: Function parameter or member 'i' not described in 'tipc_update_nametbl'
../net/tipc/name_distr.c:272: warning: Function parameter or member 'node' not described in 'tipc_update_nametbl'
../net/tipc/name_distr.c:272: warning: Function parameter or member 'dtype' not described in 'tipc_update_nametbl'
../net/tipc/name_distr.c:353: warning: Function parameter or member 'net' not described in 'tipc_named_rcv'
../net/tipc/name_distr.c:353: warning: Function parameter or member 'namedq' not described in 'tipc_named_rcv'
../net/tipc/name_distr.c:353: warning: Function parameter or member 'rcv_nxt' not described in 'tipc_named_rcv'
../net/tipc/name_distr.c:353: warning: Function parameter or member 'open' not described in 'tipc_named_rcv'
../net/tipc/name_distr.c:383: warning: Function parameter or member 'net' not described in 'tipc_named_reinit'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 15:38:01 -08:00
Randy Dunlap
a99df449b0 net/tipc: fix link.c kernel-doc
Fix link.c kernel-doc warnings in preparation for adding to the
networking docbook.

../net/tipc/link.c:200: warning: Function parameter or member 'session' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'snd_nxt_state' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'rcv_nxt_state' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'in_session' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'active' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'if_name' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'rst_cnt' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'drop_point' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'failover_reasm_skb' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'failover_deferdq' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'transmq' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'backlog' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'snd_nxt' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'rcv_unacked' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'deferdq' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'window' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'min_win' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'ssthresh' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'max_win' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'cong_acks' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'checkpoint' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'reasm_tnlmsg' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'last_gap' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'last_ga' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'bc_rcvlink' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'bc_sndlink' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'nack_state' not described in 'tipc_link'
../net/tipc/link.c:200: warning: Function parameter or member 'bc_peer_is_up' not described in 'tipc_link'
../net/tipc/link.c:473: warning: Function parameter or member 'self' not described in 'tipc_link_create'
../net/tipc/link.c:473: warning: Function parameter or member 'peer_id' not described in 'tipc_link_create'
../net/tipc/link.c:473: warning: Excess function parameter 'ownnode' description in 'tipc_link_create'
../net/tipc/link.c:544: warning: Function parameter or member 'ownnode' not described in 'tipc_link_bc_create'
../net/tipc/link.c:544: warning: Function parameter or member 'peer' not described in 'tipc_link_bc_create'
../net/tipc/link.c:544: warning: Function parameter or member 'peer_id' not described in 'tipc_link_bc_create'
../net/tipc/link.c:544: warning: Function parameter or member 'peer_caps' not described in 'tipc_link_bc_create'
../net/tipc/link.c:544: warning: Function parameter or member 'bc_sndlink' not described in 'tipc_link_bc_create'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 15:37:56 -08:00
Randy Dunlap
ec6a1649fe net/tipc: fix bearer.c for kernel-doc
Fix kernel-doc warnings in bearer.c:

../net/tipc/bearer.c:77: warning: Function parameter or member 'name' not described in 'tipc_media_find'
../net/tipc/bearer.c:91: warning: Function parameter or member 'type' not described in 'media_find_id'
../net/tipc/bearer.c:105: warning: Function parameter or member 'buf' not described in 'tipc_media_addr_printf'
../net/tipc/bearer.c:105: warning: Function parameter or member 'len' not described in 'tipc_media_addr_printf'
../net/tipc/bearer.c:105: warning: Function parameter or member 'a' not described in 'tipc_media_addr_printf'
../net/tipc/bearer.c:174: warning: Function parameter or member 'net' not described in 'tipc_bearer_find'
../net/tipc/bearer.c:174: warning: Function parameter or member 'name' not described in 'tipc_bearer_find'
../net/tipc/bearer.c:238: warning: Function parameter or member 'net' not described in 'tipc_enable_bearer'
../net/tipc/bearer.c:238: warning: Function parameter or member 'name' not described in 'tipc_enable_bearer'
../net/tipc/bearer.c:238: warning: Function parameter or member 'disc_domain' not described in 'tipc_enable_bearer'
../net/tipc/bearer.c:238: warning: Function parameter or member 'prio' not described in 'tipc_enable_bearer'
../net/tipc/bearer.c:238: warning: Function parameter or member 'attr' not described in 'tipc_enable_bearer'
../net/tipc/bearer.c:350: warning: Function parameter or member 'net' not described in 'tipc_reset_bearer'
../net/tipc/bearer.c:350: warning: Function parameter or member 'b' not described in 'tipc_reset_bearer'
../net/tipc/bearer.c:374: warning: Function parameter or member 'net' not described in 'bearer_disable'
../net/tipc/bearer.c:374: warning: Function parameter or member 'b' not described in 'bearer_disable'
../net/tipc/bearer.c:462: warning: Function parameter or member 'net' not described in 'tipc_l2_send_msg'
../net/tipc/bearer.c:479: warning: Function parameter or member 'net' not described in 'tipc_l2_send_msg'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 15:37:51 -08:00
Randy Dunlap
5fcb7d47fe net/tipc: fix various kernel-doc warnings
kernel-doc and Sphinx fixes to eliminate lots of warnings
in preparation for adding to the networking docbook.

../net/tipc/crypto.c:57: warning: cannot understand function prototype: 'enum '
../net/tipc/crypto.c:69: warning: cannot understand function prototype: 'enum '
../net/tipc/crypto.c:130: warning: Function parameter or member 'tfm' not described in 'tipc_tfm'
../net/tipc/crypto.c:130: warning: Function parameter or member 'list' not described in 'tipc_tfm'
../net/tipc/crypto.c:172: warning: Function parameter or member 'stat' not described in 'tipc_crypto_stats'
../net/tipc/crypto.c:232: warning: Function parameter or member 'flags' not described in 'tipc_crypto'
../net/tipc/crypto.c:329: warning: Function parameter or member 'ukey' not described in 'tipc_aead_key_validate'
../net/tipc/crypto.c:329: warning: Function parameter or member 'info' not described in 'tipc_aead_key_validate'
../net/tipc/crypto.c:482: warning: Function parameter or member 'aead' not described in 'tipc_aead_tfm_next'
../net/tipc/trace.c:43: warning: cannot understand function prototype: 'unsigned long sysctl_tipc_sk_filter[5] __read_mostly = '

Documentation/networking/tipc:57: ../net/tipc/msg.c:584: WARNING: Unexpected indentation.
Documentation/networking/tipc:63: ../net/tipc/name_table.c:536: WARNING: Unexpected indentation.
Documentation/networking/tipc:63: ../net/tipc/name_table.c:537: WARNING: Block quote ends without a blank line; unexpected unindent.
Documentation/networking/tipc:78: ../net/tipc/socket.c:3809: WARNING: Unexpected indentation.
Documentation/networking/tipc:78: ../net/tipc/socket.c:3807: WARNING: Inline strong start-string without end-string.
Documentation/networking/tipc:72: ../net/tipc/node.c:904: WARNING: Unexpected indentation.
Documentation/networking/tipc:39: ../net/tipc/crypto.c:97: WARNING: Block quote ends without a blank line; unexpected unindent.
Documentation/networking/tipc:39: ../net/tipc/crypto.c:98: WARNING: Block quote ends without a blank line; unexpected unindent.
Documentation/networking/tipc:39: ../net/tipc/crypto.c:141: WARNING: Inline strong start-string without end-string.

../net/tipc/discover.c:82: warning: Function parameter or member 'skb' not described in 'tipc_disc_init_msg'

../net/tipc/msg.c:69: warning: Function parameter or member 'gfp' not described in 'tipc_buf_acquire'
../net/tipc/msg.c:382: warning: Function parameter or member 'offset' not described in 'tipc_msg_build'
../net/tipc/msg.c:708: warning: Function parameter or member 'net' not described in 'tipc_msg_lookup_dest'

../net/tipc/subscr.c:65: warning: Function parameter or member 'seq' not described in 'tipc_sub_check_overlap'
../net/tipc/subscr.c:65: warning: Function parameter or member 'found_lower' not described in 'tipc_sub_check_overlap'
../net/tipc/subscr.c:65: warning: Function parameter or member 'found_upper' not described in 'tipc_sub_check_overlap'

../net/tipc/udp_media.c:75: warning: Function parameter or member 'proto' not described in 'udp_media_addr'
../net/tipc/udp_media.c:75: warning: Function parameter or member 'port' not described in 'udp_media_addr'
../net/tipc/udp_media.c:75: warning: Function parameter or member 'ipv4' not described in 'udp_media_addr'
../net/tipc/udp_media.c:75: warning: Function parameter or member 'ipv6' not described in 'udp_media_addr'
../net/tipc/udp_media.c:98: warning: Function parameter or member 'rcast' not described in 'udp_bearer'

Also fixed a typo of "duest" to "dest".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 15:37:46 -08:00
Randy Dunlap
ff10527e89 net/tipc: fix tipc header files for kernel-doc
Fix tipc header files for adding to the networking docbook.

Remove some uses of "/**" that were not kernel-doc notation.

Fix some source formatting to eliminate Sphinx warnings.

Add missing struct member and function argument kernel-doc descriptions.

Correct the description of a couple of struct members that were
marked as "(FIXME)".

Documentation/networking/tipc:18: ../net/tipc/name_table.h:65: WARNING: Unexpected indentation.
Documentation/networking/tipc:18: ../net/tipc/name_table.h:66: WARNING: Block quote ends without a blank line; unexpected unindent.

../net/tipc/bearer.h:128: warning: Function parameter or member 'min_win' not described in 'tipc_media'
../net/tipc/bearer.h:128: warning: Function parameter or member 'max_win' not described in 'tipc_media'

../net/tipc/bearer.h:171: warning: Function parameter or member 'min_win' not described in 'tipc_bearer'
../net/tipc/bearer.h:171: warning: Function parameter or member 'max_win' not described in 'tipc_bearer'
../net/tipc/bearer.h:171: warning: Function parameter or member 'disc' not described in 'tipc_bearer'
../net/tipc/bearer.h:171: warning: Function parameter or member 'up' not described in 'tipc_bearer'
../net/tipc/bearer.h:171: warning: Function parameter or member 'refcnt' not described in 'tipc_bearer'

../net/tipc/name_distr.h:68: warning: Function parameter or member 'port' not described in 'distr_item'

../net/tipc/name_table.h:111: warning: Function parameter or member 'services' not described in 'name_table'
../net/tipc/name_table.h:111: warning: Function parameter or member 'cluster_scope_lock' not described in 'name_table'
../net/tipc/name_table.h:111: warning: Function parameter or member 'rc_dests' not described in 'name_table'
../net/tipc/name_table.h:111: warning: Function parameter or member 'snd_nxt' not described in 'name_table'

../net/tipc/subscr.h:67: warning: Function parameter or member 'kref' not described in 'tipc_subscription'
../net/tipc/subscr.h:67: warning: Function parameter or member 'net' not described in 'tipc_subscription'
../net/tipc/subscr.h:67: warning: Function parameter or member 'service_list' not described in 'tipc_subscription'
../net/tipc/subscr.h:67: warning: Function parameter or member 'conid' not described in 'tipc_subscription'
../net/tipc/subscr.h:67: warning: Function parameter or member 'inactive' not described in 'tipc_subscription'
../net/tipc/subscr.h:67: warning: Function parameter or member 'lock' not described in 'tipc_subscription'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 15:37:41 -08:00
Hoang Le
0643334902 tipc: fix incompatible mtu of transmission
In commit 682cd3cf94
("tipc: confgiure and apply UDP bearer MTU on running links"), we
introduced a function to change UDP bearer MTU and applied this new value
across existing per-link. However, we did not apply this new MTU value at
node level. This lead to packet dropped at link level if its size is
greater than new MTU value.

To fix this issue, we also apply this new MTU value for node level.

Fixes: 682cd3cf94 ("tipc: confgiure and apply UDP bearer MTU on running links")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Link: https://lore.kernel.org/r/20201130025544.3602-1-hoang.h.le@dektech.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 15:26:57 -08:00
Danielle Ratson
22ec19f3ae bridge: switchdev: Notify about VLAN protocol changes
Drivers that support bridge offload need to be notified about changes to
the bridge's VLAN protocol so that they could react accordingly and
potentially veto the change.

Add a new switchdev attribute to communicate the change to drivers.

Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 15:21:13 -08:00
Lukas Bulwahn
9e39394fae net/ipv6: propagate user pointer annotation
For IPV6_2292PKTOPTIONS, do_ipv6_getsockopt() stores the user pointer
optval in the msg_control field of the msghdr.

Hence, sparse rightfully warns at ./net/ipv6/ipv6_sockglue.c:1151:33:

  warning: incorrect type in assignment (different address spaces)
      expected void *msg_control
      got char [noderef] __user *optval

Since commit 1f466e1f15 ("net: cleanly handle kernel vs user buffers for
->msg_control"), user pointers shall be stored in the msg_control_user
field, and kernel pointers in the msg_control field. This allows to
propagate __user annotations nicely through this struct.

Store optval in msg_control_user to properly record and propagate the
memory space annotation of this pointer.

Note that msg_control_is_user is set to true, so the key invariant, i.e.,
use msg_control_user if and only if msg_control_is_user is true, holds.

The msghdr is further used in the six alternative put_cmsg() calls, with
msg_control_is_user being true, put_cmsg() picks msg_control_user
preserving the __user annotation and passes that properly to
copy_to_user().

No functional change. No change in object code.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20201127093421.21673-1-lukas.bulwahn@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 11:42:33 -08:00
Marco Elver
fa69ee5aa4 net: switch to storing KCOV handle directly in sk_buff
It turns out that usage of skb extensions can cause memory leaks. Ido
Schimmel reported: "[...] there are instances that blindly overwrite
'skb->extensions' by invoking skb_copy_header() after __alloc_skb()."

Therefore, give up on using skb extensions for KCOV handle, and instead
directly store kcov_handle in sk_buff.

Fixes: 6370cc3bbd ("net: add kcov handle to skb extensions")
Fixes: 85ce50d337 ("net: kcov: don't select SKB_EXTENSIONS when there is no NET")
Fixes: 97f53a08cb ("net: linux/skbuff.h: combine SKB_EXTENSIONS + KCOV handling")
Link: https://lore.kernel.org/linux-wireless/20201121160941.GA485907@shredder.lan/
Reported-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20201125224840.2014773-1-elver@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 11:26:19 -08:00
Vlad Buslov
0fca55ed98 net: sched: remove redundant 'rtnl_held' argument
Functions tfilter_notify_chain() and tcf_get_next_proto() are always called
with rtnl lock held in current implementation. Moreover, attempting to call
them without rtnl lock would cause a warning down the call chain in
function __tcf_get_next_proto() that requires the lock to be held by
callers. Remove the 'rtnl_held' argument in order to simplify the code and
make rtnl lock requirement explicit.

Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Link: https://lore.kernel.org/r/20201127151205.23492-1-vladbu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 11:24:56 -08:00
Jan Engelhardt
04295878be netfilter: use actual socket sk for REJECT action
True to the message of commit v5.10-rc1-105-g46d6c5ae953c, _do_
actually make use of state->sk when possible, such as in the REJECT
modules.

Reported-by: Minqiang Chen <ptpt52@gmail.com>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-01 14:33:55 +01:00
Wang Shanker
f7583f02a5 netfilter: nfnl_acct: remove data from struct net
This patch removes nfnl_acct_list from struct net to reduce the default
memory footprint for the netns structure.

Signed-off-by: Miao Wang <shankerwangmiao@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-01 09:45:29 +01:00
Kaixu Xia
0ef083d51b netfilter: Remove unnecessary conversion to bool
Here we could use the '!=' expression to fix the following coccicheck
warning:

./net/netfilter/xt_nfacct.c:30:41-46: WARNING: conversion to bool not needed here

Reported-by: Tosk Robot <tencent_os_robot@tencent.com>
Signed-off-by: Kaixu Xia <kaixuxia@tencent.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-12-01 09:45:29 +01:00
Paolo Abeni
6e628cd3a8 mptcp: use mptcp release_cb for delayed tasks
We have some tasks triggered by the subflow receive path
which require to access the msk socket status, specifically:
mptcp_clean_una() and mptcp_push_pending()

We have almost everything in place to defer to the msk
release_cb such tasks when the msk sock is owned.

Since the worker is no more used to clean the acked data,
for fallback sockets we need to explicitly flush them.

As an added bonus we can move the wake-up code in __mptcp_clean_una(),
simplify a lot mptcp_poll() and move the timer update under
the data lock.

The worker is now used only to process and send DATA_FIN
packets and do the mptcp-level retransmissions.

Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-30 17:55:23 -08:00
Paolo Abeni
7439d687b7 mptcp: avoid a few atomic ops in the rx path
Extending the data_lock scope in mptcp_incoming_option
we can use that to protect both snd_una and wnd_end.
In the typical case, we will have a single atomic op instead of 2

Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-30 17:55:23 -08:00
Paolo Abeni
724cfd2ee8 mptcp: allocate TX skbs in msk context
Move the TX skbs allocation in mptcp_sendmsg() scope,
and tentatively pre-allocate a skbs number proportional
to the sendmsg() length.

Use the ssk tx skb cache to prevent the subflow allocation.

This allows removing the msk skb extension cache and will
make possible the later patches.

Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-30 17:55:23 -08:00
Paolo Abeni
879526030c mptcp: protect the rx path with the msk socket spinlock
Such spinlock is currently used only to protect the 'owned'
flag inside the socket lock itself. With this patch, we extend
its scope to protect the whole msk receive path and
sk_forward_memory.

Given the above, we can always move data into the msk receive
queue (and OoO queue) from the subflow.

We leverage the previous commit, so that we need to acquire the
spinlock in the tx path only when moving fwd memory.

recvmsg() must now explicitly acquire the socket spinlock
when moving skbs out of sk_receive_queue. To reduce the number of
lock operations required we use a second rx queue and splice the
first into the latter in mptcp_lock_sock(). Additionally rmem
allocated memory is bulk-freed via release_cb()

Acked-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-30 17:55:23 -08:00
Paolo Abeni
e93da92896 mptcp: implement wmem reservation
This leverages the previous commit to reserve the wmem
required for the sendmsg() operation when the msk socket
lock is first acquired.
Some heuristics are used to get a reasonable [over] estimation of
the whole memory required. If we can't forward alloc such amount
fallback to a reasonable small chunk, otherwise enter the wait
for memory path.

When sendmsg() needs more memory it looks at wmem_reserved
first and if that is exhausted, move more space from
sk_forward_alloc.

The reserved memory is not persistent and is released at the
next socket unlock via the release_cb().

Overall this will simplify the next patch.

Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-30 17:55:23 -08:00
Paolo Abeni
ad80b0fc6e mptcp: open code mptcp variant for lock_sock
This allows invoking an additional callback under the
socket spin lock.

Will be used by the next patches to avoid additional
spin lock contention.

Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-30 17:55:23 -08:00
Björn Töpel
b02e5a0ebb xsk: Propagate napi_id to XDP socket Rx path
Add napi_id to the xdp_rxq_info structure, and make sure the XDP
socket pick up the napi_id in the Rx path. The napi_id is used to find
the corresponding NAPI structure for socket busy polling.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/bpf/20201130185205.196029-7-bjorn.topel@gmail.com
2020-12-01 00:09:25 +01:00
Björn Töpel
a0731952d9 xsk: Add busy-poll support for {recv,send}msg()
Wire-up XDP socket busy-poll support for recvmsg() and sendmsg(). If
the XDP socket prefers busy-polling, make sure that no wakeup/IPI is
performed.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20201130185205.196029-6-bjorn.topel@gmail.com
2020-12-01 00:09:25 +01:00
Björn Töpel
e392081837 xsk: Check need wakeup flag in sendmsg()
Add a check for need wake up in sendmsg(), so that if a user calls
sendmsg() when no wakeup is needed, do not trigger a wakeup.

To simplify the need wakeup check in the syscall, unconditionally
enable the need wakeup flag for Tx. This has a side-effect for poll();
If poll() is called for a socket without enabled need wakeup, a Tx
wakeup is unconditionally performed.

The wakeup matrix for AF_XDP now looks like:

need wakeup | poll()       | sendmsg()   | recvmsg()
------------+--------------+-------------+------------
disabled    | wake Tx      | wake Tx     | nop
enabled     | check flag;  | check flag; | check flag;
            |   wake Tx/Rx |   wake Tx   |   wake Rx

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20201130185205.196029-5-bjorn.topel@gmail.com
2020-12-01 00:09:25 +01:00
Björn Töpel
45a8668184 xsk: Add support for recvmsg()
Add support for non-blocking recvmsg() to XDP sockets. Previously,
only sendmsg() was supported by XDP socket. Now, for symmetry and the
upcoming busy-polling support, recvmsg() is added.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20201130185205.196029-4-bjorn.topel@gmail.com
2020-12-01 00:09:25 +01:00
Björn Töpel
7c951cafc0 net: Add SO_BUSY_POLL_BUDGET socket option
This option lets a user set a per socket NAPI budget for
busy-polling. If the options is not set, it will use the default of 8.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20201130185205.196029-3-bjorn.topel@gmail.com
2020-12-01 00:09:25 +01:00
Björn Töpel
7fd3253a7d net: Introduce preferred busy-polling
The existing busy-polling mode, enabled by the SO_BUSY_POLL socket
option or system-wide using the /proc/sys/net/core/busy_read knob, is
an opportunistic. That means that if the NAPI context is not
scheduled, it will poll it. If, after busy-polling, the budget is
exceeded the busy-polling logic will schedule the NAPI onto the
regular softirq handling.

One implication of the behavior above is that a busy/heavy loaded NAPI
context will never enter/allow for busy-polling. Some applications
prefer that most NAPI processing would be done by busy-polling.

This series adds a new socket option, SO_PREFER_BUSY_POLL, that works
in concert with the napi_defer_hard_irqs and gro_flush_timeout
knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were
introduced in commit 6f8b12d661 ("net: napi: add hard irqs deferral
feature"), and allows for a user to defer interrupts to be enabled and
instead schedule the NAPI context from a watchdog timer. When a user
enables the SO_PREFER_BUSY_POLL, again with the other knobs enabled,
and the NAPI context is being processed by a softirq, the softirq NAPI
processing will exit early to allow the busy-polling to be performed.

If the application stops performing busy-polling via a system call,
the watchdog timer defined by gro_flush_timeout will timeout, and
regular softirq handling will resume.

In summary; Heavy traffic applications that prefer busy-polling over
softirq processing should use this option.

Example usage:

  $ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
  $ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout

Note that the timeout should be larger than the userspace processing
window, otherwise the watchdog will timeout and fall back to regular
softirq processing.

Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com
2020-12-01 00:09:25 +01:00
Björn Töpel
ed1182dc00 xdp: Handle MEM_TYPE_XSK_BUFF_POOL correctly in xdp_return_buff()
It turns out that it does exist a path where xdp_return_buff() is
being passed an XDP buffer of type MEM_TYPE_XSK_BUFF_POOL. This path
is when AF_XDP zero-copy mode is enabled, and a buffer is redirected
to a DEVMAP with an attached XDP program that drops the buffer.

This change simply puts the handling of MEM_TYPE_XSK_BUFF_POOL back
into xdp_return_buff().

Fixes: 82c41671ca ("xdp: Simplify xdp_return_{frame, frame_rx_napi, buff}")
Reported-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Link: https://lore.kernel.org/bpf/20201127171726.123627-1-bjorn.topel@gmail.com
2020-11-30 23:00:26 +01:00
Chuck Lever
c1346a1216 NFSD: Replace the internals of the READ_BUF() macro
Convert the READ_BUF macro in nfs4xdr.c from open code to instead
use the new xdr_stream-style decoders already in use by the encode
side (and by the in-kernel NFS client implementation). Once this
conversion is done, each individual NFSv4 argument decoder can be
independently cleaned up to replace these macros with C code.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:36 -05:00
Chuck Lever
5191955d6f SUNRPC: Prepare for xdr_stream-style decoding on the server-side
A "permanent" struct xdr_stream is allocated in struct svc_rqst so
that it is usable by all server-side decoders. A per-rqst scratch
buffer is also allocated to handle decoding XDR data items that
cross page boundaries.

To demonstrate how it will be used, add the first call site for the
new svcxdr_init_decode() API.

As an additional part of the overall conversion, add symbolic
constants for successful and failed XDR operations. Returning "0" is
overloaded. Sometimes it means something failed, but sometimes it
means success. To make it more clear when XDR decoding functions
succeed or fail, introduce symbolic constants.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:35 -05:00
Chuck Lever
0ae4c3e8a6 SUNRPC: Add xdr_set_scratch_page() and xdr_reset_scratch_buffer()
Clean up: De-duplicate some frequently-used code.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 14:46:35 -05:00
Chuck Lever
156708adf2 SUNRPC: Move the svc_xdr_recvfrom() tracepoint
Commit c509f15a58 ("SUNRPC: Split the xdr_buf event class") added
display of the rqst's XID to the svc_xdr_buf_class. However, when
the recvfrom tracepoint fires, rq_xid has yet to be filled in with
the current XID. So it ends up recording the previous XID that was
handled by that svc_rqst.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:24 -05:00
Chuck Lever
d7cc739726 svcrdma: support multiple Read chunks per RPC
An efficient way to handle multiple Read chunks is to post them all
together and then take a single completion. This is also how the
code is already structured: when the Read completion fires, all
portions of the incoming RPC message are available to be assembled.

The difficult problem is setting up the Read sink buffers so that
the server pulls the client's data into place, making subsequent
pull-up unnecessary. There are several cases:

* No Read chunks. No-op.

* One data item Read chunk. This is the fast case, where the inline
  part of the RPC-over-RDMA message becomes the head and tail, and
  the data item chunk is placed in buf->pages.

* A Position-zero Read chunk. Treated like TCP: the Read chunk is
  pulled into contiguous pages.

+ A Position-zero Read chunk with data item chunks. Treated like
  TCP: all of the Read chunks are pulled into contiguous pages.

+ Multiple data item chunks. Treated like TCP: the inline part is
  copied and the data item chunks are pulled into contiguous pages.

The "*" cases are already supported. This patch adds support for the
"+" cases.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:23 -05:00
Chuck Lever
d96962e6d0 svcrdma: Use the new parsed chunk list when pulling Read chunks
As a pre-requisite for handling multiple Read chunks in each Read
list, convert svc_rdma_recv_read_chunk() to use the new parsed Read
chunk list.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:23 -05:00
Chuck Lever
bafe9c27d5 svcrdma: Rename info::ri_chunklen
I'm about to change the purpose of ri_chunklen: Instead of tracking
the number of bytes in one Read chunk, it will track the total
number of bytes in the Read list. Rename it for clarity.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:23 -05:00
Chuck Lever
b704be09dc svcrdma: Clean up chunk tracepoints
We already have trace_svcrdma_decode_rseg(), which records each
ingress Read segment. Instead of reporting those again when they
are about to be posted as RDMA Reads, let's fire one tracepoint
before posting each type of chunk.

So we'll get:

        nfsd-1998  [002]   321.666615: svcrdma_decode_rseg:  cq.id=4 cid=42 segno=0 position=0 192@0x013ca9ebfae14000:0xb0010b05
        nfsd-1998  [002]   321.666615: svcrdma_decode_rseg:  cq.id=4 cid=42 segno=1 position=0 7688@0x013ca9ebf914e000:0xb0010a05
        nfsd-1998  [002]   321.666615: svcrdma_decode_rseg:  cq.id=4 cid=42 segno=2 position=0 28@0x013ca9ebfae15000:0xb0010905
        nfsd-1998  [002]   321.666622: svcrdma_decode_rqst:  cq.id=4 cid=42 xid=0x013ca9eb vers=1 credits=128 proc=RDMA_NOMSG hdrlen=100

        nfsd-1998  [002]   321.666642: svcrdma_post_read_chunk: cq.id=3 cid=112 sqecount=3

kworker/2:1H-221   [002]   321.673949: svcrdma_wc_read:      cq.id=3 cid=112 status=SUCCESS (0/0x0)

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:23 -05:00
Chuck Lever
7954c8503b svcrdma: Remove chunk list pointers
Clean up: These pointers are no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:23 -05:00
Chuck Lever
41bc163ffe svcrdma: Support multiple Write chunks in svc_rdma_send_reply_chunk
Refactor svc_rdma_send_reply_chunk() so that it Sends only the parts
of rq_res that do not contain a result payload.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:23 -05:00
Chuck Lever
2371bcc056 svcrdma: Support multiple Write chunks in svc_rdma_map_reply_msg()
Refactor: svc_rdma_map_reply_msg() is restructured to DMA map only
the parts of rq_res that do not contain a result payload.

This change has been tested to confirm that it does not cause a
regression in the no Write chunk and single Write chunk cases.
Multiple Write chunks have not been tested.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:23 -05:00
Chuck Lever
9d0b09d5ef svcrdma: Support multiple write chunks when pulling up
When counting the number of SGEs needed to construct a Send request,
do not count result payloads. And, when copying the Reply message
into the pull-up buffer, result payloads are not to be copied to the
Send buffer.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:22 -05:00
Chuck Lever
6911f3e10c svcrdma: Use parsed chunk lists to encode Reply transport headers
Refactor: Instead of re-parsing the ingress RPC Call transport
header when constructing the egress RPC Reply transport header, use
the new parsed Write list and Reply chunk, which are version-
agnostic and already XDR decoded.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:22 -05:00
Chuck Lever
7a1cbfa180 svcrdma: Use parsed chunk lists to construct RDMA Writes
Refactor: Instead of re-parsing the ingress RPC Call transport
header when constructing RDMA Writes, use the new parsed chunk lists
for the Write list and Reply chunk, which are version-agnostic and
already XDR-decoded.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:22 -05:00
Chuck Lever
58b2e0fefa svcrdma: Use parsed chunk lists to detect reverse direction replies
Refactor: Don't duplicate header decoding smarts here. Instead, use
the new parsed chunk lists.

Note that the XID sanity test is also removed. The XID is already
looked up by the cb handler, and is rejected if it's not recognized.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:22 -05:00
Chuck Lever
eb3de6a49d svcrdma: Use parsed chunk lists to derive the inv_rkey
Refactor: Don't duplicate header decoding smarts here. Instead, use
the new parsed chunk lists.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:22 -05:00
Chuck Lever
78147ca8b4 svcrdma: Add a "parsed chunk list" data structure
This simple data structure binds the location of each data payload
inside of an RPC message to the chunk that will be used to push it
to or pull it from the client.

There are several benefits to this small additional overhead:

 * It enables support for more than one chunk in incoming Read and
   Write lists.

 * It translates the version-specific on-the-wire format into a
   generic in-memory structure, enabling support for multiple
   versions of the RPC/RDMA transport protocol.

 * It enables the server to re-organize a chunk list if it needs to
   adjust where Read chunk data lands in server memory without
   altering the contents of the XDR-encoded Receive buffer.

Construction of these lists is done while sanity checking each
incoming RPC/RDMA header. Subsequent patches will make use of the
generated data structures.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:22 -05:00
Chuck Lever
ded380f100 svcrdma: Clean up svc_rdma_encode_reply_chunk()
Refactor: Match the control flow of svc_rdma_encode_write_list().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:22 -05:00
Chuck Lever
f6ad77590a svcrdma: Post RDMA Writes while XDR encoding replies
The only RPC/RDMA ordering requirement between RDMA Writes and RDMA
Sends is that the responder must post the Writes on the Send queue
before posting the Send that conveys the RPC Reply for that Write
payload.

The Linux NFS server implementation now has a transport method that
can post result Payload Writes earlier than svc_rdma_sendto:

   ->xpo_result_payload()

This gets RDMA Writes going earlier so they are more likely to be
complete at the remote end before the Send completes.

Some care must be taken with pulled-up Replies. We don't want to
push the Write chunk and then send the same payload data via Send.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:22 -05:00
Chuck Lever
76e5492b16 NFSD: Invoke svc_encode_result_payload() in "read" NFSD encoders
Have the NFSD encoders annotate the boundaries of every
direct-data-placement eligible result data payload. Then change
svcrdma to use that annotation instead of the xdr->page_len
when handling Write chunks.

For NFSv4 on RDMA, that enables the ability to recognize multiple
result payloads per compound. This is a pre-requisite for supporting
multiple Write chunks per RPC transaction.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:22 -05:00
Chuck Lever
03493bca08 SUNRPC: Rename svc_encode_read_payload()
Clean up: "result payload" is a less confusing name for these
payloads. "READ payload" reflects only the NFS usage.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:21 -05:00
Chuck Lever
ab1394ee7a svcrdma: Refactor the RDMA Write path
Refactor for subsequent changes.

Constify the xdr_buf argument to ensure the code here does not
modify it, and to enable callers to pass in a
"const struct xdr_buf *".

At the same time, rename the helper functions, which emit RDMA
Writes, not RDMA Sends, and add documenting comments.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:21 -05:00
Chuck Lever
51bad8cc13 svcrdma: Const-ify the xdr_buf arguments
Clean up: Ensure the code in rw.c does not modify the argument, and
enable callers to also use "const struct xdr_buf *".

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:21 -05:00
Chuck Lever
5a7e702670 SUNRPC: Adjust synopsis of xdr_buf_subsegment()
Clean up: This enables xdr_buf_subsegment()'s callers to pass in a
const pointer to that buffer.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:21 -05:00
Chuck Lever
e5decb2eb5 svcrdma: Catch another Reply chunk overflow case
When space in the Reply chunk runs out in the middle of a segment,
we end up passing a zero-length SGL to rdma_rw_ctx_init(), and it
oopses.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-11-30 13:00:21 -05:00
Jakub Kicinski
bd2d5c54dc Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

1) Fix insufficient validation of IPSET_ATTR_IPADDR_IPV6 reported
   by syzbot.

2) Remove spurious reports on nf_tables when lockdep gets disabled,
   from Florian Westphal.

3) Fix memleak in the error path of error path of
   ip_vs_control_net_init(), from Wang Hai.

4) Fix missing control data in flow dissector, otherwise IP address
   matching in hardware offload infra does not work.

5) Fix hardware offload match on prefix IP address when userspace
   does not send a bitwise expression to represent the prefix.

* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
  netfilter: nftables_offload: build mask based from the matching bytes
  netfilter: nftables_offload: set address type in control dissector
  ipvs: fix possible memory leak in ip_vs_control_net_init
  netfilter: nf_tables: avoid false-postive lockdep splat
  netfilter: ipset: prevent uninit-value in hash_ip6_add
====================

Link: https://lore.kernel.org/r/20201127190313.24947-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-28 13:20:23 -08:00
Guillaume Nault
1ebf179037 ipv4: Fix tos mask in inet_rtm_getroute()
When inet_rtm_getroute() was converted to use the RCU variants of
ip_route_input() and ip_route_output_key(), the TOS parameters
stopped being masked with IPTOS_RT_MASK before doing the route lookup.

As a result, "ip route get" can return a different route than what
would be used when sending real packets.

For example:

    $ ip route add 192.0.2.11/32 dev eth0
    $ ip route add unreachable 192.0.2.11/32 tos 2
    $ ip route get 192.0.2.11 tos 2
    RTNETLINK answers: No route to host

But, packets with TOS 2 (ECT(0) if interpreted as an ECN bit) would
actually be routed using the first route:

    $ ping -c 1 -Q 2 192.0.2.11
    PING 192.0.2.11 (192.0.2.11) 56(84) bytes of data.
    64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.173 ms

    --- 192.0.2.11 ping statistics ---
    1 packets transmitted, 1 received, 0% packet loss, time 0ms
    rtt min/avg/max/mdev = 0.173/0.173/0.173/0.000 ms

This patch re-applies IPTOS_RT_MASK in inet_rtm_getroute(), to
return results consistent with real route lookups.

Fixes: 3765d35ed8 ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/b2d237d08317ca55926add9654a48409ac1b8f5b.1606412894.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-28 13:14:23 -08:00
Jakub Kicinski
3771b82242 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-11-28

1) Do not reference the skb for xsk's generic TX side since when looped
   back into RX it might crash in generic XDP, from Björn Töpel.

2) Fix umem cleanup on a partially set up xsk socket when being destroyed,
   from Magnus Karlsson.

3) Fix an incorrect netdev reference count when failing xsk_bind() operation,
   from Marek Majtyka.

4) Fix bpftool to set an error code on failed calloc() in build_btf_type_table(),
   from Zhen Lei.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Add MAINTAINERS entry for BPF LSM
  bpftool: Fix error return value in build_btf_type_table
  net, xsk: Avoid taking multiple skbuff references
  xsk: Fix incorrect netdev reference count
  xsk: Fix umem cleanup bug at socket destruct
  MAINTAINERS: Update XDP and AF_XDP entries
====================

Link: https://lore.kernel.org/r/20201128005104.1205-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-28 12:04:57 -08:00
Jakub Kicinski
28d35ad083 Here are some batman-adv bugfixes:
- Fix head/tailroom issues for fragments, by Sven Eckelmann (3 patches)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl/BORsWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoX+AD/wJYmiGL8beD3eCBQdTjmvJNRbZ
 MIlzse2LsiSk/4G/UbrEr11M4pc6gFDghH/VXs2MxD2hd6qJsJH7ulAQKMPzTEqD
 WhwlyJ08O9HIivaN2Ri1ibwmLIDAYlvR2bsWu95weI/hDdE5ups/kIvBudM8b+fC
 HKQ/e9wGaiB+zUUobvl6bP0Av6u3xPNTHFkXgGTo1shgLISAYFsmEoYQ3yZYrYKy
 P9ckwLoJwMD6AODMRvWVmW4dzi5QSBWN/84XgQSoqdlZeNklIcTY2P5XWdePo7cv
 wTnSxm4gJwdU7i8dBXLbtOYzmavzq0JkQHnRcFcKm6RWMbaeNNwfYUwMLp4h4ZtG
 Mjh0qt4IyuRb5Tf7jkoZ9UR4fIftpuMNMMvE37Cfr7wUZKWtVbG9jHyydU2MZ678
 vAF5NigYHa88i+MSdbXq8fPflmhddiHsN05nzI3olzX7tBFWNDf+71P82Gm4zKuM
 YC4ZlZH7Rrb79C13CJm5xKN0MI5TYGDYJeFLV4Km7hZ9H9EyDJcsuSFPgV0R0hg5
 Y5rWIWRrK/lU1aSXPSEuyL1P+tbH5z5xeYxdPUz2WroqNTyre6ycXy+wuoyCw5Ij
 t91wqD8pNC1/3nynBbI2kcLeGFPjDTEfWYcYN7OE3Ts6AXr3fs8x1XUcUsseAZ1P
 oxaUxzzfCcDno3abtg==
 =mORx
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-pullrequest-20201127' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here are some batman-adv bugfixes:

 - Fix head/tailroom issues for fragments, by Sven Eckelmann (3 patches)

* tag 'batadv-net-pullrequest-20201127' of git://git.open-mesh.org/linux-merge:
  batman-adv: Don't always reallocate the fragmentation skb head
  batman-adv: Reserve needed_*room for fragments
  batman-adv: Consider fragmentation for needed_headroom
====================

Link: https://lore.kernel.org/r/20201127173849.19208-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-28 11:56:06 -08:00
Antoine Tenart
44f64f23ba netfilter: bridge: reset skb->pkt_type after NF_INET_POST_ROUTING traversal
Netfilter changes PACKET_OTHERHOST to PACKET_HOST before invoking the
hooks as, while it's an expected value for a bridge, routing expects
PACKET_HOST. The change is undone later on after hook traversal. This
can be seen with pairs of functions updating skb>pkt_type and then
reverting it to its original value:

For hook NF_INET_PRE_ROUTING:
  setup_pre_routing / br_nf_pre_routing_finish

For hook NF_INET_FORWARD:
  br_nf_forward_ip / br_nf_forward_finish

But the third case where netfilter does this, for hook
NF_INET_POST_ROUTING, the packet type is changed in br_nf_post_routing
but never reverted. A comment says:

  /* We assume any code from br_dev_queue_push_xmit onwards doesn't care
   * about the value of skb->pkt_type. */

But when having a tunnel (say vxlan) attached to a bridge we have the
following call trace:

  br_nf_pre_routing
  br_nf_pre_routing_ipv6
     br_nf_pre_routing_finish
  br_nf_forward_ip
     br_nf_forward_finish
  br_nf_post_routing           <- pkt_type is updated to PACKET_HOST
     br_nf_dev_queue_xmit      <- but not reverted to its original value
  vxlan_xmit
     vxlan_xmit_one
        skb_tunnel_check_pmtu  <- a check on pkt_type is performed

In this specific case, this creates issues such as when an ICMPv6 PTB
should be sent back. When CONFIG_BRIDGE_NETFILTER is enabled, the PTB
isn't sent (as skb_tunnel_check_pmtu checks if pkt_type is PACKET_HOST
and returns early).

If the comment is right and no one cares about the value of
skb->pkt_type after br_dev_queue_push_xmit (which isn't true), resetting
it to its original value should be safe.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20201123174902.622102-1-atenart@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-28 11:46:51 -08:00
Marcelo Ricardo Leitner
3567e23379 net/sched: act_ct: enable stats for HW offloaded entries
By setting NF_FLOWTABLE_COUNTER. Otherwise, the updates added by
commit ef803b3cf9 ("netfilter: flowtable: add counter support in HW
offload") are not effective when using act_ct.

While at it, now that we have the flag set, protect the call to
nf_ct_acct_update() by commit beb97d3a31 ("net/sched: act_ct: update
nf_conn_acct for act_ct SW offload in flowtable") with the check on
NF_FLOWTABLE_COUNTER, as also done on other places.

Note that this shouldn't impact performance as these stats are only
enabled when net.netfilter.nf_conntrack_acct is enabled.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: wenxu <wenxu@ucloud.cn>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Link: https://lore.kernel.org/r/481a65741261fd81b0a0813e698af163477467ec.1606415787.git.marcelo.leitner@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-28 11:43:40 -08:00
Jakub Kicinski
5c39f26e67 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Trivial conflict in CAN, keep the net-next + the byteswap wrapper.

Conflicts:
	drivers/net/can/usb/gs_usb.c

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 18:25:27 -08:00
Jon Maloy
b6f88d9c2f tipc: update address terminology in code
We update the terminology in the code so that deprecated structure
names and macros are replaced with those currently recommended in
the user API.

struct tipc_portid   -> struct tipc_socket_addr
struct tipc_name     -> struct tipc_service_addr
struct tipc_name_seq -> struct tipc_service_range

TIPC_ADDR_ID       -> TIPC_SOCKET_ADDR
TIPC_ADDR_NAME     -> TIPC_SERVICE_ADDR
TIPC_ADDR_NAMESEQ  -> TIPC_SERVICE_RANGE
TIPC_CFG_SRV       -> TIPC_NODE_STATE

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 17:34:01 -08:00
Jon Maloy
5f75e0a0e9 tipc: make node number calculation reproducible
The 32-bit node number, aka node hash or node address, is calculated
based on the 128-bit node identity when it is not set explicitly by
the user. In future commits we will need to perform this hash operation
on peer nodes while feeling safe that we obtain the same result.

We do this by interpreting the initial hash as a network byte order
number. Whenever we need to use the number locally on a node
we must therefore translate it to host byte order to obtain an
architecure independent result.

Furthermore, given the context where we use this number, we must not
allow it to be zero unless the node identity also is zero. Hence, in
the rare cases when the xor-ed hash value may end up as zero we replace
it with a fix number, knowing that the code anyway is capable of
handling hash collisions.

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 17:34:01 -08:00
Jon Maloy
60c102eede tipc: refactor tipc_sk_bind() function
We refactor the tipc_sk_bind() function, so that the lock handling
is handled separately from the logics. We also move some sanity
tests to earlier in the call chain, to the function tipc_bind().

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 17:34:01 -08:00
Martin Schiller
139d6eb149 net/x25: remove x25_kill_by_device()
Remove obsolete function x25_kill_by_device(). It's not used any more.

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 17:22:51 -08:00
Martin Schiller
d023b2b9cc net/x25: fix restart request/confirm handling
We have to take the actual link state into account to handle
restart requests/confirms well.

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 17:22:51 -08:00
Martin Schiller
62480b992b net/lapb: fix t1 timer handling for LAPB_STATE_0
1. DTE interface changes immediately to LAPB_STATE_1 and start sending
   SABM(E).

2. DCE interface sends N2-times DM and changes to LAPB_STATE_1
   afterwards if there is no response in the meantime.

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 17:22:51 -08:00
Martin Schiller
a4989fa911 net/lapb: support netdev events
This patch allows layer2 (LAPB) to react to netdev events itself and
avoids the detour via layer3 (X.25).

1. Establish layer2 on NETDEV_UP events, if the carrier is already up.

2. Call lapb_disconnect_request() on NETDEV_GOING_DOWN events to signal
   the peer that the connection will go down.
   (Only when the carrier is up.)

3. When a NETDEV_DOWN event occur, clear all queues, enter state
   LAPB_STATE_0 and stop all timers.

4. The NETDEV_CHANGE event makes it possible to handle carrier loss and
   detection.

   In case of Carrier Loss, clear all queues, enter state LAPB_STATE_0
   and stop all timers.

   In case of Carrier Detection, we start timer t1 on a DCE interface,
   and on a DTE interface we change to state LAPB_STATE_1 and start
   sending SABM(E).

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Acked-by: Xie He <xie.he.0141@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 17:22:51 -08:00
Martin Schiller
7eed751b3b net/x25: handle additional netdev events
1. Add / remove x25_link_device by NETDEV_REGISTER/UNREGISTER and also
   by NETDEV_POST_TYPE_CHANGE/NETDEV_PRE_TYPE_CHANGE.

   This change is needed so that the x25_neigh struct for an interface
   is already created when it shows up and is kept independently if the
   interface goes UP or DOWN.

   This is used in an upcomming commit, where x25 params of an neighbour
   will get configurable through ioctls.

2. NETDEV_CHANGE event makes it possible to handle carrier loss and
   detection. If carrier is lost, clean up everything related to this
   neighbour by calling x25_link_terminated().

3. Also call x25_link_terminated() for NETDEV_DOWN events and remove the
   call to x25_clear_forward_by_dev() in x25_route_device_down(), as
   this is already called by x25_kill_by_neigh() which gets called by
   x25_link_terminated().

4. Do nothing for NETDEV_UP and NETDEV_GOING_DOWN events, as these will
   be handled in layer 2 (LAPB) and layer3 (X.25) will be informed by
   layer2 when layer2 link is established and layer3 link should be
   initiated.

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 17:22:51 -08:00
wenxu
c129412f74 net/sched: sch_frag: add generic packet fragment support.
Currently kernel tc subsystem can do conntrack in cat_ct. But when several
fragment packets go through the act_ct, function tcf_ct_handle_fragments
will defrag the packets to a big one. But the last action will redirect
mirred to a device which maybe lead the reassembly big packet over the mtu
of target device.

This patch add support for a xmit hook to mirred, that gets executed before
xmiting the packet. Then, when act_ct gets loaded, it configs that hook.
The frag xmit hook maybe reused by other modules.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 14:36:02 -08:00
wenxu
fa6d639930 net/sched: act_mirred: refactor the handle of xmit
This one is prepare for the next patch.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 14:36:02 -08:00
wenxu
aadaca9e7c net/sched: fix miss init the mru in qdisc_skb_cb
The mru in the qdisc_skb_cb should be init as 0. Only defrag packets in the
act_ct will set the value.

Fixes: 038ebb1a71 ("net/sched: act_ct: fix miss set mru for ovs after defrag in act_ct")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 14:36:02 -08:00
Vadim Fedorenko
74ea610602 net/tls: add CHACHA20-POLY1305 configuration
Add ChaCha-Poly specific configuration code.

Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 14:32:37 -08:00
Vadim Fedorenko
a6acbe6235 net/tls: add CHACHA20-POLY1305 specific behavior
RFC 7905 defines special behavior for ChaCha-Poly TLS sessions.
The differences are in the calculation of nonce and the absence
of explicit IV. This behavior is like TLSv1.3 partly.

Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 14:32:37 -08:00
Vadim Fedorenko
6942a284fb net/tls: make inline helpers protocol-aware
Inline functions defined in tls.h have a lot of AES-specific
constants. Remove these constants and change argument to struct
tls_prot_info to have an access to cipher type in later patches

Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 14:32:37 -08:00
Zhu Yanjun
bb1b25cab0 xdp: Remove the functions xsk_map_inc and xsk_map_put
The functions xsk_map_put() and xsk_map_inc() are simple wrappers and
as such, replace these functions with the functions bpf_map_inc() and
bpf_map_put() and remove some error testing code.

Signed-off-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/1606402998-12562-1-git-send-email-yanjunz@nvidia.com
2020-11-27 23:00:51 +01:00
Jakub Kicinski
d0742c49ca linux-can-fixes-for-5.10-20201127
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEK3kIWJt9yTYMP3ehqclaivrt76kFAl/Ay8sTHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRCpyVqK+u3vqZvTB/9TfdD9pr5E4zIfStIlir7bvamDRK3y
 QVQd4+gaB6gLKNAQo0Kh+osbMZAm43ixzEtoLEHDOd+XaaNDyJdUpNUXyf+/ZZgs
 twMoCuY1vnO/1LPtVhLuggiyUWhc8YFPs7ZuFYmAWS+kYDMG4qxF3rMHcTjKw+bD
 TcD383ZldO8I2cTmdhT8ldx3tk0UUSCvG284Db43asUkTOhPs8SrkFLVlAZZfcNc
 d3gxzQYuMyR/s/ektyFP39ed9PO/OOQcg401XbT7aXvnCgaecTJHLIDfSvz6f69M
 HHybn8CWs68+Ne4k6kC4c5tGJzz3QGt+q1dO1J1IENG99DsDpoEnfNiS
 =xlpC
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-5.10-20201127' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2020-11-27

The first patch is by me and target the gs_usb driver and fixes the endianess
problem with candleLight firmware.

Another patch by me for the mcp251xfd driver add sanity checking to bail out if
no IRQ is configured.

The next three patches target the m_can driver. A patch by me removes the
hardcoded IRQF_TRIGGER_FALLING from the request_threaded_irq() as this clashes
with the trigger level specified in the DT. Further a patch by me fixes the
nominal bitiming tseg2 min value for modern m_can cores. Pankaj Sharma's patch
add support for cores version 3.3.x.

The last patch by Oliver Hartkopp is for af_can and converts a WARN() into a
pr_warn(), which is triggered by the syzkaller. It was able to create a
situation where the closing of a socket runs simultaneously to the notifier
call chain for removing the CAN network device in use.

* tag 'linux-can-fixes-for-5.10-20201127' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
  can: af_can: can_rx_unregister(): remove WARN() statement from list operation sanity check
  can: m_can: m_can_dev_setup(): add support for bosch mcan version 3.3.0
  can: m_can: fix nominal bitiming tseg2 min for version >= 3.1
  can: m_can: m_can_open(): remove IRQF_TRIGGER_FALLING from request_threaded_irq()'s flags
  can: mcp251xfd: mcp251xfd_probe(): bail out if no IRQ was given
  can: gs_usb: fix endianess problem with candleLight firmware
====================

Link: https://lore.kernel.org/r/20201127100301.512603-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 11:13:39 -08:00
Willem de Bruijn
985f733742 sock: set sk_err to ee_errno on dequeue from errq
When setting sk_err, set it to ee_errno, not ee_origin.

Commit f5f99309fa ("sock: do not set sk_err in
sock_dequeue_err_skb") disabled updating sk_err on errq dequeue,
which is correct for most error types (origins):

  -       sk->sk_err = err;

Commit 38b257938a ("sock: reset sk_err when the error queue is
empty") reenabled the behavior for IMCP origins, which do require it:

  +       if (icmp_next)
  +               sk->sk_err = SKB_EXT_ERR(skb_next)->ee.ee_origin;

But read from ee_errno.

Fixes: 38b257938a ("sock: reset sk_err when the error queue is empty")
Reported-by: Ayush Ranjan <ayushranjan@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Link: https://lore.kernel.org/r/20201126151220.2819322-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 11:07:06 -08:00
Paolo Abeni
d3ab78858f mptcp: fix NULL ptr dereference on bad MPJ
If an msk listener receives an MPJ carrying an invalid token, it
will zero the request socket msk entry. That should later
cause fallback and subflow reset - as per RFC - at
subflow_syn_recv_sock() time due to failing hmac validation.

Since commit 4cf8b7e48a ("subflow: introduce and use
mptcp_can_accept_new_subflow()"), we unconditionally dereference
- in mptcp_can_accept_new_subflow - the subflow request msk
before performing hmac validation. In the above scenario we
hit a NULL ptr dereference.

Address the issue doing the hmac validation earlier.

Fixes: 4cf8b7e48a ("subflow: introduce and use mptcp_can_accept_new_subflow()")
Tested-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Link: https://lore.kernel.org/r/03b2cfa3ac80d8fc18272edc6442a9ddf0b1e34e.1606400227.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 11:05:31 -08:00
Eelco Chaudron
69929d4c49 net: openvswitch: fix TTL decrement action netlink message format
Currently, the openvswitch module is not accepting the correctly formated
netlink message for the TTL decrement action. For both setting and getting
the dec_ttl action, the actions should be nested in the
OVS_DEC_TTL_ATTR_ACTION attribute as mentioned in the openvswitch.h uapi.

When the original patch was sent, it was tested with a private OVS userspace
implementation. This implementation was unfortunately not upstreamed and
reviewed, hence an erroneous version of this patch was sent out.

Leaving the patch as-is would cause problems as the kernel module could
interpret additional attributes as actions and vice-versa, due to the
actions not being encapsulated/nested within the actual attribute, but
being concatinated after it.

Fixes: 744676e777 ("openvswitch: add TTL decrement action")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Link: https://lore.kernel.org/r/160622121495.27296.888010441924340582.stgit@wsfd-netdev64.ntdv.lab.eng.bos.redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 11:03:06 -08:00
Pablo Neira Ayuso
a5d45bc0dc netfilter: nftables_offload: build mask based from the matching bytes
Userspace might match on prefix bytes of header fields if they are on
the byte boundary, this requires that the mask is adjusted accordingly.
Use NFT_OFFLOAD_MATCH_EXACT() for meta since prefix byte matching is not
allowed for this type of selector.

The bitwise expression might be optimized out by userspace, hence the
kernel needs to infer the prefix from the number of payload bytes to
match on. This patch adds nft_payload_offload_mask() to calculate the
bitmask to match on the prefix.

Fixes: c9626a2cbd ("netfilter: nf_tables: add hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-11-27 12:10:47 +01:00
Pablo Neira Ayuso
3c78e9e0d3 netfilter: nftables_offload: set address type in control dissector
This patch adds nft_flow_rule_set_addr_type() to set the address type
from the nft_payload expression accordingly.

If the address type is not set in the control dissector then a rule that
matches either on source or destination IP address does not work.

After this patch, nft hardware offload generates the flow dissector
configuration as tc-flower does to match on an IP address.

This patch has been also tested functionally to make sure packets are
filtered out by the NIC.

This is also getting the code aligned with the existing netfilter flow
offload infrastructure which is also setting the control dissector.

Fixes: c9626a2cbd ("netfilter: nf_tables: add hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-11-27 12:10:46 +01:00
Wang Hai
4bc3c8dc9f ipvs: fix possible memory leak in ip_vs_control_net_init
kmemleak report a memory leak as follows:

BUG: memory leak
unreferenced object 0xffff8880759ea000 (size 256):
backtrace:
[<00000000c0bf2deb>] kmem_cache_zalloc include/linux/slab.h:656 [inline]
[<00000000c0bf2deb>] __proc_create+0x23d/0x7d0 fs/proc/generic.c:421
[<000000009d718d02>] proc_create_reg+0x8e/0x140 fs/proc/generic.c:535
[<0000000097bbfc4f>] proc_create_net_data+0x8c/0x1b0 fs/proc/proc_net.c:126
[<00000000652480fc>] ip_vs_control_net_init+0x308/0x13a0 net/netfilter/ipvs/ip_vs_ctl.c:4169
[<000000004c927ebe>] __ip_vs_init+0x211/0x400 net/netfilter/ipvs/ip_vs_core.c:2429
[<00000000aa6b72d9>] ops_init+0xa8/0x3c0 net/core/net_namespace.c:151
[<00000000153fd114>] setup_net+0x2de/0x7e0 net/core/net_namespace.c:341
[<00000000be4e4f07>] copy_net_ns+0x27d/0x530 net/core/net_namespace.c:482
[<00000000f1c23ec9>] create_new_namespaces+0x382/0xa30 kernel/nsproxy.c:110
[<00000000098a5757>] copy_namespaces+0x2e6/0x3b0 kernel/nsproxy.c:179
[<0000000026ce39e9>] copy_process+0x220a/0x5f00 kernel/fork.c:2072
[<00000000b71f4efe>] _do_fork+0xc7/0xda0 kernel/fork.c:2428
[<000000002974ee96>] __do_sys_clone3+0x18a/0x280 kernel/fork.c:2703
[<0000000062ac0a4d>] do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
[<0000000093f1ce2c>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

In the error path of ip_vs_control_net_init(), remove_proc_entry() needs
to be called to remove the added proc entry, otherwise a memory leak
will occur.

Also, add some '#ifdef CONFIG_PROC_FS' because proc_create_net* return NULL
when PROC is not used.

Fixes: b17fc9963f ("IPVS: netns, ip_vs_stats and its procfs")
Fixes: 61b1ab4583 ("IPVS: netns, add basic init per netns.")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-11-27 12:10:46 +01:00
Ingo Molnar
a787bdaff8 Merge branch 'linus' into sched/core, to resolve semantic conflict
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-11-27 11:10:50 +01:00
Antony Antony
c7a5899eb2 xfrm: redact SA secret with lockdown confidentiality
redact XFRM SA secret in the netlink response to xfrm_get_sa()
or dumpall sa.
Enable lockdown, confidentiality mode, at boot or at run time.

e.g. when enabled:
cat /sys/kernel/security/lockdown
none integrity [confidentiality]

ip xfrm state
src 172.16.1.200 dst 172.16.1.100
	proto esp spi 0x00000002 reqid 2 mode tunnel
	replay-window 0
	aead rfc4106(gcm(aes)) 0x0000000000000000000000000000000000000000 96

note: the aead secret is redacted.
Redacting secret is also a FIPS 140-2 requirement.

v1->v2
 - add size checks before memset calls
v2->v3
 - replace spaces with tabs for consistency
v3->v4
 - use kernel lockdown instead of a /proc setting
v4->v5
 - remove kconfig option

Reviewed-by: Stephan Mueller <smueller@chronox.de>
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-11-27 11:03:06 +01:00
Oliver Hartkopp
d73ff9b7c4 can: af_can: can_rx_unregister(): remove WARN() statement from list operation sanity check
To detect potential bugs in CAN protocol implementations (double removal of
receiver entries) a WARN() statement has been used if no matching list item was
found for removal.

The fault injection issued by syzkaller was able to create a situation where
the closing of a socket runs simultaneously to the notifier call chain for
removing the CAN network device in use.

This case is very unlikely in real life but it doesn't break anything.
Therefore we just replace the WARN() statement with pr_warn() to preserve the
notification for the CAN protocol development.

Reported-by: syzbot+381d06e0c8eaacb8706f@syzkaller.appspotmail.com
Reported-by: syzbot+d0ddd88c9a7432f041e6@syzkaller.appspotmail.com
Reported-by: syzbot+76d62d3b8162883c7d11@syzkaller.appspotmail.com
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20201126192140.14350-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-11-27 10:49:28 +01:00
Sven Eckelmann
992b03b88e batman-adv: Don't always reallocate the fragmentation skb head
When a packet is fragmented by batman-adv, the original batman-adv header
is not modified. Only a new fragmentation is inserted between the original
one and the ethernet header. The code must therefore make sure that it has
a writable region of this size in the skbuff head.

But it is not useful to always reallocate the skbuff by this size even when
there would be more than enough headroom still in the skb. The reallocation
is just to costly during in this codepath.

Fixes: ee75ed8887 ("batman-adv: Fragment and send skbs larger than mtu")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-11-27 08:02:55 +01:00
Sven Eckelmann
c5cbfc8755 batman-adv: Reserve needed_*room for fragments
The batadv net_device is trying to propagate the needed_headroom and
needed_tailroom from the lower devices. This is needed to avoid cost
intensive reallocations using pskb_expand_head during the transmission.

But the fragmentation code split the skb's without adding extra room at the
end/beginning of the various fragments. This reduced the performance of
transmissions over complex scenarios (batadv on vxlan on wireguard) because
the lower devices had to perform the reallocations at least once.

Fixes: ee75ed8887 ("batman-adv: Fragment and send skbs larger than mtu")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-11-27 08:02:55 +01:00
Sven Eckelmann
4ca23e2c20 batman-adv: Consider fragmentation for needed_headroom
If a batman-adv packets has to be fragmented, then the original batman-adv
packet header is not stripped away. Instead, only a new header is added in
front of the packet after it was split.

This size must be considered to avoid cost intensive reallocations during
the transmission through the various device layers.

Fixes: 7bca68c784 ("batman-adv: Add lower layer needed_(head|tail)room to own ones")
Reported-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-11-27 08:02:55 +01:00
Maxim Mikityanskiy
025cc2fb6a net/tls: Protect from calling tls_dev_del for TLS RX twice
tls_device_offload_cleanup_rx doesn't clear tls_ctx->netdev after
calling tls_dev_del if TLX TX offload is also enabled. Clearing
tls_ctx->netdev gets postponed until tls_device_gc_task. It leaves a
time frame when tls_device_down may get called and call tls_dev_del for
RX one extra time, confusing the driver, which may lead to a crash.

This patch corrects this racy behavior by adding a flag to prevent
tls_device_down from calling tls_dev_del the second time.

Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20201125221810.69870-1-saeedm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-25 17:31:06 -08:00
Parav Pandit
a7b4364950 devlink: Make sure devlink instance and port are in same net namespace
When devlink reload operation is not used, netdev of an Ethernet port may
be present in different net namespace than the net namespace of the
devlink instance.

Ensure that both the devlink instance and devlink port netdev are located
in same net namespace.

Fixes: 070c63f20f ("net: devlink: allow to change namespaces during reload")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-25 17:26:34 -08:00
Parav Pandit
b187c9b417 devlink: Hold rtnl lock while reading netdev attributes
A netdevice of a devlink port can be moved to different net namespace
than its parent devlink instance.
This scenario occurs when devlink reload is not used.

When netdevice is undergoing migration to net namespace, its ifindex
and name may change.

In such use case, devlink port query may read stale netdev attributes.

Fix it by reading them under rtnl lock.

Fixes: bfcd3a4661 ("Introduce devlink infrastructure")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-25 17:26:34 -08:00
Florian Westphal
c0700dfa2c netfilter: nf_tables: avoid false-postive lockdep splat
There are reports wrt lockdep splat in nftables, e.g.:
------------[ cut here ]------------
WARNING: CPU: 2 PID: 31416 at net/netfilter/nf_tables_api.c:622
lockdep_nfnl_nft_mutex_not_held+0x28/0x38 [nf_tables]
...

These are caused by an earlier, unrelated bug such as a n ABBA deadlock
in a different subsystem.
In such an event, lockdep is disabled and lockdep_is_held returns true
unconditionally.  This then causes the WARN() in nf_tables.

Make the WARN conditional on lockdep still active to avoid this.

Fixes: f102d66b33 ("netfilter: nf_tables: use dedicated mutex to guard transactions")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Link: https://lore.kernel.org/linux-kselftest/CA+G9fYvFUpODs+NkSYcnwKnXm62tmP=ksLeBPmB+KFrB2rvCtQ@mail.gmail.com/
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-11-26 00:09:42 +01:00
Eric Dumazet
68ad89de91 netfilter: ipset: prevent uninit-value in hash_ip6_add
syzbot found that we are not validating user input properly
before copying 16 bytes [1].

Using NLA_BINARY in ipaddr_policy[] for IPv6 address is not correct,
since it ensures at most 16 bytes were provided.

We should instead make sure user provided exactly 16 bytes.

In old kernels (before v4.20), fix would be to remove the NLA_BINARY,
since NLA_POLICY_EXACT_LEN() was not yet available.

[1]
BUG: KMSAN: uninit-value in hash_ip6_add+0x1cba/0x3a50 net/netfilter/ipset/ip_set_hash_gen.h:892
CPU: 1 PID: 11611 Comm: syz-executor.0 Not tainted 5.10.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x21c/0x280 lib/dump_stack.c:118
 kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:118
 __msan_warning+0x5f/0xa0 mm/kmsan/kmsan_instr.c:197
 hash_ip6_add+0x1cba/0x3a50 net/netfilter/ipset/ip_set_hash_gen.h:892
 hash_ip6_uadt+0x976/0xbd0 net/netfilter/ipset/ip_set_hash_ip.c:267
 call_ad+0x329/0xd00 net/netfilter/ipset/ip_set_core.c:1720
 ip_set_ad+0x111f/0x1440 net/netfilter/ipset/ip_set_core.c:1808
 ip_set_uadd+0xf6/0x110 net/netfilter/ipset/ip_set_core.c:1833
 nfnetlink_rcv_msg+0xc7d/0xdf0 net/netfilter/nfnetlink.c:252
 netlink_rcv_skb+0x70a/0x820 net/netlink/af_netlink.c:2494
 nfnetlink_rcv+0x4f0/0x4380 net/netfilter/nfnetlink.c:600
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x11da/0x14b0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0x173c/0x1840 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:651 [inline]
 sock_sendmsg net/socket.c:671 [inline]
 ____sys_sendmsg+0xc7a/0x1240 net/socket.c:2353
 ___sys_sendmsg net/socket.c:2407 [inline]
 __sys_sendmsg+0x6d5/0x830 net/socket.c:2440
 __do_sys_sendmsg net/socket.c:2449 [inline]
 __se_sys_sendmsg+0x97/0xb0 net/socket.c:2447
 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2447
 do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45deb9
Code: 0d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fe2e503fc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000029ec0 RCX: 000000000045deb9
RDX: 0000000000000000 RSI: 0000000020000140 RDI: 0000000000000003
RBP: 000000000118bf60 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118bf2c
R13: 000000000169fb7f R14: 00007fe2e50409c0 R15: 000000000118bf2c

Uninit was stored to memory at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:121 [inline]
 kmsan_internal_chain_origin+0xad/0x130 mm/kmsan/kmsan.c:289
 __msan_chain_origin+0x57/0xa0 mm/kmsan/kmsan_instr.c:147
 ip6_netmask include/linux/netfilter/ipset/pfxlen.h:49 [inline]
 hash_ip6_netmask net/netfilter/ipset/ip_set_hash_ip.c:185 [inline]
 hash_ip6_uadt+0xb1c/0xbd0 net/netfilter/ipset/ip_set_hash_ip.c:263
 call_ad+0x329/0xd00 net/netfilter/ipset/ip_set_core.c:1720
 ip_set_ad+0x111f/0x1440 net/netfilter/ipset/ip_set_core.c:1808
 ip_set_uadd+0xf6/0x110 net/netfilter/ipset/ip_set_core.c:1833
 nfnetlink_rcv_msg+0xc7d/0xdf0 net/netfilter/nfnetlink.c:252
 netlink_rcv_skb+0x70a/0x820 net/netlink/af_netlink.c:2494
 nfnetlink_rcv+0x4f0/0x4380 net/netfilter/nfnetlink.c:600
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x11da/0x14b0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0x173c/0x1840 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:651 [inline]
 sock_sendmsg net/socket.c:671 [inline]
 ____sys_sendmsg+0xc7a/0x1240 net/socket.c:2353
 ___sys_sendmsg net/socket.c:2407 [inline]
 __sys_sendmsg+0x6d5/0x830 net/socket.c:2440
 __do_sys_sendmsg net/socket.c:2449 [inline]
 __se_sys_sendmsg+0x97/0xb0 net/socket.c:2447
 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2447
 do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Uninit was stored to memory at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:121 [inline]
 kmsan_internal_chain_origin+0xad/0x130 mm/kmsan/kmsan.c:289
 kmsan_memcpy_memmove_metadata+0x25e/0x2d0 mm/kmsan/kmsan.c:226
 kmsan_memcpy_metadata+0xb/0x10 mm/kmsan/kmsan.c:246
 __msan_memcpy+0x46/0x60 mm/kmsan/kmsan_instr.c:110
 ip_set_get_ipaddr6+0x2cb/0x370 net/netfilter/ipset/ip_set_core.c:310
 hash_ip6_uadt+0x439/0xbd0 net/netfilter/ipset/ip_set_hash_ip.c:255
 call_ad+0x329/0xd00 net/netfilter/ipset/ip_set_core.c:1720
 ip_set_ad+0x111f/0x1440 net/netfilter/ipset/ip_set_core.c:1808
 ip_set_uadd+0xf6/0x110 net/netfilter/ipset/ip_set_core.c:1833
 nfnetlink_rcv_msg+0xc7d/0xdf0 net/netfilter/nfnetlink.c:252
 netlink_rcv_skb+0x70a/0x820 net/netlink/af_netlink.c:2494
 nfnetlink_rcv+0x4f0/0x4380 net/netfilter/nfnetlink.c:600
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x11da/0x14b0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0x173c/0x1840 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:651 [inline]
 sock_sendmsg net/socket.c:671 [inline]
 ____sys_sendmsg+0xc7a/0x1240 net/socket.c:2353
 ___sys_sendmsg net/socket.c:2407 [inline]
 __sys_sendmsg+0x6d5/0x830 net/socket.c:2440
 __do_sys_sendmsg net/socket.c:2449 [inline]
 __se_sys_sendmsg+0x97/0xb0 net/socket.c:2447
 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2447
 do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:121 [inline]
 kmsan_internal_poison_shadow+0x5c/0xf0 mm/kmsan/kmsan.c:104
 kmsan_slab_alloc+0x8d/0xe0 mm/kmsan/kmsan_hooks.c:76
 slab_alloc_node mm/slub.c:2906 [inline]
 __kmalloc_node_track_caller+0xc61/0x15f0 mm/slub.c:4512
 __kmalloc_reserve net/core/skbuff.c:142 [inline]
 __alloc_skb+0x309/0xae0 net/core/skbuff.c:210
 alloc_skb include/linux/skbuff.h:1094 [inline]
 netlink_alloc_large_skb net/netlink/af_netlink.c:1176 [inline]
 netlink_sendmsg+0xdb8/0x1840 net/netlink/af_netlink.c:1894
 sock_sendmsg_nosec net/socket.c:651 [inline]
 sock_sendmsg net/socket.c:671 [inline]
 ____sys_sendmsg+0xc7a/0x1240 net/socket.c:2353
 ___sys_sendmsg net/socket.c:2407 [inline]
 __sys_sendmsg+0x6d5/0x830 net/socket.c:2440
 __do_sys_sendmsg net/socket.c:2449 [inline]
 __se_sys_sendmsg+0x97/0xb0 net/socket.c:2447
 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2447
 do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: a7b4f989a6 ("netfilter: ipset: IP set core support")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-11-26 00:09:42 +01:00
Yunsheng Lin
6454eca81e net: Use lockdep_assert_in_softirq() in napi_consume_skb()
Use napi_consume_skb() to assert the case when it is not called
in a atomic softirq context.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-25 15:08:38 -08:00
Paolo Abeni
fd8976790a mptcp: be careful on MPTCP-level ack.
We can enter the main mptcp_recvmsg() loop even when
no subflows are connected. As note by Eric, that would
result in a divide by zero oops on ack generation.

Address the issue by checking the subflow status before
sending the ack.

Additionally protect mptcp_recvmsg() against invocation
with weird socket states.

v1 -> v2:
 - removed unneeded inline keyword - Jakub

Reported-and-suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: ea4ca586b1 ("mptcp: refine MPTCP-level ack scheduling")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/5370c0ae03449239e3d1674ddcfb090cf6f20abe.1606253206.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-25 13:36:16 -08:00
Horatiu Vultur
bfd042321a bridge: mrp: Implement LC mode for MRP
Extend MRP to support LC mode(link check) for the interconnect port.
This applies only to the interconnect ring.

Opposite to RC mode(ring check) the LC mode is using CFM frames to
detect when the link goes up or down and based on that the userspace
will need to react.
One advantage of the LC mode over RC mode is that there will be fewer
frames in the normal rings. Because RC mode generates InTest on all
ports while LC mode sends CFM frame only on the interconnect port.

All 4 nodes part of the interconnect ring needs to have the same mode.
And it is not possible to have running LC and RC mode at the same time
on a node.

Whenever the MIM starts it needs to detect the status of the other 3
nodes in the interconnect ring so it would send a frame called
InLinkStatus, on which the clients needs to reply with their link
status.

This patch adds InLinkStatus frame type and extends existing rules on
how to forward this frame.

Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20201124082525.273820-1-horatiu.vultur@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-25 13:33:35 -08:00
Vlad Buslov
f460019b4c net: sched: alias action flags with TCA_ACT_ prefix
Currently both filter and action flags use same "TCA_" prefix which makes
them hard to distinguish to code and confusing for users. Create aliases
for existing action flags constants with "TCA_ACT_" prefix.

Signed-off-by: Vlad Buslov <vlad@buslov.dev>
Link: https://lore.kernel.org/r/20201124164054.893168-1-vlad@buslov.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-25 12:34:44 -08:00
Florian Westphal
b6d69fc8e8 mptcp: put reference in mptcp timeout timer
On close this timer might be scheduled. mptcp uses sk_reset_timer for
this, so the a reference on the mptcp socket is taken.

This causes a refcount leak which can for example be reproduced
with 'mp_join_server_v4.pkt' from the mptcp-packetdrill repo.

The leak has nothing to do with join requests, v1_mp_capable_bind_no_cs.pkt
works too when replacing the last ack mpcapable to v1 instead of v0.

unreferenced object 0xffff888109bba040 (size 2744):
  comm "packetdrill", [..]
  backtrace:
    [..] sk_prot_alloc.isra.0+0x2b/0xc0
    [..] sk_clone_lock+0x2f/0x740
    [..] mptcp_sk_clone+0x33/0x1a0
    [..] subflow_syn_recv_sock+0x2b1/0x690 [..]

Fixes: e16163b6e2 ("mptcp: refactor shutdown and close")
Cc: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20201124162446.11448-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-25 12:32:45 -08:00
Eric Dumazet
2543a6000e gro_cells: reduce number of synchronize_net() calls
After cited commit, gro_cells_destroy() became damn slow
on hosts with a lot of cores.

This is because we have one additional synchronize_net() per cpu as
stated in the changelog.

gro_cells_init() is setting NAPI_STATE_NO_BUSY_POLL, and this was enough
to not have one synchronize_net() call per netif_napi_del()

We can factorize all the synchronize_net() to a single one,
right before freeing per-cpu memory.

Fixes: 5198d545db ("net: remove napi_hash_del() from driver-facing API")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201124203822.1360107-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-25 11:28:12 -08:00
Wang Hai
e255e11e66 ipv6: addrlabel: fix possible memory leak in ip6addrlbl_net_init
kmemleak report a memory leak as follows:

unreferenced object 0xffff8880059c6a00 (size 64):
  comm "ip", pid 23696, jiffies 4296590183 (age 1755.384s)
  hex dump (first 32 bytes):
    20 01 00 10 00 00 00 00 00 00 00 00 00 00 00 00   ...............
    1c 00 00 00 00 00 00 00 00 00 00 00 07 00 00 00  ................
  backtrace:
    [<00000000aa4e7a87>] ip6addrlbl_add+0x90/0xbb0
    [<0000000070b8d7f1>] ip6addrlbl_net_init+0x109/0x170
    [<000000006a9ca9d4>] ops_init+0xa8/0x3c0
    [<000000002da57bf2>] setup_net+0x2de/0x7e0
    [<000000004e52d573>] copy_net_ns+0x27d/0x530
    [<00000000b07ae2b4>] create_new_namespaces+0x382/0xa30
    [<000000003b76d36f>] unshare_nsproxy_namespaces+0xa1/0x1d0
    [<0000000030653721>] ksys_unshare+0x3a4/0x780
    [<0000000007e82e40>] __x64_sys_unshare+0x2d/0x40
    [<0000000031a10c08>] do_syscall_64+0x33/0x40
    [<0000000099df30e7>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

We should free all rules when we catch an error in ip6addrlbl_net_init().
otherwise a memory leak will occur.

Fixes: 2a8cc6c890 ("[IPV6] ADDRCONF: Support RFC3484 configurable address selection policy table.")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Link: https://lore.kernel.org/r/20201124071728.8385-1-wanghai38@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-25 11:20:16 -08:00
Jakub Kicinski
26c8996526 Here is a batman-adv bugfix:
- set module owner to THIS_MODULE, by Taehee Yoo
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl+9DbQWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoWlxD/0TOgtU9fVykvjMG3U0rnB6pvp7
 rFH1Srqg5hsJ2PNLAF8fr0o4PDTyiAztirdJ4C4+ietug0ozxe4GZARB+/rPEk/3
 u558cDMZaqLOo195g70s+8cr0XXPy8Bt4ip/X4uW8DNTluKMiQcKNDrVkc6Cysa/
 5cnT4wf6RftG9eMbtG1wdEzqG/goQpmrP+bo+J1QdBQEycqMwegTWYZyU9behS+m
 u47FIa20XQ8FmoZx3oooGNb7z9iDHzwVc7PMHw19Xv14bahtz1BwydERdJMde3hP
 UD4HeAQla6Gl0ZO2pS5MuCQxEu8aphel75a3g+D4XCdyyIl1B3FzK1F+DWdoHlnx
 g0g7xv0nnfpF2wVhGkgIKfssQXino9YPyIjl/zLJ7YXXrTZCzkOjGiumSvrvXgox
 mTtvDvbSIOylQksPF5UJtoOo6mWihg7HERyrOwfIV6UxKuqbKhv3e17gUlkZPIcp
 ec4iFjGheE0H00zSzBkeb+76M0XQmDTbvmjUfOxaxP4o+u5W27QE+HznOTrQ9uBa
 FrMt++XjxSfmVsiFrpqE4LQyYfcFjO3Ei5yYfR3tYfDx4qegDvqksLbzb3Fkp2G9
 Pz9HEc+PYJk/w3N2Q+zSpbDJfUvHjyIPDeqsTdtbbUPUyfkGmghT7F9V+vof2jqm
 6pijD3ji8JN1wmh9yw==
 =Qr0B
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-pullrequest-20201124' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here is a batman-adv bugfix:

 - set module owner to THIS_MODULE, by Taehee Yoo

* tag 'batadv-net-pullrequest-20201124' of git://git.open-mesh.org/linux-merge:
  batman-adv: set .owner to THIS_MODULE
====================

Link: https://lore.kernel.org/r/20201124134417.17269-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-24 16:49:14 -08:00
Alexander Duyck
407c85c7dd tcp: Set ECT0 bit in tos/tclass for synack when BPF needs ECN
When a BPF program is used to select between a type of TCP congestion
control algorithm that uses either ECN or not there is a case where the
synack for the frame was coming up without the ECT0 bit set. A bit of
research found that this was due to the final socket being configured to
dctcp while the listener socket was staying in cubic.

To reproduce it all that is needed is to monitor TCP traffic while running
the sample bpf program "samples/bpf/tcp_cong_kern.c". What is observed,
assuming tcp_dctcp module is loaded or compiled in and the traffic matches
the rules in the sample file, is that for all frames with the exception of
the synack the ECT0 bit is set.

To address that it is necessary to make one additional call to
tcp_bpf_ca_needs_ecn using the request socket and then use the output of
that to set the ECT0 bit for the tos/tclass of the packet.

Fixes: 91b5b21c7c ("bpf: Add support for changing congestion control")
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/160593039663.2604.1374502006916871573.stgit@localhost.localdomain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-24 14:12:55 -08:00
Heiner Kallweit
1d155dfdf5 net: warn if gso_type isn't set for a GSO SKB
In bug report [0] a warning in r8169 driver was reported that was
caused by an invalid GSO SKB (gso_type was 0). See [1] for a discussion
about this issue. Still the origin of the invalid GSO SKB isn't clear.

It shouldn't be a network drivers task to check for invalid GSO SKB's.
Also, even if issue [0] can be fixed, we can't be sure that a
similar issue doesn't pop up again at another place.
Therefore let gso_features_check() check for such invalid GSO SKB's.

[0] https://bugzilla.kernel.org/show_bug.cgi?id=209423
[1] https://www.spinics.net/lists/netdev/msg690794.html

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/97c78d21-7f0b-d843-df17-3589f224d2cf@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-24 14:02:10 -08:00
Björn Töpel
36ccdf8582 net, xsk: Avoid taking multiple skbuff references
Commit 642e450b6b ("xsk: Do not discard packet when NETDEV_TX_BUSY")
addressed the problem that packets were discarded from the Tx AF_XDP
ring, when the driver returned NETDEV_TX_BUSY. Part of the fix was
bumping the skbuff reference count, so that the buffer would not be
freed by dev_direct_xmit(). A reference count larger than one means
that the skbuff is "shared", which is not the case.

If the "shared" skbuff is sent to the generic XDP receive path,
netif_receive_generic_xdp(), and pskb_expand_head() is entered the
BUG_ON(skb_shared(skb)) will trigger.

This patch adds a variant to dev_direct_xmit(), __dev_direct_xmit(),
where a user can select the skbuff free policy. This allows AF_XDP to
avoid bumping the reference count, but still keep the NETDEV_TX_BUSY
behavior.

Fixes: 642e450b6b ("xsk: Do not discard packet when NETDEV_TX_BUSY")
Reported-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201123175600.146255-1-bjorn.topel@gmail.com
2020-11-24 22:39:56 +01:00
Moshe Shemesh
5204bb683c devlink: Fix reload stats structure
Fix reload stats structure exposed to the user. Change stats structure
hierarchy to have the reload action as a parent of the stat entry and
then stat entry includes value per limit. This will also help to avoid
string concatenation on iproute2 output.

Reload stats structure before this fix:
"stats": {
    "reload": {
        "driver_reinit": 2,
        "fw_activate": 1,
        "fw_activate_no_reset": 0
     }
}

After this fix:
"stats": {
    "reload": {
        "driver_reinit": {
            "unspecified": 2
        },
        "fw_activate": {
            "unspecified": 1,
            "no_reset": 0
        }
}

Fixes: a254c26426 ("devlink: Add reload stats")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/1606109785-25197-1-git-send-email-moshe@mellanox.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-24 13:04:04 -08:00
Ido Schimmel
f0a5013e29 devlink: Add blackhole_nexthop trap
Add a packet trap to report packets that were dropped due to a
blackhole nexthop.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-24 12:14:56 -08:00
Jakub Kicinski
23c01ed3b0 rxrpc development
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl+8E4YACgkQ+7dXa6fL
 C2tehQ/8C1O+p1UBmQ4k/eXkYqsIvRQJGXGheSClxvym8tQiut/sZEKsCqucFmO8
 Y2gKEYJB3xYIqbZjjJUFYbL70Q5qG/cXgKn3rCpMjhn/0UAoUSwaGyoURcLwaz2B
 asipDSvbvZk7OwX4ScUTi/b/7AHtO6l16Y6yRzUqneOsSkXYk1+HhK2/4rPDKEJf
 cxocGIjjWs4d1erG2Aqgn3BZ2GsDNb88NJYi5qgc0ywFgRacX7oL0WFeu6pD1Ja+
 8vZLjFYB/bDeLBlvzpnWal99lUiSee8BbUoQn1iGJ3EekUVWkkNNfjhpYRjzIIR5
 GsJfbDHP16PHhLZgxBvNBwkWoqqBw3X/Zn0dkBDZgjlZQvLbeds6QRd6Ag/M2CJI
 D1QtmBDBtr4UZfRFm+5hfpTqf2jeOc4o68WROH3FEdsnKvICkttKucE9LaykokFD
 gQLNmkn2tb1D1yS2Y4iPucV3kYUU1Y6yw++Ck0cE2isBauN0gHtr9UAXZEc57ILc
 hXCFrJTbGkph3TimQDd1QbXZ28eExRBnn+H5RyYHeEidv03d2UO5uFZaJ5b5UGX0
 sE+3aeWLTH3fFTBvMnYPyFjzalSky8V5zOlUltwVCx7PpiDe6P7rKpO+ZXEi4OXP
 mHz7HM6h1iEnnrqEGNz7L39pTqKVbfBxDMvFra07dyKkvVthF1w=
 =TvbK
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-next-20201123' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Prelude to gssapi support

Here are some patches that do some reorganisation of the security class
handling in rxrpc to allow implementation of the RxGK security class that
will allow AF_RXRPC to use GSSAPI-negotiated tokens and better crypto.  The
RxGK security class is not included in this patchset.

It does the following things:

 (1) Add a keyrings patch to provide the original key description, as
     provided to add_key(), to the payload preparser so that it can
     interpret the content on that basis.  Unfortunately, the rxrpc_s key
     type wasn't written to interpret its payload as anything other than a
     string of bytes comprising a key, but for RxGK, more information is
     required as multiple Kerberos enctypes are supported.

 (2) Remove the rxk5 security class key parsing.  The rxk5 class never got
     rolled out in OpenAFS and got replaced with rxgk.

 (3) Support the creation of rxrpc keys with multiple tokens of different
     types.  If some types are not supported, the ENOPKG error is
     suppressed if at least one other token's type is supported.

 (4) Punt the handling of server keys (rxrpc_s type) to the appropriate
     security class.

 (5) Organise the security bits in the rxrpc_connection struct into a
     union to make it easier to override for other classes.

 (6) Move some bits from core code into rxkad that won't be appropriate to
     rxgk.

* tag 'rxrpc-next-20201123' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  rxrpc: Ask the security class how much space to allow in a packet
  rxrpc: rxkad: Don't use pskb_pull() to advance through the response packet
  rxrpc: Organise connection security to use a union
  rxrpc: Don't reserve security header in Tx DATA skbuff
  rxrpc: Merge prime_packet_security into init_connection_security
  rxrpc: Fix example key name in a comment
  rxrpc: Ignore unknown tokens in key payload unless no known tokens
  rxrpc: Make the parsing of xdr payloads more coherent
  rxrpc: Allow security classes to give more info on server keys
  rxrpc: Don't leak the service-side session key to userspace
  rxrpc: Hand server key parsing off to the security class
  rxrpc: Split the server key type (rxrpc_s) into its own file
  rxrpc: Don't retain the server key in the connection
  rxrpc: Support keys with multiple authentication tokens
  rxrpc: List the held token types in the key description in /proc/keys
  rxrpc: Remove the rxk5 security class as it's now defunct
  keys: Provide the original description to the key preparser
====================

Link: https://lore.kernel.org/r/160616220405.830164.2239716599743995145.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-24 12:05:58 -08:00
Peter Zijlstra
545b8c8df4 smp: Cleanup smp_call_function*()
Get rid of the __call_single_node union and cleanup the API a little
to avoid external code relying on the structure layout as much.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
2020-11-24 16:47:49 +01:00
Christophe JAILLET
5112cf59d7 sctp: Fix some typo
s/tranport/transport/

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/20201122180704.1366636-1-christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-23 17:44:11 -08:00
Eyal Birger
d549699048 net/packet: fix packet receive on L3 devices without visible hard header
In the patchset merged by commit b9fcf0a0d8
("Merge branch 'support-AF_PACKET-for-layer-3-devices'") L3 devices which
did not have header_ops were given one for the purpose of protocol parsing
on af_packet transmit path.

That change made af_packet receive path regard these devices as having a
visible L3 header and therefore aligned incoming skb->data to point to the
skb's mac_header. Some devices, such as ipip, xfrmi, and others, do not
reset their mac_header prior to ingress and therefore their incoming
packets became malformed.

Ideally these devices would reset their mac headers, or af_packet would be
able to rely on dev->hard_header_len being 0 for such cases, but it seems
this is not the case.

Fix by changing af_packet RX ll visibility criteria to include the
existence of a '.create()' header operation, which is used when creating
a device hard header - via dev_hard_header() - by upper layers, and does
not exist in these L3 devices.

As this predicate may be useful in other situations, add it as a common
dev_has_header() helper in netdevice.h.

Fixes: b9fcf0a0d8 ("Merge branch 'support-AF_PACKET-for-layer-3-devices'")
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20201121062817.3178900-1-eyal.birger@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-23 17:29:36 -08:00
Jakub Kicinski
cc69837fca net: don't include ethtool.h from netdevice.h
linux/netdevice.h is included in very many places, touching any
of its dependecies causes large incremental builds.

Drop the linux/ethtool.h include, linux/netdevice.h just needs
a forward declaration of struct ethtool_ops.

Fix all the places which made use of this implicit include.

Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Shannon Nelson <snelson@pensando.io>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Link: https://lore.kernel.org/r/20201120225052.1427503-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-23 17:27:04 -08:00
Kurt Kanzenbach
8551fad63c net: dsa: tag_hellcreek: Cleanup includes
Remove unused and add needed includes. No functional change.

Suggested-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-23 16:57:21 -08:00
Stefano Garzarella
3fe356d58e vsock/virtio: discard packets only when socket is really closed
Starting from commit 8692cefc43 ("virtio_vsock: Fix race condition
in virtio_transport_recv_pkt"), we discard packets in
virtio_transport_recv_pkt() if the socket has been released.

When the socket is connected, we schedule a delayed work to wait the
RST packet from the other peer, also if SHUTDOWN_MASK is set in
sk->sk_shutdown.
This is done to complete the virtio-vsock shutdown algorithm, releasing
the port assigned to the socket definitively only when the other peer
has consumed all the packets.

If we discard the RST packet received, the socket will be closed only
when the VSOCK_CLOSE_TIMEOUT is reached.

Sergio discovered the issue while running ab(1) HTTP benchmark using
libkrun [1] and observing a latency increase with that commit.

To avoid this issue, we discard packet only if the socket is really
closed (SOCK_DONE flag is set).
We also set SOCK_DONE in virtio_transport_release() when we don't need
to wait any packets from the other peer (we didn't schedule the delayed
work). In this case we remove the socket from the vsock lists, releasing
the port assigned.

[1] https://github.com/containers/libkrun

Fixes: 8692cefc43 ("virtio_vsock: Fix race condition in virtio_transport_recv_pkt")
Cc: justin.he@arm.com
Reported-by: Sergio Lopez <slp@redhat.com>
Tested-by: Sergio Lopez <slp@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Jia He <justin.he@arm.com>
Link: https://lore.kernel.org/r/20201120104736.73749-1-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-23 16:36:29 -08:00
Ricardo Dias
01770a1661 tcp: fix race condition when creating child sockets from syncookies
When the TCP stack is in SYN flood mode, the server child socket is
created from the SYN cookie received in a TCP packet with the ACK flag
set.

The child socket is created when the server receives the first TCP
packet with a valid SYN cookie from the client. Usually, this packet
corresponds to the final step of the TCP 3-way handshake, the ACK
packet. But is also possible to receive a valid SYN cookie from the
first TCP data packet sent by the client, and thus create a child socket
from that SYN cookie.

Since a client socket is ready to send data as soon as it receives the
SYN+ACK packet from the server, the client can send the ACK packet (sent
by the TCP stack code), and the first data packet (sent by the userspace
program) almost at the same time, and thus the server will equally
receive the two TCP packets with valid SYN cookies almost at the same
instant.

When such event happens, the TCP stack code has a race condition that
occurs between the momement a lookup is done to the established
connections hashtable to check for the existence of a connection for the
same client, and the moment that the child socket is added to the
established connections hashtable. As a consequence, this race condition
can lead to a situation where we add two child sockets to the
established connections hashtable and deliver two sockets to the
userspace program to the same client.

This patch fixes the race condition by checking if an existing child
socket exists for the same client when we are adding the second child
socket to the established connections socket. If an existing child
socket exists, we drop the packet and discard the second child socket
to the same client.

Signed-off-by: Ricardo Dias <rdias@singlestore.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-23 16:32:33 -08:00
Paul Moore
3df98d7921 lsm,selinux: pass flowi_common instead of flowi to the LSM hooks
As pointed out by Herbert in a recent related patch, the LSM hooks do
not have the necessary address family information to use the flowi
struct safely.  As none of the LSMs currently use any of the protocol
specific flowi information, replace the flowi pointers with pointers
to the address family independent flowi_common struct.

Reported-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-11-23 18:36:21 -05:00
Jason Gunthorpe
ed92f6a52b Linux 5.10-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl+69egeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGTSYH/ifRBlaxy5UiHFc0
 2zdR7pkjWrYfDTTT3sazIAhdlzzcfnkUqgFxOP45F4ZIqeTzunH3sUY+5UlT9IX7
 liUgnLxQ/1R9Gx8kPGQfu+tLCey78xVFydGsqJoW9sPRw2R+apMdGGa/lOrk+OXz
 DXIN+dDnGFqwCCNJpK+rxQQhFf++IPpSI8z6Y23moOFhsDZrEziHuVFy2FGyRM6z
 prZ/us/tcobE8ptCk1RmOxLoJ1DR6UxpA2vLimTE+JD8siOsSWPbjE0KudnWCnd5
 BLqIjrsPJbSxyuzzK3v9dnO5wMv7tMDuMIuYM/MQTXDttNwtsqt/aP6gdnUCym7N
 5eHEj5g=
 =MuO1
 -----END PGP SIGNATURE-----

Merge tag 'v5.10-rc5' into rdma.git for-next

For dependencies in following patches

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-23 16:50:59 -04:00
David Howells
d7d775b1ff rxrpc: Ask the security class how much space to allow in a packet
Ask the security class how much header and trailer space to allow for when
allocating a packet, given how much data is remaining.

This will allow the rxgk security class to stick both a trailer in as well
as a header as appropriate in the future.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-11-23 19:53:11 +00:00
David Howells
ceff522db2 rxrpc: rxkad: Don't use pskb_pull() to advance through the response packet
In the rxkad security class, don't use pskb_pull() to advance through the
contents of the response packet.  There's no point, especially as the next
and last access to the skbuff still has to allow for the wire header in the
offset (which we didn't advance over).

Better to just add the displacement to the next offset.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-11-23 18:09:30 +00:00
David Howells
521bb3049c rxrpc: Organise connection security to use a union
Organise the security information in the rxrpc_connection struct to use a
union to allow for different data for different security classes.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-11-23 18:09:30 +00:00
David Howells
f4bdf3d683 rxrpc: Don't reserve security header in Tx DATA skbuff
Insert the security header into the skbuff representing a DATA packet to be
transmitted rather than using skb_reserve() when the packet is allocated.
This makes it easier to apply crypto that spans the security header and the
data, particularly in the upcoming RxGK class where we have a common
encrypt-and-checksum function that is used in a number of circumstances.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-11-23 18:09:30 +00:00
David Howells
8d47a43c48 rxrpc: Merge prime_packet_security into init_connection_security
Merge the ->prime_packet_security() into the ->init_connection_security()
hook as they're always called together.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-11-23 18:09:30 +00:00
David Howells
177b898966 rxrpc: Fix example key name in a comment
Fix an example of an rxrpc key name in a comment.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-11-23 18:09:30 +00:00
David Howells
9a0e6464f4 rxrpc: Ignore unknown tokens in key payload unless no known tokens
When parsing a payload for an rxrpc-type key, ignore any tokens that are
not of a known type and don't give an error for them - unless there are no
tokens of a known type.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-11-23 18:09:30 +00:00
David Howells
4c20c33340 rxrpc: Make the parsing of xdr payloads more coherent
Make the parsing of xdr-encoded payloads, as passed to add_key, more
coherent.  Shuttling back and forth between various variables was a bit
hard to follow.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-11-23 18:09:30 +00:00
David Howells
d5953f6543 rxrpc: Allow security classes to give more info on server keys
Allow a security class to give more information on an rxrpc_s-type key when
it is viewed in /proc/keys.  This will allow the upcoming RxGK security
class to show the enctype name here.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-11-23 18:09:29 +00:00
David Howells
d2ae4e9182 rxrpc: Don't leak the service-side session key to userspace
Don't let someone reading a service-side rxrpc-type key get access to the
session key that was exchanged with the client.  The server application
will, at some point, need to be able to read the information in the ticket,
but this probably shouldn't include the key material.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-11-23 18:09:29 +00:00
David Howells
12da59fcab rxrpc: Hand server key parsing off to the security class
Hand responsibility for parsing a server key off to the security class.  We
can determine which class from the description.  This is necessary as rxgk
server keys have different lookup requirements and different content
requirements (dependent on crypto type) to those of rxkad server keys.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-11-23 18:09:29 +00:00
David Howells
ca7fb10059 rxrpc: Split the server key type (rxrpc_s) into its own file
Split the server private key type (rxrpc_s) out into its own file rather
than mingling it with the authentication/client key type (rxrpc) since they
don't really bear any relation.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-11-23 18:09:29 +00:00
David Howells
ec832bd06d rxrpc: Don't retain the server key in the connection
Don't retain a pointer to the server key in the connection, but rather get
it on demand when the server has to deal with a response packet.

This is necessary to implement RxGK (GSSAPI-mediated transport class),
where we can't know which key we'll need until we've challenged the client
and got back the response.

This also means that we don't need to do a key search in the accept path in
softirq mode.

Also, whilst we're at it, allow the security class to ask for a kvno and
encoding-type variant of a server key as RxGK needs different keys for
different encoding types.  Keys of this type have an extra bit in the
description:

	"<service-id>:<security-index>:<kvno>:<enctype>"

Signed-off-by: David Howells <dhowells@redhat.com>
2020-11-23 18:09:29 +00:00
David Howells
41057ebde0 rxrpc: Support keys with multiple authentication tokens
rxrpc-type keys can have multiple tokens attached for different security
classes.  Currently, rxrpc always picks the first one, whether or not the
security class it indicates is supported.

Add preliminary support for choosing which security class will be used
(this will need to be directed from a higher layer) and go through the
tokens to find one that's supported.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-11-23 18:09:29 +00:00
David Howells
0727d3ec38 rxrpc: List the held token types in the key description in /proc/keys
When viewing an rxrpc-type key through /proc/keys, display a list of held
token types.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-11-23 18:09:29 +00:00
David Howells
8a5dc32115 rxrpc: Remove the rxk5 security class as it's now defunct
Remove the rxrpc rxk5 security class as it's now defunct and nothing uses
it anymore.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-11-23 18:09:29 +00:00
Jens Axboe
b713c195d5 net: provide __sys_shutdown_sock() that takes a socket
No functional changes in this patch, needed to provide io_uring support
for shutdown(2).

Cc: netdev@vger.kernel.org
Cc: David S. Miller <davem@davemloft.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-23 09:15:15 -07:00
Marek Majtyka
178648916e xsk: Fix incorrect netdev reference count
Fix incorrect netdev reference count in xsk_bind operation. Incorrect
reference count of the device appears when a user calls bind with the
XDP_ZEROCOPY flag on an interface which does not support zero-copy.
In such a case, an error is returned but the reference count is not
decreased. This change fixes the fault, by decreasing the reference count
in case of such an error.

The problem being corrected appeared in '162c820ed896' for the first time,
and the code was moved to new file location over the time with commit
'c2d3d6a47462'. This specific patch applies to all version starting
from 'c2d3d6a47462'. The same solution should be applied but on different
file (net/xdp/xdp_umem.c) and function (xdp_umem_assign_dev) for versions
from '162c820ed896' to 'c2d3d6a47462' excluded.

Fixes: 162c820ed8 ("xdp: hold device for umem regardless of zero-copy mode")
Signed-off-by: Marek Majtyka <marekx.majtyka@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20201120151443.105903-1-marekx.majtyka@intel.com
2020-11-23 13:19:33 +01:00
Yejune Deng
988187e881 ipvs: replace atomic_add_return()
atomic_inc_return() looks better

Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-11-22 13:45:52 +01:00
Randy Dunlap
fd2d6bc4c2 netfilter: nft_reject_bridge: fix build errors due to code movement
Fix build errors in net/bridge/netfilter/nft_reject_bridge.ko
by selecting NF_REJECT_IPV4, which provides the missing symbols.

ERROR: modpost: "nf_reject_skb_v4_tcp_reset" [net/bridge/netfilter/nft_reject_bridge.ko] undefined!
ERROR: modpost: "nf_reject_skb_v4_unreach" [net/bridge/netfilter/nft_reject_bridge.ko] undefined!

Fixes: fa538f7cf0 ("netfilter: nf_reject: add reject skbuff creation helpers")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-11-22 13:44:51 +01:00
Jakub Kicinski
5e08723967 linux-can-next-for-5.11-20201120
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEK3kIWJt9yTYMP3ehqclaivrt76kFAl+3q+gTHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRCpyVqK+u3vqYdjB/46Qv583QBIbM8YaXN+XTSfxSufp0Ku
 G0R88mJKdOalgeiD58TaZ3hAegjOW44nCMmZP64mQB7XNvRiyNk8j69VDsRy+4Zy
 r86Hi0TIlpeM5oNHb84SpvIO01elmuYNlaoU9t7NQV5YlKA2Lq+kcmMLGa7Z1zVk
 IZ2KF2eF2V83p/4A5t1cgQV6LRlDBgAeoe/9pyG0mNX3Fv8lI3WIqjRkuevSm6iJ
 XMEuc7kjUZ8Yd59Vvnipm89teD5lCyIpxeJ2hrOsiUlkwyosdCxewMeohJ596a2L
 t5DoiT+3nJLOBQ/IKCYe5RoCf7sFZ8nfjqcNYeOm7Yw1CbXg9KwGZTU7
 =BedD
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-next-for-5.11-20201120' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2020-11-20

The first patch is by Yegor Yefremov and he improves the j1939 documentaton by
adding tables for the CAN identifier and its fields.

Then there are 8 patches by Oliver Hartkopp targeting the CAN driver
infrastructure and drivers. These add support for optional DLC element to the
Classical CAN frame structure. See patch ea7800565a ("can: add optional DLC
element to Classical CAN frame structure") for details. Oliver's last patch
adds len8_dlc support to several drivers. Stefan Mätje provides a patch to add
len8_dlc support to the esd_usb2 driver.

The next patch is by Oliver Hartkopp, too and adds support for modification of
Classical CAN DLCs to CAN GW sockets.

The next 3 patches target the nxp,flexcan DT bindings. One patch by my adds the
missing uint32 reference to the clock-frequency property. Joakim Zhang's
patches fix the fsl,clk-source property and add the IMX_SC_R_CAN() macro to the
imx firmware header file, which will be used in the flexcan driver later.

Another patch by Joakim Zhang prepares the flexcan driver for SCU based
stop-mode, by giving the existing, GPR based stop-mode, a _GPR postfix.

The next 5 patches are by me, target the flexcan driver, and clean up the
.ndo_open and .ndo_stop callbacks. These patches try to fix a sporadically
hanging flexcan_close() during simultanious ifdown, sending of CAN messages and
probably open CAN bus. I was never able to reproduce, but these seem to fix the
problem at the reporting user. As these changes are rather big, I'd like to
mainline them via net-next/master.

The next patches are by Jimmy Assarsson and Christer Beskow, they add support
for new USB devices to the existing kvaser_usb driver.

The last patch is by Kaixu Xia and simplifies the return in the
mcp251xfd_chip_softreset() function in the mcp251xfd driver.

* tag 'linux-can-next-for-5.11-20201120' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next: (25 commits)
  can: mcp251xfd: remove useless code in mcp251xfd_chip_softreset
  can: kvaser_usb: Add new Kvaser hydra devices
  can: kvaser_usb: kvaser_usb_hydra: Add support for new device variant
  can: kvaser_usb: Add new Kvaser Leaf v2 devices
  can: kvaser_usb: Add USB_{LEAF,HYDRA}_PRODUCT_ID_END defines
  can: flexcan: flexcan_close(): change order if commands to properly shut down the controller
  can: flexcan: flexcan_open(): completely initialize controller before requesting IRQ
  can: flexcan: flexcan_rx_offload_setup(): factor out mailbox and rx-offload setup into separate function
  can: flexcan: move enabling/disabling of interrupts from flexcan_chip_{start,stop}() to callers
  can: flexcan: factor out enabling and disabling of interrupts into separate function
  can: flexcan: rename macro FLEXCAN_QUIRK_SETUP_STOP_MODE -> FLEXCAN_QUIRK_SETUP_STOP_MODE_GPR
  dt-bindings: firmware: add IMX_SC_R_CAN(x) macro for CAN
  dt-bindings: can: fsl,flexcan: fix fsl,clk-source property
  dt-bindings: can: fsl,flexcan: add uint32 reference to clock-frequency property
  can: gw: support modification of Classical CAN DLCs
  can: drivers: add len8_dlc support for esd_usb2 CAN adapter
  can: drivers: add len8_dlc support for various CAN adapters
  can: drivers: introduce helpers to access Classical CAN DLC values
  can: update documentation for DLC usage in Classical CAN
  can: rename CAN FD related can_len2dlc and can_dlc2len helpers
  ...
====================

Link: https://lore.kernel.org/r/20201120133318.3428231-1-mkl@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-21 14:58:12 -08:00
Julian Wiedmann
c5dab0941f net/af_iucv: set correct sk_protocol for child sockets
Child sockets erroneously inherit their parent's sk_type (ie. SOCK_*),
instead of the PF_IUCV protocol that the parent was created with in
iucv_sock_create().

We're currently not using sk->sk_protocol ourselves, so this shouldn't
have much impact (except eg. getting the output in skb_dump() right).

Fixes: eac3731bd0 ("[S390]: Add AF_IUCV socket support")
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Link: https://lore.kernel.org/r/20201120100657.34407-1-jwi@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-21 14:43:45 -08:00
Heiner Kallweit
7609ecb2ed net: bridge: switch to net core statistics counters handling
Use netdev->tstats instead of a member of net_bridge for storing
a pointer to the per-cpu counters. This allows us to use core
functionality for statistics handling.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/9bad2be2-fd84-7c6e-912f-cee433787018@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-21 14:40:50 -08:00
Alexander Duyck
55472017a4 tcp: Set INET_ECN_xmit configuration in tcp_reinit_congestion_control
When setting congestion control via a BPF program it is seen that the
SYN/ACK for packets within a given flow will not include the ECT0 flag. A
bit of simple printk debugging shows that when this is configured without
BPF we will see the value INET_ECN_xmit value initialized in
tcp_assign_congestion_control however when we configure this via BPF the
socket is in the closed state and as such it isn't configured, and I do not
see it being initialized when we transition the socket into the listen
state. The result of this is that the ECT0 bit is configured based on
whatever the default state is for the socket.

Any easy way to reproduce this is to monitor the following with tcpdump:
tools/testing/selftests/bpf/test_progs -t bpf_tcp_ca

Without this patch the SYN/ACK will follow whatever the default is. If dctcp
all SYN/ACK packets will have the ECT0 bit set, and if it is not then ECT0
will be cleared on all SYN/ACK packets. With this patch applied the SYN/ACK
bit matches the value seen on the other packets in the given stream.

Fixes: 91b5b21c7c ("bpf: Add support for changing congestion control")
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 18:09:47 -08:00
Alexander Duyck
861602b577 tcp: Allow full IP tos/IPv6 tclass to be reflected in L3 header
An issue was recently found where DCTCP SYN/ACK packets did not have the
ECT bit set in the L3 header. A bit of code review found that the recent
change referenced below had gone though and added a mask that prevented the
ECN bits from being populated in the L3 header.

This patch addresses that by rolling back the mask so that it is only
applied to the flags coming from the incoming TCP request instead of
applying it to the socket tos/tclass field. Doing this the ECT bits were
restored in the SYN/ACK packets in my testing.

One thing that is not addressed by this patch set is the fact that
tcp_reflect_tos appears to be incompatible with ECN based congestion
avoidance algorithms. At a minimum the feature should likely be documented
which it currently isn't.

Fixes: ac8f1710c1 ("tcp: reflect tos value received in SYN to the socket")
Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 18:09:47 -08:00
Paolo Abeni
ea4ca586b1 mptcp: refine MPTCP-level ack scheduling
Send timely MPTCP-level ack is somewhat difficult when
the insertion into the msk receive level is performed
by the worker.

It needs TCP-level dup-ack to notify the MPTCP-level
ack_seq increase, as both the TCP-level ack seq and the
rcv window are unchanged.

We can actually avoid processing incoming data with the
worker, and let the subflow or recevmsg() send ack as needed.

When recvmsg() moves the skbs inside the msk receive queue,
the msk space is still unchanged, so tcp_cleanup_rbuf() could
end-up skipping TCP-level ack generation. Anyway, when
__mptcp_move_skbs() is invoked, a known amount of bytes is
going to be consumed soon: we update rcv wnd computation taking
them in account.

Additionally we need to explicitly trigger tcp_cleanup_rbuf()
when recvmsg() consumes a significant amount of the receive buffer.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 15:33:25 -08:00
Florian Westphal
fa3fe2b150 mptcp: track window announced to peer
OoO handling attempts to detect when packet is out-of-window by testing
current ack sequence and remaining space vs. sequence number.

This doesn't work reliably. Store the highest allowed sequence number
that we've announced and use it to detect oow packets.

Do this when mptcp options get written to the packet (wire format).
For this to work we need to move the write_options call until after
stack selected a new tcp window.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 15:33:25 -08:00
Geliang Tang
84dfe3677a mptcp: send out dedicated ADD_ADDR packet
When ADD_ADDR suboption includes an IPv6 address, the size is 28 octets.
It will not fit when other MPTCP suboptions are included in this packet,
e.g. DSS. So here we send out an ADD_ADDR dedicated packet to carry only
ADD_ADDR suboption, no other MPTCP suboptions.

In mptcp_pm_announce_addr, we check whether this is an IPv6 ADD_ADDR.
If it is, we set the flag MPTCP_ADD_ADDR_IPV6 to true. Then we call
mptcp_pm_nl_add_addr_send_ack to sent out a new pure ACK packet.

In mptcp_established_options_add_addr, we check whether this is a pure
ACK packet for ADD_ADDR. If it is, we drop all other MPTCP suboptions
in this packet, only put ADD_ADDR suboption in it.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 15:33:25 -08:00
Geliang Tang
d91d322a72 mptcp: change add_addr_signal type
This patch changed the 'add_addr_signal' type from bool to char, so that
we could encode the addr type there.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 15:33:25 -08:00
Paolo Abeni
0397c6d85f mptcp: keep unaccepted MPC subflow into join list
This will simplify all operation dealing with subflows
before accept time (e.g. data fin processing, add_addr).

The join list is already flushed by mptcp_stream_accept()
before returning the newly created msk to the user space.

This also fixes an potential bug present into the old code:
conn_list was manipulated without helding the msk lock
in mptcp_stream_accept().

Tested-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 15:33:25 -08:00
Florian Westphal
860975c6f8 mptcp: skip to next candidate if subflow has unacked data
In case a subflow path is blocked, MPTCP-level retransmit may not take
place anymore because such subflow is likely to have unacked data left
in its write queue.

Ignore subflows that have experienced loss and test next candidate.

Fixes: 3b1d6210a9 ("mptcp: implement and use MPTCP-level retransmission")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 15:33:24 -08:00
Paolo Abeni
26aa231439 mptcp: fix state tracking for fallback socket
We need to cope with some more state transition for
fallback sockets, or could still end-up moving to TCP_CLOSE
too early and avoid spooling some pending data

Fixes: e16163b6e2 ("mptcp: refactor shutdown and close")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 15:33:24 -08:00
Paolo Abeni
b2771d2419 mptcp: drop WORKER_RUNNING status bit
Only mptcp_close() can actually cancel the workqueue,
no need to add and use this flag.

Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 15:33:24 -08:00
Christian Eggers
30abc9cd9c net: dsa: avoid potential use-after-free error
If dsa_switch_ops::port_txtstamp() returns false, clone will be freed
immediately. Shouldn't store a pointer to freed memory.

Signed-off-by: Christian Eggers <ceggers@arri.de>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Tested-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20201119110906.25558-1-ceggers@arri.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 15:02:50 -08:00
Vadim Fedorenko
20ffc7adf5 net/tls: missing received data after fast remote close
In case when tcp socket received FIN after some data and the
parser haven't started before reading data caller will receive
an empty buffer. This behavior differs from plain TCP socket and
leads to special treating in user-space.
The flow that triggers the race is simple. Server sends small
amount of data right after the connection is configured to use TLS
and closes the connection. In this case receiver sees TLS Handshake
data, configures TLS socket right after Change Cipher Spec record.
While the configuration is in process, TCP socket receives small
Application Data record, Encrypted Alert record and FIN packet. So
the TCP socket changes sk_shutdown to RCV_SHUTDOWN and sk_flag with
SK_DONE bit set. The received data is not parsed upon arrival and is
never sent to user-space.

Patch unpauses parser directly if we have unparsed data in tcp
receive queue.

Fixes: fcf4793e27 ("tls: check RCV_SHUTDOWN in tls_wait_data")
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Link: https://lore.kernel.org/r/1605801588-12236-1-git-send-email-vfedorenko@novek.ru
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 10:25:26 -08:00
Anmol Karn
3b3fd068c5 rose: Fix Null pointer dereference in rose_send_frame()
rose_send_frame() dereferences `neigh->dev` when called from
rose_transmit_clear_request(), and the first occurrence of the
`neigh` is in rose_loopback_timer() as `rose_loopback_neigh`,
and it is initialized in rose_add_loopback_neigh() as NULL.
i.e when `rose_loopback_neigh` used in rose_loopback_timer()
its `->dev` was still NULL and rose_loopback_timer() was calling
rose_rx_call_request() without checking for NULL.

- net/rose/rose_link.c
This bug seems to get triggered in this line:

rose_call = (ax25_address *)neigh->dev->dev_addr;

Fix it by adding NULL checking for `rose_loopback_neigh->dev`
in rose_loopback_timer().

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: syzbot+a1c743815982d9496393@syzkaller.appspotmail.com
Tested-by: syzbot+a1c743815982d9496393@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=9d2a7ca8c7f2e4b682c97578dfa3f236258300b3
Signed-off-by: Anmol Karn <anmol.karan123@gmail.com>
Link: https://lore.kernel.org/r/20201119191043.28813-1-anmol.karan123@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 10:04:58 -08:00
Paolo Abeni
12f4bd8622 net: add annotation for sock_{lock,unlock}_fast
The static checker is fooled by the non-static locking scheme
implemented by the mentioned helpers.
Let's make its life easier adding some unconditional annotation
so that the helpers are now interpreted as a plain spinlock from
sparse.

v1 -> v2:
 - add __releases() annotation to unlock_sock_fast()

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/6ed7ae627d8271fb7f20e0a9c6750fbba1ac2635.1605634911.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 10:00:54 -08:00
Numan Siddique
e2ef5203c8 net: openvswitch: Be liberal in tcp conntrack.
There is no easy way to distinguish if a conntracked tcp packet is
marked invalid because of tcp_in_window() check error or because
it doesn't belong to an existing connection. With this patch,
openvswitch sets liberal tcp flag for the established sessions so
that out of window packets are not marked invalid.

A helper function - nf_ct_set_tcp_be_liberal(nf_conn) is added which
sets this flag for both the directions of the nf_conn.

Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20201116130126.3065077-1-nusiddiq@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-20 09:53:48 -08:00
Magnus Karlsson
537cf4e3cc xsk: Fix umem cleanup bug at socket destruct
Fix a bug that is triggered when a partially setup socket is
destroyed. For a fully setup socket, a socket that has been bound to a
device, the cleanup of the umem is performed at the end of the buffer
pool's cleanup work queue item. This has to be performed in a work
queue, and not in RCU cleanup, as it is doing a vunmap that cannot
execute in interrupt context. However, when a socket has only been
partially set up so that a umem has been created but the buffer pool
has not, the code erroneously directly calls the umem cleanup function
instead of using a work queue, and this leads to a BUG_ON() in
vunmap().

As there in this case is no buffer pool, we cannot use its work queue,
so we need to introduce a work queue for the umem and schedule this for
the cleanup. So in the case there is no pool, we are going to use the
umem's own work queue to schedule the cleanup. But if there is a
pool, the cleanup of the umem is still being performed by the pool's
work queue, as it is important that the umem is cleaned up after the
pool.

Fixes: e5e1a4bc91 ("xsk: Fix possible memory leak at socket close")
Reported-by: Marek Majtyka <marekx.majtyka@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Marek Majtyka <marekx.majtyka@intel.com>
Link: https://lore.kernel.org/bpf/1605873219-21629-1-git-send-email-magnus.karlsson@gmail.com
2020-11-20 15:52:39 +01:00
Oliver Hartkopp
94c23097f9 can: gw: support modification of Classical CAN DLCs
Add support for data length code modifications for Classical CAN.

The netlink configuration interface always allowed to pass any value
that fits into a byte, therefore only the modification process had to be
extended to handle the raw DLC represenation of Classical CAN frames.

When a DLC value from 0 .. F is provided for Classical CAN frame
modifications the 'len' value is modified as-is with the exception that
potentially existing 9 .. F DLC values in the len8_dlc element are moved
to the 'len' element for the modification operation by mod_retrieve_ccdlc().

After the modification the Classical CAN frame DLC information is brought
back into the correct format by mod_store_ccdlc() which is filling 'len'
and 'len8_dlc' accordingly.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20201119084921.2621-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-11-20 12:05:14 +01:00
Oliver Hartkopp
c7b7496779 can: replace can_dlc as variable/element for payload length
The naming of can_dlc as element of struct can_frame and also as variable
name is misleading as it claims to be a 'data length CODE' but in reality
it always was a plain data length.

With the indroduction of a new 'len' element in struct can_frame we can now
remove can_dlc as name and make clear which of the former uses was a plain
length (-> 'len') or a data length code (-> 'dlc') value.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/r/20201120100444.3199-1-socketcan@hartkopp.net
[mkl: gs_usb: keep struct gs_host_frame::can_dlc as is]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-11-20 12:04:12 +01:00
Bhaumik Bhatt
2ca7e30d3b
net: qrtr: Unprepare MHI channels during remove
Reset MHI device channels when driver remove is called due to
module unload or any crash scenario. This will make sure that
MHI channels no longer remain enabled for transfers since the
MHI stack does not take care of this anymore after the auto-start
channels feature was removed.

Signed-off-by: Bhaumik Bhatt <bbhatt@codeaurora.org>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
2020-11-20 11:55:17 +05:30
Oliver Herms
6b13d8f71f IPv6: RTM_GETROUTE: Add RTA_ENCAP to result
This patch adds an IPv6 routes encapsulation attribute
to the result of netlink RTM_GETROUTE requests
(i.e. ip route get 2001:db8::).

Signed-off-by: Oliver Herms <oliver.peter.herms@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20201118230651.GA8861@tws
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19 21:56:20 -08:00
Paolo Abeni
b680a214ec mptcp: update rtx timeout only if required.
We must start the retransmission timer only there are
pending data in the rtx queue.
Otherwise we can hit a WARN_ON in mptcp_reset_timer(),
as syzbot demonstrated.

Reported-and-tested-by: syzbot+42aa53dafb66a07e5a24@syzkaller.appspotmail.com
Fixes: d9ca1de8c0 ("mptcp: move page frag allocation in mptcp_sendmsg()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Link: https://lore.kernel.org/r/1a72039f112cae048c44d398ffa14e0a1432db3d.1605737083.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19 21:54:24 -08:00
Jacob Keller
52cc5f3a16 devlink: move flash end and begin to core devlink
When performing a flash update via devlink, device drivers may inform
user space of status updates via
devlink_flash_update_(begin|end|timeout|status)_notify functions.

It is expected that drivers do not send any status notifications unless
they send a begin and end message. If a driver sends a status
notification without sending the appropriate end notification upon
finishing (regardless of success or failure), the current implementation
of the devlink userspace program can get stuck endlessly waiting for the
end notification that will never come.

The current ice driver implementation may send such a status message
without the appropriate end notification in rare cases.

Fixing the ice driver is relatively simple: we just need to send the
begin_notify at the start of the function and always send an end_notify
no matter how the function exits.

Rather than assuming driver authors will always get this right in the
future, lets just fix the API so that it is not possible to get wrong.
Make devlink_flash_update_begin_notify and
devlink_flash_update_end_notify static, and call them in devlink.c core
code. Always send the begin_notify just before calling the driver's
flash_update routine. Always send the end_notify just after the routine
returns regardless of success or failure.

Doing this makes the status notification easier to use from the driver,
as it no longer needs to worry about catching failures and cleaning up
by calling devlink_flash_update_end_notify. It is now no longer possible
to do the wrong thing in this regard. We also save a couple of lines of
code in each driver.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-19 21:41:02 -08:00