Commit Graph

7353 Commits

Author SHA1 Message Date
Xin Long
dbd50f238d net: move the hsize check to the else block in skb_segment
After commit 89319d3801 ("net: Add frag_list support to skb_segment"),
it goes to process frag_list when !hsize in skb_segment(). However, when
using skb frag_list, sg normally should not be set. In this case, hsize
will be set with len right before !hsize check, then it won't go to
frag_list processing code.

So the right thing to do is move the hsize check to the else block, so
that it won't affect the !hsize check for frag_list processing.

v1->v2:
  - change to do "hsize <= 0" check instead of "!hsize", and also move
    "hsize < 0" into else block, to save some cycles, as Alex suggested.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-16 19:05:59 -08:00
Jakub Kicinski
2d9116be76 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2021-01-16

1) Extend atomic operations to the BPF instruction set along with x86-64 JIT support,
   that is, atomic{,64}_{xchg,cmpxchg,fetch_{add,and,or,xor}}, from Brendan Jackman.

2) Add support for using kernel module global variables (__ksym externs in BPF
   programs) retrieved via module's BTF, from Andrii Nakryiko.

3) Generalize BPF stackmap's buildid retrieval and add support to have buildid
   stored in mmap2 event for perf, from Jiri Olsa.

4) Various fixes for cross-building BPF sefltests out-of-tree which then will
   unblock wider automated testing on ARM hardware, from Jean-Philippe Brucker.

5) Allow to retrieve SOL_SOCKET opts from sock_addr progs, from Daniel Borkmann.

6) Clean up driver's XDP buffer init and split into two helpers to init per-
   descriptor and non-changing fields during processing, from Lorenzo Bianconi.

7) Minor misc improvements to libbpf & bpftool, from Ian Rogers.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (41 commits)
  perf: Add build id data in mmap2 event
  bpf: Add size arg to build_id_parse function
  bpf: Move stack_map_get_build_id into lib
  bpf: Document new atomic instructions
  bpf: Add tests for new BPF atomic operations
  bpf: Add bitwise atomic instructions
  bpf: Pull out a macro for interpreting atomic ALU operations
  bpf: Add instructions for atomic_[cmp]xchg
  bpf: Add BPF_FETCH field / create atomic_fetch_add instruction
  bpf: Move BPF_STX reserved field check into BPF_STX verifier code
  bpf: Rename BPF_XADD and prepare to encode other atomics in .imm
  bpf: x86: Factor out a lookup table for some ALU opcodes
  bpf: x86: Factor out emission of REX byte
  bpf: x86: Factor out emission of ModR/M for *(reg + off)
  tools/bpftool: Add -Wall when building BPF programs
  bpf, libbpf: Avoid unused function warning on bpf_tail_call_static
  selftests/bpf: Install btf_dump test cases
  selftests/bpf: Fix installation of urandom_read
  selftests/bpf: Move generated test files to $(TEST_GEN_FILES)
  selftests/bpf: Fix out-of-tree build
  ...
====================

Link: https://lore.kernel.org/r/20210116012922.17823-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15 17:57:26 -08:00
Tom Rix
e794e7fa19 neighbor: remove definition of DEBUG
Defining DEBUG should only be done in development.
So remove DEBUG.

Signed-off-by: Tom Rix <trix@redhat.com>
Link: https://lore.kernel.org/r/20210114212917.48174-1-trix@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15 17:51:18 -08:00
Jakub Kicinski
1d9f03c0a1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 18:34:50 -08:00
Eran Ben Elisha
4f1cc51f34 net: flow_dissector: Parse PTP L2 packet header
Add support for parsing PTP L2 packet header. Such packet consists
of an L2 header (with ethertype of ETH_P_1588), PTP header, body
and an optional suffix.

Signed-off-by: Eran Ben Elisha <eranbe@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 18:24:54 -08:00
Tariq Toukan
25537d71e2 net: Allow NETIF_F_HW_TLS_TX if IP_CSUM && IPV6_CSUM
Cited patch below blocked the TLS TX device offload unless HW_CSUM
is set. This broke devices that use IP_CSUM && IP6_CSUM.
Here we fix it.

Note that the single HW_TLS_TX feature flag indicates support for
both IPv4/6, hence it should still be disabled in case only one of
(IP_CSUM | IPV6_CSUM) is set.

Fixes: ae0b04b238 ("net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reported-by: Rohit Maheshwari <rohitm@chelsio.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Link: https://lore.kernel.org/r/20210114151215.7061-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 10:56:13 -08:00
Eric Dumazet
3226b158e6 net: avoid 32 x truesize under-estimation for tiny skbs
Both virtio net and napi_get_frags() allocate skbs
with a very small skb->head

While using page fragments instead of a kmalloc backed skb->head might give
a small performance improvement in some cases, there is a huge risk of
under estimating memory usage.

For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
per page (order-3 page in x86), or even 64 on PowerPC

We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
but consuming far more memory for TCP buffers than instructed in tcp_mem[2]

Even if we force napi_alloc_skb() to only use order-0 pages, the issue
would still be there on arches with PAGE_SIZE >= 32768

This patch makes sure that small skb head are kmalloc backed, so that
other objects in the slab page can be reused instead of being held as long
as skbs are sitting in socket queues.

Note that we might in the future use the sk_buff napi cache,
instead of going through a more expensive __alloc_skb()

Another idea would be to use separate page sizes depending
on the allocated length (to never have more than 4 frags per page)

I would like to thank Greg Thelen for his precious help on this matter,
analysing crash dumps is always a time consuming task.

Fixes: fd11a83dd3 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Link: https://lore.kernel.org/r/20210113161819.1155526-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 10:51:52 -08:00
Menglong Dong
324cefaf1c net: core: use eth_type_vlan in __netif_receive_skb_core
Replace the check for ETH_P_8021Q and ETH_P_8021AD in
__netif_receive_skb_core with eth_type_vlan.

Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Link: https://lore.kernel.org/r/20210111104221.3451-1-dong.menglong@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-13 19:08:54 -08:00
Daniel Borkmann
bcd6f4a8be bpf: Allow to retrieve sol_socket opts from sock_addr progs
The _bpf_setsockopt() is able to set some of the SOL_SOCKET level options,
however, _bpf_getsockopt() has little support to actually retrieve them.
This small patch adds few misc options such as SO_MARK, SO_PRIORITY and
SO_BINDTOIFINDEX. For the latter getter and setter are added. The mark and
priority in particular allow to retrieve the options from BPF cgroup hooks
to then implement custom behavior / settings on the syscall hooks compared
to other sockets that stick to the defaults, for example.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/cba44439b801e5ddc1170e5be787f4dc93a2d7f9.1610406333.git.daniel@iogearbox.net
2021-01-12 14:44:53 -08:00
Willem de Bruijn
97550f6fa5 net: compound page support in skb_seq_read
skb_seq_read iterates over an skb, returning pointer and length of
the next data range with each call.

It relies on kmap_atomic to access highmem pages when needed.

An skb frag may be backed by a compound page, but kmap_atomic maps
only a single page. There are not enough kmap slots to always map all
pages concurrently.

Instead, if kmap_atomic is needed, iterate over each page.

As this increases the number of calls, avoid this unless needed.
The necessary condition is captured in skb_frag_must_loop.

I tried to make the change as obvious as possible. It should be easy
to verify that nothing changes if skb_frag_must_loop returns false.

Tested:
  On an x86 platform with
    CONFIG_HIGHMEM=y
    CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP=y
    CONFIG_NETFILTER_XT_MATCH_STRING=y

  Run
    ip link set dev lo mtu 1500
    iptables -A OUTPUT -m string --string 'badstring' -algo bm -j ACCEPT
    dd if=/dev/urandom of=in bs=1M count=20
    nc -l -p 8000 > /dev/null &
    nc -w 1 -q 0 localhost 8000 < in

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-11 18:20:09 -08:00
Eric Dumazet
1d11fa6967 net-gro: remove GRO_DROP
GRO_DROP can only be returned from napi_gro_frags()
if the skb has not been allocated by a prior napi_get_frags()

Since drivers must use napi_get_frags() and test its result
before populating the skb with metadata, we can safely remove
GRO_DROP since it offers no practical use.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09 14:24:26 -08:00
Jakub Kicinski
766b0515d5 net: make sure devices go through netdev_wait_all_refs
If register_netdevice() fails at the very last stage - the
notifier call - some subsystems may have already seen it and
grabbed a reference. struct net_device can't be freed right
away without calling netdev_wait_all_refs().

Now that we have a clean interface in form of dev->needs_free_netdev
and lenient free_netdev() we can undo what commit 93ee31f14f ("[NET]:
Fix free_netdev on register_netdev failure.") has done and complete
the unregistration path by bringing the net_set_todo() call back.

After registration fails user is still expected to explicitly
free the net_device, so make sure ->needs_free_netdev is cleared,
otherwise rolling back the registration will cause the old double
free for callers who release rtnl_lock before the free.

This also solves the problem of priv_destructor not being called
on notifier error.

net_set_todo() will be moved back into unregister_netdevice_queue()
in a follow up.

Reported-by: Hulk Robot <hulkci@huawei.com>
Reported-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 19:27:41 -08:00
Jakub Kicinski
c269a24ce0 net: make free_netdev() more lenient with unregistering devices
There are two flavors of handling netdev registration:
 - ones called without holding rtnl_lock: register_netdev() and
   unregister_netdev(); and
 - those called with rtnl_lock held: register_netdevice() and
   unregister_netdevice().

While the semantics of the former are pretty clear, the same can't
be said about the latter. The netdev_todo mechanism is utilized to
perform some of the device unregistering tasks and it hooks into
rtnl_unlock() so the locked variants can't actually finish the work.
In general free_netdev() does not mix well with locked calls. Most
drivers operating under rtnl_lock set dev->needs_free_netdev to true
and expect core to make the free_netdev() call some time later.

The part where this becomes most problematic is error paths. There is
no way to unwind the state cleanly after a call to register_netdevice(),
since unreg can't be performed fully without dropping locks.

Make free_netdev() more lenient, and defer the freeing if device
is being unregistered. This allows error paths to simply call
free_netdev() both after register_netdevice() failed, and after
a call to unregister_netdevice() but before dropping rtnl_lock.

Simplify the error paths which are currently doing gymnastics
around free_netdev() handling.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 19:27:41 -08:00
Jakub Kicinski
2b446e650b docs: net: explain struct net_device lifetime
Explain the two basic flows of struct net_device's operation.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 19:27:41 -08:00
Baptiste Lepers
fd2ddef043 udp: Prevent reuseport_select_sock from reading uninitialized socks
reuse->socks[] is modified concurrently by reuseport_add_sock. To
prevent reading values that have not been fully initialized, only read
the array up until the last known safe index instead of incorrectly
re-reading the last index of the array.

Fixes: acdcecc612 ("udp: correct reuseport selection with connected sockets")
Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20210107051110.12247-1-baptiste.lepers@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 19:15:40 -08:00
Dongseok Yi
53475c5dd8 net: fix use-after-free when UDP GRO with shared fraglist
skbs in fraglist could be shared by a BPF filter loaded at TC. If TC
writes, it will call skb_ensure_writable -> pskb_expand_head to create
a private linear section for the head_skb. And then call
skb_clone_fraglist -> skb_get on each skb in the fraglist.

skb_segment_list overwrites part of the skb linear section of each
fragment itself. Even after skb_clone, the frag_skbs share their
linear section with their clone in PF_PACKET.

Both sk_receive_queue of PF_PACKET and PF_INET (or PF_INET6) can have
a link for the same frag_skbs chain. If a new skb (not frags) is
queued to one of the sk_receive_queue, multiple ptypes can see and
release this. It causes use-after-free.

[ 4443.426215] ------------[ cut here ]------------
[ 4443.426222] refcount_t: underflow; use-after-free.
[ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
[ 4443.426808] Call trace:
[ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426823]  skb_release_data+0x144/0x264
[ 4443.426828]  kfree_skb+0x58/0xc4
[ 4443.426832]  skb_queue_purge+0x64/0x9c
[ 4443.426844]  packet_set_ring+0x5f0/0x820
[ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
[ 4443.426853]  __sys_setsockopt+0x188/0x278
[ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
[ 4443.426869]  el0_svc_common+0xf0/0x1d0
[ 4443.426873]  el0_svc_handler+0x74/0x98
[ 4443.426880]  el0_svc+0x8/0xc

Fixes: 3a1296a38d (net: Support GRO/GSO fraglist chaining.)
Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/1610072918-174177-1-git-send-email-dseok.yi@samsung.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 19:14:02 -08:00
Lorenzo Bianconi
be9df4aff6 net, xdp: Introduce xdp_prepare_buff utility routine
Introduce xdp_prepare_buff utility routine to initialize per-descriptor
xdp_buff fields (e.g. xdp_buff pointers). Rely on xdp_prepare_buff() in
all XDP capable drivers.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Shay Agroskin <shayagr@amazon.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Acked-by: Marcin Wojtas <mw@semihalf.com>
Link: https://lore.kernel.org/bpf/45f46f12295972a97da8ca01990b3e71501e9d89.1608670965.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-08 13:39:24 -08:00
Lorenzo Bianconi
43b5169d83 net, xdp: Introduce xdp_init_buff utility routine
Introduce xdp_init_buff utility routine to initialize xdp_buff fields
const over NAPI iterations (e.g. frame_sz or rxq pointer). Rely on
xdp_init_buff in all XDP capable drivers.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Shay Agroskin <shayagr@amazon.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Acked-by: Marcin Wojtas <mw@semihalf.com>
Link: https://lore.kernel.org/bpf/7f8329b6da1434dc2b05a77f2e800b29628a8913.1608670965.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-08 13:39:24 -08:00
Zheng Yongjun
ec24e11e08 bpf: Replace fput with sockfd_put in sock map
The function sockfd_lookup uses fget on the value that is stored in
the file field of the returned structure, so fput should ultimately
be applied to this value. This can be done directly, but it seems
better to use the specific macro sockfd_put, which does the same
thing.

The cleanup was done using the following semantic patch:
    (http://www.emn.fr/x-info/coccinelle/)

    // <smpl>
    @@
    expression s;
    @@

       s = sockfd_lookup(...)
       ...
    +  sockfd_put(s);
    ?- fput(s->file);
    // </smpl>

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201229134834.22962-1-zhengyongjun3@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-08 13:39:24 -08:00
Jonathan Lemon
8e04491724 skbuff: Rename skb_zcopy_{get|put} to net_zcopy_{get|put}
Unlike the rest of the skb_zcopy_ functions, these routines
operate on a 'struct ubuf', not a skb.  Remove the 'skb_'
prefix from the naming to make things clearer.

Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:08:37 -08:00
Jonathan Lemon
04c2d33eab skbuff: add flags to ubuf_info for ubuf setup
Currently, when an ubuf is attached to a new skb, the shared
flags word is initialized to a fixed value.  Instead of doing
this, set the default flags in the ubuf, and have new skbs
inherit from this default.

This is needed when setting up different zerocopy types.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:08:37 -08:00
Jonathan Lemon
06b4feb37e net: group skb_shinfo zerocopy related bits together.
In preparation for expanded zerocopy (TX and RX), move
the zerocopy related bits out of tx_flags into their own
flag word.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:08:37 -08:00
Jonathan Lemon
8c793822c5 skbuff: rename sock_zerocopy_* to msg_zerocopy_*
At Willem's suggestion, rename the sock_zerocopy_* functions
so that they match the MSG_ZEROCOPY flag, which makes it clear
they are specific to this zerocopy implementation.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:08:35 -08:00
Jonathan Lemon
70c4316749 skbuff: Call skb_zcopy_clear() before unref'ing fragments
RX zerocopy fragment pages which are not allocated from the
system page pool require special handling.  Give the callback
in skb_zcopy_clear() a chance to process them first.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:06:38 -08:00
Jonathan Lemon
236a6b1cd5 skbuff: Call sock_zerocopy_put_abort from skb_zcopy_put_abort
The sock_zerocopy_put_abort function contains logic which is
specific to the current zerocopy implementation.  Add a wrapper
which checks the callback and dispatches apppropriately.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:06:37 -08:00
Jonathan Lemon
36177832f4 skbuff: Add skb parameter to the ubuf zerocopy callback
Add an optional skb parameter to the zerocopy callback parameter,
which is passed down from skb_zcopy_clear().  This gives access
to the original skb, which is needed for upcoming RX zero-copy
error handling.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:06:37 -08:00
Jonathan Lemon
e76d46cfff skbuff: replace sock_zerocopy_get with skb_zcopy_get
Rename the get routines for consistency.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:06:37 -08:00
Jonathan Lemon
59776362b1 skbuff: replace sock_zerocopy_put() with skb_zcopy_put()
Replace sock_zerocopy_put with the generic skb_zcopy_put()
function.  Pass 'true' as the success argument, as this
is identical to no change.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:06:37 -08:00
Jonathan Lemon
75518851a2 skbuff: Push status and refcounts into sock_zerocopy_callback
Before this change, the caller of sock_zerocopy_callback would
need to save the zerocopy status, decrement and check the refcount,
and then call the callback function - the callback was only invoked
when the refcount reached zero.

Now, the caller just passes the status into the callback function,
which saves the status and handles its own refcounts.

This makes the behavior of the sock_zerocopy_callback identical
to the tpacket and vhost callbacks.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:06:37 -08:00
Jonathan Lemon
d6adf1b103 skbuff: simplify sock_zerocopy_put
All 'struct ubuf_info' users should have a callback defined
as of commit 0a4a060bb2 ("sock: fix zerocopy_success regression
with msg_zerocopy").

Remove the dead code path to consume_skb(), which makes
assumptions about how the structure was allocated.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 16:06:37 -08:00
Jakub Kicinski
876c4384ae udp_tunnel: hard-wire NDOs to udp_tunnel_nic_*_port() helpers
All drivers use udp_tunnel_nic_*_port() helpers, prepare for
NDO removal by invoking those helpers directly.

The helpers are safe to call on all devices, they check if
device has the UDP tunnel state initialized.

Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 12:53:29 -08:00
weichenchen
a533b70a65 net: neighbor: fix a crash caused by mod zero
pneigh_enqueue() tries to obtain a random delay by mod
NEIGH_VAR(p, PROXY_DELAY). However, NEIGH_VAR(p, PROXY_DELAY)
migth be zero at that point because someone could write zero
to /proc/sys/net/ipv4/neigh/[device]/proxy_delay after the
callers check it.

This patch uses prandom_u32_max() to get a random delay instead
which avoids potential division by zero.

Signed-off-by: weichenchen <weichen.chen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-28 14:49:48 -08:00
Antoine Tenart
4ae2bb8164 net-sysfs: take the rtnl lock when accessing xps_rxqs_map and num_tc
Accesses to dev->xps_rxqs_map (when using dev->num_tc) should be
protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
see an actual bug being triggered, but let's be safe here and take the
rtnl lock while accessing the map in sysfs.

Fixes: 8af2c06ff4 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-28 13:26:46 -08:00
Antoine Tenart
2d57b4f142 net-sysfs: take the rtnl lock when storing xps_rxqs
Two race conditions can be triggered when storing xps rxqs, resulting in
various oops and invalid memory accesses:

1. Calling netdev_set_num_tc while netif_set_xps_queue:

   - netif_set_xps_queue uses dev->tc_num as one of the parameters to
     compute the size of new_dev_maps when allocating it. dev->tc_num is
     also used to access the map, and the compiler may generate code to
     retrieve this field multiple times in the function.

   - netdev_set_num_tc sets dev->tc_num.

   If new_dev_maps is allocated using dev->tc_num and then dev->tc_num
   is set to a higher value through netdev_set_num_tc, later accesses to
   new_dev_maps in netif_set_xps_queue could lead to accessing memory
   outside of new_dev_maps; triggering an oops.

2. Calling netif_set_xps_queue while netdev_set_num_tc is running:

   2.1. netdev_set_num_tc starts by resetting the xps queues,
        dev->tc_num isn't updated yet.

   2.2. netif_set_xps_queue is called, setting up the map with the
        *old* dev->num_tc.

   2.3. netdev_set_num_tc updates dev->tc_num.

   2.4. Later accesses to the map lead to out of bound accesses and
        oops.

   A similar issue can be found with netdev_reset_tc.

One way of triggering this is to set an iface up (for which the driver
uses netdev_set_num_tc in the open path, such as bnx2x) and writing to
xps_rxqs in a concurrent thread. With the right timing an oops is
triggered.

Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
and netdev_reset_tc should be mutually exclusive. We do that by taking
the rtnl lock in xps_rxqs_store.

Fixes: 8af2c06ff4 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-28 13:26:46 -08:00
Antoine Tenart
fb25038586 net-sysfs: take the rtnl lock when accessing xps_cpus_map and num_tc
Accesses to dev->xps_cpus_map (when using dev->num_tc) should be
protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
see an actual bug being triggered, but let's be safe here and take the
rtnl lock while accessing the map in sysfs.

Fixes: 184c449f91 ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-28 13:26:46 -08:00
Antoine Tenart
1ad58225db net-sysfs: take the rtnl lock when storing xps_cpus
Two race conditions can be triggered when storing xps cpus, resulting in
various oops and invalid memory accesses:

1. Calling netdev_set_num_tc while netif_set_xps_queue:

   - netif_set_xps_queue uses dev->tc_num as one of the parameters to
     compute the size of new_dev_maps when allocating it. dev->tc_num is
     also used to access the map, and the compiler may generate code to
     retrieve this field multiple times in the function.

   - netdev_set_num_tc sets dev->tc_num.

   If new_dev_maps is allocated using dev->tc_num and then dev->tc_num
   is set to a higher value through netdev_set_num_tc, later accesses to
   new_dev_maps in netif_set_xps_queue could lead to accessing memory
   outside of new_dev_maps; triggering an oops.

2. Calling netif_set_xps_queue while netdev_set_num_tc is running:

   2.1. netdev_set_num_tc starts by resetting the xps queues,
        dev->tc_num isn't updated yet.

   2.2. netif_set_xps_queue is called, setting up the map with the
        *old* dev->num_tc.

   2.3. netdev_set_num_tc updates dev->tc_num.

   2.4. Later accesses to the map lead to out of bound accesses and
        oops.

   A similar issue can be found with netdev_reset_tc.

One way of triggering this is to set an iface up (for which the driver
uses netdev_set_num_tc in the open path, such as bnx2x) and writing to
xps_cpus in a concurrent thread. With the right timing an oops is
triggered.

Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
and netdev_reset_tc should be mutually exclusive. We do that by taking
the rtnl lock in xps_cpus_store.

Fixes: 184c449f91 ("net: Add support for XPS with QoS via traffic classes")
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-28 13:26:46 -08:00
Lijun Pan
7061eb8cfa net: core: introduce __netdev_notify_peers
There are some use cases for netdev_notify_peers in the context
when rtnl lock is already held. Introduce lockless version
of netdev_notify_peers call to save the extra code to call
	call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, dev);
	call_netdevice_notifiers(NETDEV_RESEND_IGMP, dev);
After that, convert netdev_notify_peers to call the new helper.

Suggested-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Lijun Pan <ljp@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-16 11:43:25 -08:00
Linus Torvalds
d635a69dd4 Networking updates for 5.11
Core:
 
  - support "prefer busy polling" NAPI operation mode, where we defer softirq
    for some time expecting applications to periodically busy poll
 
  - AF_XDP: improve efficiency by more batching and hindering
            the adjacency cache prefetcher
 
  - af_packet: make packet_fanout.arr size configurable up to 64K
 
  - tcp: optimize TCP zero copy receive in presence of partial or unaligned
         reads making zero copy a performance win for much smaller messages
 
  - XDP: add bulk APIs for returning / freeing frames
 
  - sched: support fragmenting IP packets as they come out of conntrack
 
  - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs
 
 BPF:
 
  - BPF switch from crude rlimit-based to memcg-based memory accounting
 
  - BPF type format information for kernel modules and related tracing
    enhancements
 
  - BPF implement task local storage for BPF LSM
 
  - allow the FENTRY/FEXIT/RAW_TP tracing programs to use bpf_sk_storage
 
 Protocols:
 
  - mptcp: improve multiple xmit streams support, memory accounting and
           many smaller improvements
 
  - TLS: support CHACHA20-POLY1305 cipher
 
  - seg6: add support for SRv6 End.DT4/DT6 behavior
 
  - sctp: Implement RFC 6951: UDP Encapsulation of SCTP
 
  - ppp_generic: add ability to bridge channels directly
 
  - bridge: Connectivity Fault Management (CFM) support as is defined in
            IEEE 802.1Q section 12.14.
 
 Drivers:
 
  - mlx5: make use of the new auxiliary bus to organize the driver internals
 
  - mlx5: more accurate port TX timestamping support
 
  - mlxsw:
    - improve the efficiency of offloaded next hop updates by using
      the new nexthop object API
    - support blackhole nexthops
    - support IEEE 802.1ad (Q-in-Q) bridging
 
  - rtw88: major bluetooth co-existance improvements
 
  - iwlwifi: support new 6 GHz frequency band
 
  - ath11k: Fast Initial Link Setup (FILS)
 
  - mt7915: dual band concurrent (DBDC) support
 
  - net: ipa: add basic support for IPA v4.5
 
 Refactor:
 
  - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej Siewior
 
  - phy: add support for shared interrupts; get rid of multiple driver
         APIs and have the drivers write a full IRQ handler, slight growth
 	of driver code should be compensated by the simpler API which
 	also allows shared IRQs
 
  - add common code for handling netdev per-cpu counters
 
  - move TX packet re-allocation from Ethernet switch tag drivers to
    a central place
 
  - improve efficiency and rename nla_strlcpy
 
  - number of W=1 warning cleanups as we now catch those in a patchwork
    build bot
 
 Old code removal:
 
  - wan: delete the DLCI / SDLA drivers
 
  - wimax: move to staging
 
  - wifi: remove old WDS wifi bridging support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl/YXmUACgkQMUZtbf5S
 IrvSQBAAgOrt4EFopEvVqlTHZbqI45IEqgtXS+YWmlgnjZCgshyMj8q1yK1zzane
 qYxr/NNJ9kV3FdtaynmmHPgEEEfR5kJ/D3B2BsxYDkaDDrD0vbNsBGw+L+/Gbhxl
 N/5l/9FjLyLY1D+EErknuwR5XGuQ6BSDVaKQMhYOiK2hgdnAAI4hszo8Chf6wdD0
 XDBslQ7vpD/05r+eMj0IkS5dSAoGOIFXUxhJ5dqrDbRHiKsIyWqA3PLbYemfAhxI
 s2XckjfmSgGE3FKL8PSFu+EcfHbJQQjLcULJUnqgVcdwEEtRuE9ggEi52nZRXMWM
 4e8sQJAR9Fx7pZy0G1xfS149j6iPU5LjRlU9TNSpVABz14Vvvo3gEL6gyIdsz+xh
 hMN7UBdp0FEaP028CXoIYpaBesvQqj0BSndmee8qsYAtN6j+QKcM2AOSr7JN1uMH
 C/86EDoGAATiEQIVWJvnX5MPmlAoblyLA+RuVhmxkIBx2InGXkFmWqRkXT5l4jtk
 LVl8/TArR4alSQqLXictXCjYlCm9j5N4zFFtEVasSYi7/ZoPfgRNWT+lJ2R8Y+Zv
 +htzGaFuyj6RJTVeFQMrkl3whAtBamo2a0kwg45NnxmmXcspN6kJX1WOIy82+MhD
 Yht7uplSs7MGKA78q/CDU0XBeGjpABUvmplUQBIfrR/jKLW2730=
 =GXs1
 -----END PGP SIGNATURE-----

Merge tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - support "prefer busy polling" NAPI operation mode, where we defer
     softirq for some time expecting applications to periodically busy
     poll

   - AF_XDP: improve efficiency by more batching and hindering the
     adjacency cache prefetcher

   - af_packet: make packet_fanout.arr size configurable up to 64K

   - tcp: optimize TCP zero copy receive in presence of partial or
     unaligned reads making zero copy a performance win for much smaller
     messages

   - XDP: add bulk APIs for returning / freeing frames

   - sched: support fragmenting IP packets as they come out of conntrack

   - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs

  BPF:

   - BPF switch from crude rlimit-based to memcg-based memory accounting

   - BPF type format information for kernel modules and related tracing
     enhancements

   - BPF implement task local storage for BPF LSM

   - allow the FENTRY/FEXIT/RAW_TP tracing programs to use
     bpf_sk_storage

  Protocols:

   - mptcp: improve multiple xmit streams support, memory accounting and
     many smaller improvements

   - TLS: support CHACHA20-POLY1305 cipher

   - seg6: add support for SRv6 End.DT4/DT6 behavior

   - sctp: Implement RFC 6951: UDP Encapsulation of SCTP

   - ppp_generic: add ability to bridge channels directly

   - bridge: Connectivity Fault Management (CFM) support as is defined
     in IEEE 802.1Q section 12.14.

  Drivers:

   - mlx5: make use of the new auxiliary bus to organize the driver
     internals

   - mlx5: more accurate port TX timestamping support

   - mlxsw:
      - improve the efficiency of offloaded next hop updates by using
        the new nexthop object API
      - support blackhole nexthops
      - support IEEE 802.1ad (Q-in-Q) bridging

   - rtw88: major bluetooth co-existance improvements

   - iwlwifi: support new 6 GHz frequency band

   - ath11k: Fast Initial Link Setup (FILS)

   - mt7915: dual band concurrent (DBDC) support

   - net: ipa: add basic support for IPA v4.5

  Refactor:

   - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej
     Siewior

   - phy: add support for shared interrupts; get rid of multiple driver
     APIs and have the drivers write a full IRQ handler, slight growth
     of driver code should be compensated by the simpler API which also
     allows shared IRQs

   - add common code for handling netdev per-cpu counters

   - move TX packet re-allocation from Ethernet switch tag drivers to a
     central place

   - improve efficiency and rename nla_strlcpy

   - number of W=1 warning cleanups as we now catch those in a patchwork
     build bot

  Old code removal:

   - wan: delete the DLCI / SDLA drivers

   - wimax: move to staging

   - wifi: remove old WDS wifi bridging support"

* tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits)
  net: hns3: fix expression that is currently always true
  net: fix proc_fs init handling in af_packet and tls
  nfc: pn533: convert comma to semicolon
  af_vsock: Assign the vsock transport considering the vsock address flags
  af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path
  vsock_addr: Check for supported flag values
  vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag
  vm_sockets: Add flags field in the vsock address data structure
  net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
  tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
  net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context
  nfc: s3fwrn5: Release the nfc firmware
  net: vxget: clean up sparse warnings
  mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router
  mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3
  mlxsw: spectrum_router_xm: Introduce basic XM cache flushing
  mlxsw: reg: Add Router LPM Cache Enable Register
  mlxsw: reg: Add Router LPM Cache ML Delete Register
  mlxsw: spectrum_router_xm: Implement L-value tracking for M-index
  mlxsw: reg: Add XM Router M Table Register
  ...
2020-12-15 13:22:29 -08:00
Tariq Toukan
ae0b04b238 net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
With NETIF_F_HW_TLS_TX packets are encrypted in HW. This cannot be
logically done when HW_CSUM offload is off.

Fixes: 2342a8512a ("net: Add TLS TX offload features")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Link: https://lore.kernel.org/r/20201213143929.26253-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 19:31:36 -08:00
Vasily Averin
54970a2fbb net: drop bogus skb with CHECKSUM_PARTIAL and offset beyond end of trimmed packet
syzbot reproduces BUG_ON in skb_checksum_help():
tun creates (bogus) skb with huge partial-checksummed area and
small ip packet inside. Then ip_rcv trims the skb based on size
of internal ip packet, after that csum offset points beyond of
trimmed skb. Then checksum_tg() called via netfilter hook
triggers BUG_ON:

        offset = skb_checksum_start_offset(skb);
        BUG_ON(offset >= skb_headlen(skb));

To work around the problem this patch forces pskb_trim_rcsum_slow()
to return -EINVAL in described scenario. It allows its callers to
drop such kind of packets.

Link: https://syzkaller.appspot.com/bug?id=b419a5ca95062664fe1a60b764621eb4526e2cd0
Reported-by: syzbot+7010af67ced6105e5ab6@syzkaller.appspotmail.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/1b2494af-2c56-8ee2-7bc0-923fcad1cdf8@virtuozzo.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 18:41:01 -08:00
Linus Torvalds
adb35e8dc9 Scheduler updates:
- migrate_disable/enable() support which originates from the RT tree and
    is now a prerequisite for the new preemptible kmap_local() API which aims
    to replace kmap_atomic().
 
  - A fair amount of topology and NUMA related improvements
 
  - Improvements for the frequency invariant calculations
 
  - Enhanced robustness for the global CPU priority tracking and decision
    making
 
  - The usual small fixes and enhancements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XwK4THHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoX28D/9cVrvziSQGfBfuQWnUiw8iOIq1QBa2
 Me+Tvenhfrlt7xU6rbP9ciFu7eTN+fS06m5uQPGI+t22WuJmHzbmw1bJVXfkvYfI
 /QoU+Hg7DkDAn1p7ZKXh0dRkV0nI9ixxSHl0E+Zf1ATBxCUMV2SO85flg6z/4qJq
 3VWUye0dmR7/bhtkIjv5rwce9v2JB2g1AbgYXYTW9lHVoUdGoMSdiZAF4tGyHLnx
 sJ6DMqQ+k+dmPyYO0z5MTzjW/fXit4n9w2e3z9TvRH/uBu58WSW1RBmQYX6aHBAg
 dhT9F4lvTs6lJY23x5RSFWDOv6xAvKF5a0xfb8UZcyH5EoLYrPRvm42a0BbjdeRa
 u0z7LbwIlKA+RFdZzFZWz8UvvO0ljyMjmiuqZnZ5dY9Cd80LSBuxrWeQYG0qg6lR
 Y2povhhCepEG+q8AXIe2YjHKWKKC1s/l/VY3CNnCzcd21JPQjQ4Z5eWGmHif5IED
 CntaeFFhZadR3w02tkX35zFmY3w4soKKrbI4EKWrQwd+cIEQlOSY7dEPI/b5BbYj
 MWAb3P4EG9N77AWTNmbhK4nN0brEYb+rBbCA+5dtNBVhHTxAC7OTWElJOC2O66FI
 e06dREjvwYtOkRUkUguWwErbIai2gJ2MH0VILV3hHoh64oRk7jjM8PZYnjQkdptQ
 Gsq0rJW5iiu/OQ==
 =Oz1V
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Thomas Gleixner:

 - migrate_disable/enable() support which originates from the RT tree
   and is now a prerequisite for the new preemptible kmap_local() API
   which aims to replace kmap_atomic().

 - A fair amount of topology and NUMA related improvements

 - Improvements for the frequency invariant calculations

 - Enhanced robustness for the global CPU priority tracking and decision
   making

 - The usual small fixes and enhancements all over the place

* tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
  sched/fair: Trivial correction of the newidle_balance() comment
  sched/fair: Clear SMT siblings after determining the core is not idle
  sched: Fix kernel-doc markup
  x86: Print ratio freq_max/freq_base used in frequency invariance calculations
  x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
  x86, sched: Calculate frequency invariance for AMD systems
  irq_work: Optimize irq_work_single()
  smp: Cleanup smp_call_function*()
  irq_work: Cleanup
  sched: Limit the amount of NUMA imbalance that can exist at fork time
  sched/numa: Allow a floating imbalance between NUMA nodes
  sched: Avoid unnecessary calculation of load imbalance at clone time
  sched/numa: Rename nr_running and break out the magic number
  sched: Make migrate_disable/enable() independent of RT
  sched/topology: Condition EAS enablement on FIE support
  arm64: Rebuild sched domains on invariance status changes
  sched/topology,schedutil: Wrap sched domains rebuild
  sched/uclamp: Allow to reset a task uclamp constraint value
  sched/core: Fix typos in comments
  Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug
  ...
2020-12-14 18:29:11 -08:00
Linus Torvalds
f9b4240b07 fixes-v5.11
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCX9daOgAKCRCRxhvAZXjc
 ohPkAQChXUB2BAjtIzXlCkZoDBbzHHblm5DZ37oy/4xYFmAcEwEA5sw6dQqyGHnF
 GEP9def51HvXLpBV2BzNUGggo1SoGgQ=
 =w/cO
 -----END PGP SIGNATURE-----

Merge tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull misc fixes from Christian Brauner:
 "This contains several fixes which felt worth being combined into a
  single branch:

   - Use put_nsproxy() instead of open-coding it switch_task_namespaces()

   - Kirill's work to unify lifecycle management for all namespaces. The
     lifetime counters are used identically for all namespaces types.
     Namespaces may of course have additional unrelated counters and
     these are not altered. This work allows us to unify the type of the
     counters and reduces maintenance cost by moving the counter in one
     place and indicating that basic lifetime management is identical
     for all namespaces.

   - Peilin's fix adding three byte padding to Dmitry's
     PTRACE_GET_SYSCALL_INFO uapi struct to prevent an info leak.

   - Two smal patches to convert from the /* fall through */ comment
     annotation to the fallthrough keyword annotation which I had taken
     into my branch and into -next before df561f6688 ("treewide: Use
     fallthrough pseudo-keyword") made it upstream which fixed this
     tree-wide.

     Since I didn't want to invalidate all testing for other commits I
     didn't rebase and kept them"

* tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  nsproxy: use put_nsproxy() in switch_task_namespaces()
  sys: Convert to the new fallthrough notation
  signal: Convert to the new fallthrough notation
  time: Use generic ns_common::count
  cgroup: Use generic ns_common::count
  mnt: Use generic ns_common::count
  user: Use generic ns_common::count
  pid: Use generic ns_common::count
  ipc: Use generic ns_common::count
  uts: Use generic ns_common::count
  net: Use generic ns_common::count
  ns: Add a common refcount into ns_common
  ptrace: Prevent kernel-infoleak in ptrace_get_syscall_info()
2020-12-14 16:40:27 -08:00
Jakub Kicinski
a6b5e026e6 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-12-14

1) Expose bpf_sk_storage_*() helpers to iterator programs, from Florent Revest.

2) Add AF_XDP selftests based on veth devs to BPF selftests, from Weqaar Janjua.

3) Support for finding BTF based kernel attach targets through libbpf's
   bpf_program__set_attach_target() API, from Andrii Nakryiko.

4) Permit pointers on stack for helper calls in the verifier, from Yonghong Song.

5) Fix overflows in hash map elem size after rlimit removal, from Eric Dumazet.

6) Get rid of direct invocation of llc in BPF selftests, from Andrew Delgadillo.

7) Fix xsk_recvmsg() to reorder socket state check before access, from Björn Töpel.

8) Add new libbpf API helper to retrieve ring buffer epoll fd, from Brendan Jackman.

9) Batch of minor BPF selftest improvements all over the place, from Florian Lehner,
   KP Singh, Jiri Olsa and various others.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits)
  selftests/bpf: Add a test for ptr_to_map_value on stack for helper access
  bpf: Permits pointers on stack for helper calls
  libbpf: Expose libbpf ring_buffer epoll_fd
  selftests/bpf: Add set_attach_target() API selftest for module target
  libbpf: Support modules in bpf_program__set_attach_target() API
  selftests/bpf: Silence ima_setup.sh when not running in verbose mode.
  selftests/bpf: Drop the need for LLVM's llc
  selftests/bpf: fix bpf_testmod.ko recompilation logic
  samples/bpf: Fix possible hang in xdpsock with multiple threads
  selftests/bpf: Make selftest compilation work on clang 11
  selftests/bpf: Xsk selftests - adding xdpxceiver to .gitignore
  selftests/bpf: Drop tcp-{client,server}.py from Makefile
  selftests/bpf: Xsk selftests - Bi-directional Sockets - SKB, DRV
  selftests/bpf: Xsk selftests - Socket Teardown - SKB, DRV
  selftests/bpf: Xsk selftests - DRV POLL, NOPOLL
  selftests/bpf: Xsk selftests - SKB POLL, NOPOLL
  selftests/bpf: Xsk selftests framework
  bpf: Only provide bpf_sock_from_file with CONFIG_NET
  bpf: Return -ENOTSUPP when attaching to non-kernel BTF
  xsk: Validate socket state in xsk_recvmsg, prior touching socket members
  ...
====================

Link: https://lore.kernel.org/r/20201214214316.20642-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 15:34:36 -08:00
Jakub Kicinski
46d5e62dd3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
xdp_return_frame_bulk() needs to pass a xdp_buff
to __xdp_return().

strlcpy got converted to strscpy but here it makes no
functional difference, so just keep the right code.

Conflicts:
	net/netfilter/nf_tables_api.c

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-11 22:29:38 -08:00
David S. Miller
d9838b1d39 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2020-12-10

The following pull-request contains BPF updates for your *net* tree.

We've added 21 non-merge commits during the last 12 day(s) which contain
a total of 21 files changed, 163 insertions(+), 88 deletions(-).

The main changes are:

1) Fix propagation of 32-bit signed bounds from 64-bit bounds, from Alexei.

2) Fix ring_buffer__poll() return value, from Andrii.

3) Fix race in lwt_bpf, from Cong.

4) Fix test_offload, from Toke.

5) Various xsk fixes.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!

Also thanks to reporters, reviewers and testers of commits in this pull-request:

Cong Wang, Hulk Robot, Jakub Kicinski, Jean-Philippe Brucker, John
Fastabend, Magnus Karlsson, Maxim Mikityanskiy, Yonghong Song
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-10 14:29:30 -08:00
Jakub Kicinski
51e13685bd rtnetlink: RCU-annotate both dimensions of rtnl_msg_handlers
We use rcu_assign_pointer to assign both the table and the entries,
but the entries are not marked as __rcu. This generates sparse
warnings.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-10 13:35:59 -08:00
Chris Mi
5137d30365 net: flow_offload: Fix memory leak for indirect flow block
The offending commit introduces a cleanup callback that is invoked
when the driver module is removed to clean up the tunnel device
flow block. But it returns on the first iteration of the for loop.
The remaining indirect flow blocks will never be freed.

Fixes: 1fac52da59 ("net: flow_offload: consolidate indirect flow_block infrastructure")
CC: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
2020-12-09 16:08:33 -08:00
Toke Høiland-Jørgensen
998f172962 xdp: Remove the xdp_attachment_flags_ok() callback
Since commit 7f0a838254 ("bpf, xdp: Maintain info on attached XDP BPF
programs in net_device"), the XDP program attachment info is now maintained
in the core code. This interacts badly with the xdp_attachment_flags_ok()
check that prevents unloading an XDP program with different load flags than
it was loaded with. In practice, two kinds of failures are seen:

- An XDP program loaded without specifying a mode (and which then ends up
  in driver mode) cannot be unloaded if the program mode is specified on
  unload.

- The dev_xdp_uninstall() hook always calls the driver callback with the
  mode set to the type of the program but an empty flags argument, which
  means the flags_ok() check prevents the program from being removed,
  leading to bpf prog reference leaks.

The original reason this check was added was to avoid ambiguity when
multiple programs were loaded. With the way the checks are done in the core
now, this is quite simple to enforce in the core code, so let's add a check
there and get rid of the xdp_attachment_flags_ok() callback entirely.

Fixes: 7f0a838254 ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/160752225751.110217.10267659521308669050.stgit@toke.dk
2020-12-09 16:27:42 +01:00
Florent Revest
b60da4955f bpf: Only provide bpf_sock_from_file with CONFIG_NET
This moves the bpf_sock_from_file definition into net/core/filter.c
which only gets compiled with CONFIG_NET and also moves the helper proto
usage next to other tracing helpers that are conditional on CONFIG_NET.

This avoids
  ld: kernel/trace/bpf_trace.o: in function `bpf_sock_from_file':
  bpf_trace.c:(.text+0xe23): undefined reference to `sock_from_file'
When compiling a kernel with BPF and without NET.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Florent Revest <revest@chromium.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/bpf/20201208173623.1136863-1-revest@chromium.org
2020-12-08 18:23:36 -08:00
Zheng Yongjun
8daa76a52d net: core: devlink: simplify the return expression of devlink_nl_cmd_trap_set_doit()
Simplify the return expression.

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-08 16:22:54 -08:00