If a generic XDP program changes the destination MAC address from/to
multicast/broadcast, the skb->pkt_type is updated to properly handle
the packet when passed up the stack. When changing the MAC from/to
the NICs MAC, PACKET_HOST/OTHERHOST is not updated, though, making
the behavior different from that of native XDP.
Remember the PACKET_HOST/OTHERHOST state before calling the program
in generic XDP, and update pkt_type accordingly if the destination
MAC address has changed. As eth_type_trans() assumes a default
pkt_type of PACKET_HOST, restore that before calling it.
The use case for this is when a XDP program wants to push received
packets up the stack by rewriting the MAC to the NICs MAC, for
example by cluster nodes sharing MAC addresses.
Fixes: 2972495699 ("net: fix generic XDP to handle if eth header was mangled")
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210419141559.8611-1-martin@strongswan.org
netdev_unregister_timeout_secs=0 can lead to printing the
"waiting for dev to become free" message every jiffy.
This is too frequent and unnecessary.
Set the min value to 1 second.
Also fix the merge issue introduced by
"net: make unregister netdev warning timeout configurable":
it changed "refcnt != 1" to "refcnt".
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Fixes: 5aa3afe107 ("net: make unregister netdev warning timeout configurable")
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds dev_fill_forward_path() which resolves the path to reach
the real netdevice from the IP forwarding side. This function takes as
input the netdevice and the destination hardware address and it walks
down the devices calling .ndo_fill_forward_path() for each device until
the real device is found.
For instance, assuming the following topology:
IP forwarding
/ \
br0 eth0
/ \
eth1 eth2
.
.
.
ethX
ab💿ef🆎cd:ef
where eth1 and eth2 are bridge ports and eth0 provides WAN connectivity.
ethX is the interface in another box which is connected to the eth1
bridge port.
For packets going through IP forwarding to br0 whose destination MAC
address is ab💿ef🆎cd:ef, dev_fill_forward_path() provides the
following path:
br0 -> eth1
.ndo_fill_forward_path for br0 looks up at the FDB for the bridge port
from the destination MAC address to get the bridge port eth1.
This information allows to create a fast path that bypasses the classic
bridge and IP forwarding paths, so packets go directly from the bridge
port eth1 to eth0 (wan interface) and vice versa.
fast path
.------------------------.
/ \
| IP forwarding |
| / \ \/
| br0 eth0
. / \
-> eth1 eth2
.
.
.
ethX
ab💿ef🆎cd:ef
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
netdev_wait_allrefs() issues a warning if refcount does not drop to 0
after 10 seconds. While 10 second wait generally should not happen
under normal workload in normal environment, it seems to fire falsely
very often during fuzzing and/or in qemu emulation (~10x slower).
At least it's not possible to understand if it's really a false
positive or not. Automated testing generally bumps all timeouts
to very high values to avoid flake failures.
Add net.core.netdev_unregister_timeout_secs sysctl to make
the timeout configurable for automated testing systems.
Lowering the timeout may also be useful for e.g. manual bisection.
The default value matches the current behavior.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
When adding CONFIG_PCPU_DEV_REFCNT, I forgot that the
initial net device refcount was 0.
When CONFIG_PCPU_DEV_REFCNT is not set, this means
the first dev_hold() triggers an illegal refcount
operation (addition on 0)
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 0 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0x128/0x1a4
Fix is to change initial (and final) refcount to be 1.
Also add a missing kerneldoc piece, as reported by
Stephen Rothwell.
Fixes: 919067cc84 ("net: add CONFIG_PCPU_DEV_REFCNT")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Guenter Roeck <groeck@google.com>
Tested-by: Guenter Roeck <groeck@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since their introduction in commit 04157469b7 ("net: Use static_key
for XPS maps"), xps_needed and xps_rxqs_needed were never used outside
net/core/dev.c, so I don't really understand why they were exported as
symbols in the first place.
This is needed in order to silence a "make W=1" warning about these
static keys not being declared as static variables, but not having a
previous declaration in a header file nonetheless.
Cc: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I was working on a syzbot issue, claiming one device could not be
dismantled because its refcount was -1
unregister_netdevice: waiting for sit0 to become free. Usage count = -1
It would be nice if syzbot could trigger a warning at the time
this reference count became negative.
This patch adds CONFIG_PCPU_DEV_REFCNT options which defaults
to per cpu variables (as before this patch) on SMP builds.
v2: free_dev label in alloc_netdev_mqs() is moved to avoid
a compiler warning (-Wunused-label), as reported
by kernel test robot <lkp@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In __netif_set_xps_queue, old map entries from the old dev_maps are
freed but their corresponding entry in the old dev_maps aren't NULLed.
Fix this.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When setting up an new dev_maps in __netif_set_xps_queue, we remove and
free maps from unused CPUs/rx-queues near the end of the function; by
calling remove_xps_queue. However it's possible those maps are also part
of the old not-freed-yet dev_maps, which might be used concurrently.
When that happens, a map can be freed while its corresponding entry in
the old dev_maps table isn't NULLed, leading to: "BUG: KASAN:
use-after-free" in different places.
This fixes the map freeing logic for unused CPUs/rx-queues, to also NULL
the map entries from the old dev_maps table.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Improve the readability of the loop removing tx-queue from unused
CPUs/rx-queues in __netif_set_xps_queue. The change should only be
cosmetic.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds an helper, xps_copy_dev_maps, to copy maps from dev_maps
to new_dev_maps at a given index. The logic should be the same, with an
improved code readability and maintenance.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the xps maps (xps_cpus_map and xps_rxqs_map) to an array in
net_device. That will simplify a lot the code removing the need for lots
of if/else conditionals as the correct map will be available using its
offset in the array.
This should not modify the xps maps behaviour in any way.
Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the xps possible_mask. It was an optimization but we can just
loop from 0 to nr_ids now that it is embedded in the xps dev_maps. That
simplifies the code a bit.
Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Embed nr_ids (the number of cpu for the xps cpus map, and the number of
rxqs for the xps cpus map) in dev_maps. That will help not accessing out
of bound memory if those values change after dev_maps was allocated.
Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The xps cpus/rxqs map is accessed using dev->num_tc, which is used when
allocating the map. But later updates of dev->num_tc can lead to having
a mismatch between the maps and how they're accessed. In such cases the
map values do not make any sense and out of bound accesses can occur
(that can be easily seen using KASAN).
This patch aims at fixing this by embedding num_tc into the maps, using
the value at the time the map is created. This brings two improvements:
- The maps can be accessed using the embedded num_tc, so we know for
sure we won't have out of bound accesses.
- Checks can be made before accessing the maps so we know the values
retrieved will make sense.
We also update __netif_set_xps_queue to conditionally copy old maps from
dev_maps in the new one only if the number of traffic classes from both
maps match.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
__dev_alloc_name(), when supplied with a name containing '%d',
will search for the first available device number to generate a
unique device name.
Since commit ff92741270 ("net:
introduce name_node struct to be used in hashlist") network
devices may have alternate names. __dev_alloc_name() does take
these alternate names into account, possibly generating a name
that is already taken and failing with -ENFILE as a result.
This demonstrates the bug:
# rmmod dummy 2>/dev/null
# ip link property add dev lo altname dummy0
# modprobe dummy numdummies=1
modprobe: ERROR: could not insert 'dummy': Too many open files in system
Instead of creating a device named dummy1, modprobe fails.
Fix this by checking all the names in the d->name_node list, not just d->name.
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Fixes: ff92741270 ("net: introduce name_node struct to be used in hashlist")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, napi_thread_wait() checks for NAPI_STATE_SCHED bit to
determine if the kthread owns this napi and could call napi->poll() on
it. However, if socket busy poll is enabled, it is possible that the
busy poll thread grabs this SCHED bit (after the previous napi->poll()
invokes napi_complete_done() and clears SCHED bit) and tries to poll
on the same napi. napi_disable() could grab the SCHED bit as well.
This patch tries to fix this race by adding a new bit
NAPI_STATE_SCHED_THREADED in napi->state. This bit gets set in
____napi_schedule() if the threaded mode is enabled, and gets cleared
in napi_complete_done(), and we only poll the napi in kthread if this
bit is set. This helps distinguish the ownership of the napi between
kthread and other scenarios and fixes the race issue.
Fixes: 29863d41bb ("net: implement threaded-able napi poll loop support")
Reported-by: Martin Zaharinov <micron10@gmail.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Wei Wang <weiwan@google.com>
Cc: Alexander Duyck <alexanderduyck@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a non-initial netns is destroyed, the usual policy is to delete
all virtual network interfaces contained, but move physical interfaces
back to the initial netns. This keeps the physical interface visible
on the system.
CAN devices are somewhat special, as they define rtnl_link_ops even
if they are physical devices. If a CAN interface is moved into a
non-initial netns, destroying that netns lets the interface vanish
instead of moving it back to the initial netns. default_device_exit()
skips CAN interfaces due to having rtnl_link_ops set. Reproducer:
ip netns add foo
ip link set can0 netns foo
ip netns delete foo
WARNING: CPU: 1 PID: 84 at net/core/dev.c:11030 ops_exit_list+0x38/0x60
CPU: 1 PID: 84 Comm: kworker/u4:2 Not tainted 5.10.19 #1
Workqueue: netns cleanup_net
[<c010e700>] (unwind_backtrace) from [<c010a1d8>] (show_stack+0x10/0x14)
[<c010a1d8>] (show_stack) from [<c086dc10>] (dump_stack+0x94/0xa8)
[<c086dc10>] (dump_stack) from [<c086b938>] (__warn+0xb8/0x114)
[<c086b938>] (__warn) from [<c086ba10>] (warn_slowpath_fmt+0x7c/0xac)
[<c086ba10>] (warn_slowpath_fmt) from [<c0629f20>] (ops_exit_list+0x38/0x60)
[<c0629f20>] (ops_exit_list) from [<c062a5c4>] (cleanup_net+0x230/0x380)
[<c062a5c4>] (cleanup_net) from [<c0142c20>] (process_one_work+0x1d8/0x438)
[<c0142c20>] (process_one_work) from [<c0142ee4>] (worker_thread+0x64/0x5a8)
[<c0142ee4>] (worker_thread) from [<c0148a98>] (kthread+0x148/0x14c)
[<c0148a98>] (kthread) from [<c0100148>] (ret_from_fork+0x14/0x2c)
To properly restore physical CAN devices to the initial netns on owning
netns exit, introduce a flag on rtnl_link_ops that can be set by drivers.
For CAN devices setting this flag, default_device_exit() considers them
non-virtual, applying the usual namespace move.
The issue was introduced in the commit mentioned below, as at that time
CAN devices did not have a dellink() operation.
Fixes: e008b5fc8d ("net: Simplfy default_device_exit and improve batching.")
Link: https://lore.kernel.org/r/20210302122423.872326-1-martin@strongswan.org
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
For wireless devices (e.g. mt76 driver) multiple net_devices belongs to
the same wireless phy and the napi object is registered in a dummy
netdevice related to the wireless phy.
Export dev_set_threaded in order to be reused in device drivers enabling
threaded NAPI.
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
'hash' stores not the flow hash, but the index of the GRO bucket
corresponding to it.
Change its name to 'bucket' to avoid confusion while reading lines
like '__set_bit(hash, &napi->gro_bitmask)'.
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
GRO bucket index doesn't change through the entire function.
Store a pointer to the corresponding bucket instead of its member
and use it consistently through the function.
It is performance-safe since &gro_list->list == gro_list.
Misc: remove superfluous braces around single-line branches.
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
gro_list_prepare() always returns &napi->gro_hash[bucket].list,
without any variations. Moreover, it uses 'napi' argument only to
have access to this list, and calculates the bucket index for the
second time (firstly it happens at the beginning of
dev_gro_receive()) to do that.
Given that dev_gro_receive() already has an index to the needed
list, just pass it as the first argument to eliminate redundant
calculations, and make gro_list_prepare() return void.
Also, both arguments of gro_list_prepare() can be constified since
this function can only modify the skbs from the bucket list.
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of letting the code fall
through to the next case.
Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2021-02-16
The following pull-request contains BPF updates for your *net-next* tree.
There's a small merge conflict between 7eeba1706e ("tcp: Add receive timestamp
support for receive zerocopy.") from net-next tree and 9cacf81f81 ("bpf: Remove
extra lock_sock for TCP_ZEROCOPY_RECEIVE") from bpf-next tree. Resolve as follows:
[...]
lock_sock(sk);
err = tcp_zerocopy_receive(sk, &zc, &tss);
err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
&zc, &len, err);
release_sock(sk);
[...]
We've added 116 non-merge commits during the last 27 day(s) which contain
a total of 156 files changed, 5662 insertions(+), 1489 deletions(-).
The main changes are:
1) Adds support of pointers to types with known size among global function
args to overcome the limit on max # of allowed args, from Dmitrii Banshchikov.
2) Add bpf_iter for task_vma which can be used to generate information similar
to /proc/pid/maps, from Song Liu.
3) Enable bpf_{g,s}etsockopt() from all sock_addr related program hooks. Allow
rewriting bind user ports from BPF side below the ip_unprivileged_port_start
range, both from Stanislav Fomichev.
4) Prevent recursion on fentry/fexit & sleepable programs and allow map-in-map
as well as per-cpu maps for the latter, from Alexei Starovoitov.
5) Add selftest script to run BPF CI locally. Also enable BPF ringbuffer
for sleepable programs, both from KP Singh.
6) Extend verifier to enable variable offset read/write access to the BPF
program stack, from Andrei Matei.
7) Improve tc & XDP MTU handling and add a new bpf_check_mtu() helper to
query device MTU from programs, from Jesper Dangaard Brouer.
8) Allow bpf_get_socket_cookie() helper also be called from [sleepable] BPF
tracing programs, from Florent Revest.
9) Extend x86 JIT to pad JMPs with NOPs for helping image to converge when
otherwise too many passes are required, from Gary Lin.
10) Verifier fixes on atomics with BPF_FETCH as well as function-by-function
verification both related to zero-extension handling, from Ilya Leoshkevich.
11) Better kernel build integration of resolve_btfids tool, from Jiri Olsa.
12) Batch of AF_XDP selftest cleanups and small performance improvement
for libbpf's xsk map redirect for newer kernels, from Björn Töpel.
13) Follow-up BPF doc and verifier improvements around atomics with
BPF_FETCH, from Brendan Jackman.
14) Permit zero-sized data sections e.g. if ELF .rodata section contains
read-only data from local variables, from Yonghong Song.
15) veth driver skb bulk-allocation for ndo_xdp_xmit, from Lorenzo Bianconi.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
napi_frags_finish() and napi_skb_finish() can only be called inside
NAPI Rx context, so we can feed NAPI cache with skbuff_heads that
got NAPI_MERGED_FREE verdict instead of immediate freeing.
Replace __kfree_skb() with __kfree_skb_defer() in napi_skb_finish()
and move napi_skb_free_stolen_head() to skbuff.c, so it can drop skbs
to NAPI cache.
As many drivers call napi_alloc_skb()/napi_get_frags() on their
receive path, this becomes especially useful.
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
This function isn't much needed as NAPI skb queue gets bulk-freed
anyway when there's no more room, and even may reduce the efficiency
of bulk operations.
It will be even less needed after reusing skb cache on allocation path,
so remove it and this way lighten network softirqs a bit.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
The use-case for dropping the MTU check when TC-BPF does redirect to
ingress, is described by Eyal Birger in email[0]. The summary is the
ability to increase packet size (e.g. with IPv6 headers for NAT64) and
ingress redirect packet and let normal netstack fragment packet as needed.
[0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19Qo4W4Q@mail.gmail.com/
V15:
- missing static for function declaration
V9:
- Make net_device "up" (IFF_UP) check explicit in skb_do_redirect
V4:
- Keep net_device "up" (IFF_UP) check.
- Adjustment to handle bpf_redirect_peer() helper
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/161287790971.790810.11785274340154740591.stgit@firesoul
dev_ifsioc_locked() is called with only RCU read lock, so when
there is a parallel writer changing the mac address, it could
get a partially updated mac address, as shown below:
Thread 1 Thread 2
// eth_commit_mac_addr_change()
memcpy(dev->dev_addr, addr->sa_data, ETH_ALEN);
// dev_ifsioc_locked()
memcpy(ifr->ifr_hwaddr.sa_data,
dev->dev_addr,...);
Close this race condition by guarding them with a RW semaphore,
like netdev_get_name(). We can not use seqlock here as it does not
allow blocking. The writers already take RTNL anyway, so this does
not affect the slow path. To avoid bothering existing
dev_set_mac_address() callers in drivers, introduce a new wrapper
just for user-facing callers on ioctl and rtnetlink paths.
Note, bonding also changes slave mac addresses but that requires
a separate patch due to the complexity of bonding code.
Fixes: 3710becf8a ("net: RCU locking for simple ioctl()")
Reported-by: "Gong, Sishuai" <sishuai@purdue.edu>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds a new sysfs attribute to the network device class.
Said attribute provides a per-device control to enable/disable the
threaded mode for all the napi instances of the given network device,
without the need for a device up/down.
User sets it to 1 or 0 to enable or disable threaded mode.
Note: when switching between threaded and the current softirq based mode
for a napi instance, it will not immediately take effect if the napi is
currently being polled. The mode switch will happen for the next time
napi_schedule() is called.
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Co-developed-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch allows running each napi poll loop inside its own
kernel thread.
The kthread is created during netif_napi_add() if dev->threaded
is set. And threaded mode is enabled in napi_enable(). We will
provide a way to set dev->threaded and enable threaded mode
without a device up/down in the following patch.
Once that threaded mode is enabled and the kthread is
started, napi_schedule() will wake-up such thread instead
of scheduling the softirq.
The threaded poll loop behaves quite likely the net_rx_action,
but it does not have to manipulate local irqs and uses
an explicit scheduling point based on netdev_budget.
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Co-developed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit introduces a new function __napi_poll() which does the main
logic of the existing napi_poll() function, and will be called by other
functions in later commits.
This idea and implementation is done by Felix Fietkau <nbd@nbd.name> and
is proposed as part of the patch to move napi work to work_queue
context.
This commit by itself is a code restructure.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit c80794323e ("net: Fix packet reordering caused by GRO and
listified RX cooperation") had the unfortunate effect of adding
latencies in common workloads.
Before the patch, GRO packets were immediately passed to
upper stacks.
After the patch, we can accumulate quite a lot of GRO
packets (depdending on NAPI budget).
My fix is counting in napi->rx_count number of segments
instead of number of logical packets.
Fixes: c80794323e ("net: Fix packet reordering caused by GRO and listified RX cooperation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Bisected-by: John Sperbeck <jsperbeck@google.com>
Tested-by: Jian Yang <jianyang@google.com>
Cc: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Alexander Lobakin <alobakin@pm.me>
Link: https://lore.kernel.org/r/20210204213146.4192368-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
NETIF_F_IP|IPV6_CSUM feature flag indicates UDP and TCP csum offload
while NETIF_F_HW_CSUM feature flag indicates ip generic csum offload
for HW, which includes not only for TCP/UDP csum, but also for other
protocols' csum like GRE's.
However, in skb_csum_hwoffload_help() it only checks features against
NETIF_F_CSUM_MASK(NETIF_F_HW|IP|IPV6_CSUM). So if it's a non TCP/UDP
packet and the features doesn't support NETIF_F_HW_CSUM, but supports
NETIF_F_IP|IPV6_CSUM only, it would still return 0 and leave the HW
to do csum.
This patch is to support ip generic csum processing by checking
NETIF_F_HW_CSUM for all protocols, and check (NETIF_F_IP_CSUM |
NETIF_F_IPV6_CSUM) only for TCP and UDP.
Note that we're using skb->csum_offset to check if it's a TCP/UDP
proctol, this might be fragile. However, as Alex said, for now we
only have a few L4 protocols that are requesting Tx csum offload,
we'd better fix this until a new protocol comes with a same csum
offset.
v1->v2:
- not extend skb->csum_not_inet, but use skb->csum_offset to tell
if it's an UDP/TCP csum packet.
v2->v3:
- add a note in the changelog, as Willem suggested.
Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds TCP_NLA_TTL to SCM_TIMESTAMPING_OPT_STATS that exports
the time-to-live or hop limit of the latest incoming packet with
SCM_TSTAMP_ACK. The value exported may not be from the packet that acks
the sequence when incoming packets are aggregated. Exporting the
time-to-live or hop limit value of incoming packets helps to estimate
the hop count of the path of the flow that may change over time.
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20210120204155.552275-1-ysseung@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch add the TCA_FLOWER_KEY_CT_FLAGS_INVALID flag to
match the ct_state with invalid for conntrack.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://lore.kernel.org/r/1611045110-682-1-git-send-email-wenxu@ucloud.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Similar to the change for rollback_registered() -
rollback_registered_many() was a part of unregister_netdevice_many()
minus the net_set_todo(), which is no longer needed.
Functionally this patch moves the list_empty() check back after:
BUG_ON(dev_boot_phase);
ASSERT_RTNL();
but I can't find any reason why that would be an issue.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move rollback_registered_many() and add a temporary
forward declaration to make merging the code into
unregister_netdevice_many() easier to review.
No functional changes.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
rollback_registered() is a local helper, it's common for driver
code to call unregister_netdevice_queue(dev, NULL) when they
want to unregister netdevices under rtnl_lock. Inline
rollback_registered() and adjust the only remaining caller.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Commit 93ee31f14f ("[NET]: Fix free_netdev on register_netdev
failure.") moved net_set_todo() outside of rollback_registered()
so that rollback_registered() can be used in the failure path of
register_netdevice() but without risking a double free.
Since commit cf124db566 ("net: Fix inconsistent teardown and
release of private netdev state."), however, we have a better
way of handling that condition, since destructors don't call
free_netdev() directly.
After the change in commit c269a24ce0 ("net: make free_netdev()
more lenient with unregistering devices") we can now move
net_set_todo() back.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Conflicts:
drivers/net/can/dev.c
commit 03f16c5075 ("can: dev: can_restart: fix use after free bug")
commit 3e77f70e73 ("can: dev: move driver related infrastructure into separate subdir")
Code move.
drivers/net/dsa/b53/b53_common.c
commit 8e4052c32d ("net: dsa: b53: fix an off by one in checking "vlan->vid"")
commit b7a9e0da2d ("net: switchdev: remove vid_begin -> vid_end range from VLAN objects")
Field rename.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With NETIF_F_HW_TLS_RX packets are decrypted in HW. This cannot be
logically done when RXCSUM offload is off.
Fixes: 14136564c8 ("net: Add TLS RX offload feature")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Link: https://lore.kernel.org/r/20210117151538.9411-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch is to define a inline function skb_csum_is_sctp(), and
also replace all places where it checks if it's a SCTP CSUM skb.
This function would be used later in many networking drivers in
the following patches.
Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ndo_sk_get_lower_dev returns the lower netdev that corresponds to
a given socket.
Additionally, we implement a helper netdev_sk_get_lowest_dev() to get
the lowest one in chain.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2021-01-16
1) Extend atomic operations to the BPF instruction set along with x86-64 JIT support,
that is, atomic{,64}_{xchg,cmpxchg,fetch_{add,and,or,xor}}, from Brendan Jackman.
2) Add support for using kernel module global variables (__ksym externs in BPF
programs) retrieved via module's BTF, from Andrii Nakryiko.
3) Generalize BPF stackmap's buildid retrieval and add support to have buildid
stored in mmap2 event for perf, from Jiri Olsa.
4) Various fixes for cross-building BPF sefltests out-of-tree which then will
unblock wider automated testing on ARM hardware, from Jean-Philippe Brucker.
5) Allow to retrieve SOL_SOCKET opts from sock_addr progs, from Daniel Borkmann.
6) Clean up driver's XDP buffer init and split into two helpers to init per-
descriptor and non-changing fields during processing, from Lorenzo Bianconi.
7) Minor misc improvements to libbpf & bpftool, from Ian Rogers.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (41 commits)
perf: Add build id data in mmap2 event
bpf: Add size arg to build_id_parse function
bpf: Move stack_map_get_build_id into lib
bpf: Document new atomic instructions
bpf: Add tests for new BPF atomic operations
bpf: Add bitwise atomic instructions
bpf: Pull out a macro for interpreting atomic ALU operations
bpf: Add instructions for atomic_[cmp]xchg
bpf: Add BPF_FETCH field / create atomic_fetch_add instruction
bpf: Move BPF_STX reserved field check into BPF_STX verifier code
bpf: Rename BPF_XADD and prepare to encode other atomics in .imm
bpf: x86: Factor out a lookup table for some ALU opcodes
bpf: x86: Factor out emission of REX byte
bpf: x86: Factor out emission of ModR/M for *(reg + off)
tools/bpftool: Add -Wall when building BPF programs
bpf, libbpf: Avoid unused function warning on bpf_tail_call_static
selftests/bpf: Install btf_dump test cases
selftests/bpf: Fix installation of urandom_read
selftests/bpf: Move generated test files to $(TEST_GEN_FILES)
selftests/bpf: Fix out-of-tree build
...
====================
Link: https://lore.kernel.org/r/20210116012922.17823-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cited patch below blocked the TLS TX device offload unless HW_CSUM
is set. This broke devices that use IP_CSUM && IP6_CSUM.
Here we fix it.
Note that the single HW_TLS_TX feature flag indicates support for
both IPv4/6, hence it should still be disabled in case only one of
(IP_CSUM | IPV6_CSUM) is set.
Fixes: ae0b04b238 ("net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reported-by: Rohit Maheshwari <rohitm@chelsio.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Link: https://lore.kernel.org/r/20210114151215.7061-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>