/*
 *	NET3	Protocol independent device support routines.
 *
 *		This program is free software; you can redistribute it and/or
 *		modify it under the terms of the GNU General Public License
 *		as published by the Free Software Foundation; either version
 *		2 of the License, or (at your option) any later version.
 *
 *	Derived from the non IP parts of dev.c 1.0.19
 *		Authors:	Ross Biro
 *				Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG>
 *				Mark Evans, <evansmp@uhura.aston.ac.uk>
 *
 *	Additional Authors:
 *		Florian la Roche <rzsfl@rz.uni-sb.de>
 *		Alan Cox <gw4pts@gw4pts.ampr.org>
 *		David Hinds <dahinds@users.sourceforge.net>
 *		Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
 *		Adam Sulmicki <adam@cfar.umd.edu>
 *		Pekka Riikonen <priikone@poesidon.pspt.fi>
 *
 *	Changes:
 *		D.J. Barrow	:	Fixed bug where dev->refcnt gets set
 *					to 2 if register_netdev gets called
 *					before net_dev_init & also removed a
 *					few lines of code in the process.
 *		Alan Cox	:	device private ioctl copies fields back.
 *		Alan Cox	:	Transmit queue code does relevant
 *					stunts to keep the queue safe.
 *		Alan Cox	:	Fixed double lock.
 *		Alan Cox	:	Fixed promisc NULL pointer trap
 *		????????	:	Support the full private ioctl range
 *		Alan Cox	:	Moved ioctl permission check into
 *					drivers
 *		Tim Kordas	:	SIOCADDMULTI/SIOCDELMULTI
 *		Alan Cox	:	100 backlog just doesn't cut it when
 *					you start doing multicast video 8)
 *		Alan Cox	:	Rewrote net_bh and list manager.
 *		Alan Cox	:	Fix ETH_P_ALL echoback lengths.
 *		Alan Cox	:	Took out transmit every packet pass
 *					Saved a few bytes in the ioctl handler
 *		Alan Cox	:	Network driver sets packet type before
 *					calling netif_rx. Saves a function
 *					call a packet.
 *		Alan Cox	:	Hashed net_bh()
 *		Richard Kooijman:	Timestamp fixes.
 *		Alan Cox	:	Wrong field in SIOCGIFDSTADDR
 *		Alan Cox	:	Device lock protection.
 *		Alan Cox	:	Fixed nasty side effect of device close
 *					changes.
 *		Rudi Cilibrasi	:	Pass the right thing to
 *					set_mac_address()
 *		Dave Miller	:	32bit quantity for the device lock to
 *					make it work out on a Sparc.
 *		Bjorn Ekwall	:	Added KERNELD hack.
 *		Alan Cox	:	Cleaned up the backlog initialise.
 *		Craig Metz	:	SIOCGIFCONF fix if space for under
 *					1 device.
 *		Thomas Bogendoerfer :	Return ENODEV for dev_open, if there
 *					is no device open function.
 *		Andi Kleen	:	Fix error reporting for SIOCGIFCONF
 *		Michael Chastain :	Fix signed/unsigned for SIOCGIFCONF
 *		Cyrus Durgin	:	Cleaned for KMOD
 *		Adam Sulmicki	:	Bug Fix : Network Device Unload
 *					A network device unload needs to purge
 *					the backlog queue.
 *		Paul Rusty Russell :	SIOCSIFNAME
 *		Pekka Riikonen	:	Netdev boot-time settings code
 *		Andrew Morton	:	Make unregister_netdevice wait
 *					indefinitely on dev->refcnt
 *		J Hadi Salim	:	- Backlog queue sampling
 *					- netif_rx() feedback
 */

#include <linux/uaccess.h>
#include <linux/bitops.h>
#include <linux/capability.h>
#include <linux/cpu.h>
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/hash.h>
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/sched/mm.h>
#include <linux/mutex.h>
#include <linux/string.h>
#include <linux/mm.h>
#include <linux/socket.h>
#include <linux/sockios.h>
#include <linux/errno.h>
#include <linux/interrupt.h>
#include <linux/if_ether.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/ethtool.h>
#include <linux/notifier.h>
#include <linux/skbuff.h>
#include <linux/bpf.h>
#include <linux/bpf_trace.h>
#include <net/net_namespace.h>
#include <net/sock.h>
#include <net/busy_poll.h>
#include <linux/rtnetlink.h>
#include <linux/stat.h>
#include <net/dst.h>
#include <net/dst_metadata.h>
#include <net/pkt_sched.h>
#include <net/pkt_cls.h>
#include <net/checksum.h>
#include <net/xfrm.h>
#include <linux/highmem.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/netpoll.h>
#include <linux/rcupdate.h>
#include <linux/delay.h>
#include <net/iw_handler.h>
#include <asm/current.h>
#include <linux/audit.h>
#include <linux/dmaengine.h>
#include <linux/err.h>
#include <linux/ctype.h>
#include <linux/if_arp.h>
#include <linux/if_vlan.h>
#include <linux/ip.h>
#include <net/ip.h>
#include <net/mpls.h>
#include <linux/ipv6.h>
#include <linux/in.h>
#include <linux/jhash.h>
#include <linux/random.h>
#include <trace/events/napi.h>
#include <trace/events/net.h>
#include <trace/events/skb.h>
#include <linux/pci.h>
#include <linux/inetdevice.h>
#include <linux/cpu_rmap.h>
#include <linux/static_key.h>
#include <linux/hashtable.h>
#include <linux/vmalloc.h>
#include <linux/if_macvlan.h>
#include <linux/errqueue.h>
#include <linux/hrtimer.h>
#include <linux/netfilter_ingress.h>
#include <linux/crash_dump.h>
#include <linux/sctp.h>
#include <net/udp_tunnel.h>

#include "net-sysfs.h"

/* Instead of increasing this, you should create a hash table. */
#define MAX_GRO_SKBS 8

/* This should be increased if a protocol with a bigger head is added. */
#define GRO_MAX_HEAD (MAX_HEADER + 128)

static DEFINE_SPINLOCK(ptype_lock);
static DEFINE_SPINLOCK(offload_lock);
struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
struct list_head ptype_all __read_mostly;	/* Taps */
static struct list_head offload_base __read_mostly;

static int netif_rx_internal(struct sk_buff *skb);
static int call_netdevice_notifiers_info(unsigned long val,
					 struct net_device *dev,
					 struct netdev_notifier_info *info);
static struct napi_struct *napi_by_id(unsigned int napi_id);

/*
 * The @dev_base_head list is protected by @dev_base_lock and the rtnl
 * semaphore.
 *
 * Pure readers hold dev_base_lock for reading, or rcu_read_lock()
 *
 * Writers must hold the rtnl semaphore while they loop through the
 * dev_base_head list, and hold dev_base_lock for writing when they do the
 * actual updates.  This allows pure readers to access the list even
 * while a writer is preparing to update it.
 *
 * To put it another way, dev_base_lock is held for writing only to
 * protect against pure readers; the rtnl semaphore provides the
 * protection against other writers.
 *
 * See, for example usages, register_netdevice() and
 * unregister_netdevice(), which must be called with the rtnl
 * semaphore held.
 */
DEFINE_RWLOCK(dev_base_lock);
EXPORT_SYMBOL(dev_base_lock);

static DEFINE_MUTEX(ifalias_mutex);

/* protects napi_hash addition/deletion and napi_gen_id */
static DEFINE_SPINLOCK(napi_hash_lock);

static unsigned int napi_gen_id = NR_CPUS;
static DEFINE_READ_MOSTLY_HASHTABLE(napi_hash, 8);

static seqcount_t devnet_rename_seq;

/* Bump the per-namespace dev_base_seq counter; the value 0 is skipped on wrap. */
static inline void dev_base_seq_inc(struct net *net)
{
	while (++net->dev_base_seq == 0)
		;
}

static inline struct hlist_head *dev_name_hash(struct net *net, const char *name)
{
	unsigned int hash = full_name_hash(net, name, strnlen(name, IFNAMSIZ));

	return &net->dev_name_head[hash_32(hash, NETDEV_HASHBITS)];
}

static inline struct hlist_head *dev_index_hash(struct net *net, int ifindex)
{
	return &net->dev_index_head[ifindex & (NETDEV_HASHENTRIES - 1)];
}

/* Lock the backlog input queue; this is a no-op unless CONFIG_RPS is set. */
static inline void rps_lock(struct softnet_data *sd)
{
#ifdef CONFIG_RPS
	spin_lock(&sd->input_pkt_queue.lock);
#endif
}

static inline void rps_unlock(struct softnet_data *sd)
{
#ifdef CONFIG_RPS
	spin_unlock(&sd->input_pkt_queue.lock);
#endif
}

/* Device list insertion */
static void list_netdevice(struct net_device *dev)
{
	struct net *net = dev_net(dev);

	ASSERT_RTNL();

	write_lock_bh(&dev_base_lock);
	list_add_tail_rcu(&dev->dev_list, &net->dev_base_head);
	hlist_add_head_rcu(&dev->name_hlist, dev_name_hash(net, dev->name));
	hlist_add_head_rcu(&dev->index_hlist,
			   dev_index_hash(net, dev->ifindex));
	write_unlock_bh(&dev_base_lock);

	dev_base_seq_inc(net);
}

/* Device list removal
 * caller must respect a RCU grace period before freeing/reusing dev
 */
static void unlist_netdevice(struct net_device *dev)
{
	ASSERT_RTNL();

	/* Unlink dev from the device chain */
	write_lock_bh(&dev_base_lock);
	list_del_rcu(&dev->dev_list);
	hlist_del_rcu(&dev->name_hlist);
	hlist_del_rcu(&dev->index_hlist);
	write_unlock_bh(&dev_base_lock);

	dev_base_seq_inc(dev_net(dev));
}

/*
 *	Our notifier list
 */

static RAW_NOTIFIER_HEAD(netdev_chain);

/*
 *	Device drivers call our routines to queue packets here. We empty the
 *	queue in the local softnet handler.
 */

DEFINE_PER_CPU_ALIGNED(struct softnet_data, softnet_data);
EXPORT_PER_CPU_SYMBOL(softnet_data);

#ifdef CONFIG_LOCKDEP
/*
 * register_netdevice() inits txq->_xmit_lock and sets lockdep class
 * according to dev->type
 */
static const unsigned short netdev_lock_type[] = {
	 ARPHRD_NETROM, ARPHRD_ETHER, ARPHRD_EETHER, ARPHRD_AX25,
	 ARPHRD_PRONET, ARPHRD_CHAOS, ARPHRD_IEEE802, ARPHRD_ARCNET,
	 ARPHRD_APPLETLK, ARPHRD_DLCI, ARPHRD_ATM, ARPHRD_METRICOM,
	 ARPHRD_IEEE1394, ARPHRD_EUI64, ARPHRD_INFINIBAND, ARPHRD_SLIP,
	 ARPHRD_CSLIP, ARPHRD_SLIP6, ARPHRD_CSLIP6, ARPHRD_RSRVD,
	 ARPHRD_ADAPT, ARPHRD_ROSE, ARPHRD_X25, ARPHRD_HWX25,
	 ARPHRD_PPP, ARPHRD_CISCO, ARPHRD_LAPB, ARPHRD_DDCMP,
	 ARPHRD_RAWHDLC, ARPHRD_TUNNEL, ARPHRD_TUNNEL6, ARPHRD_FRAD,
	 ARPHRD_SKIP, ARPHRD_LOOPBACK, ARPHRD_LOCALTLK, ARPHRD_FDDI,
	 ARPHRD_BIF, ARPHRD_SIT, ARPHRD_IPDDP, ARPHRD_IPGRE,
	 ARPHRD_PIMREG, ARPHRD_HIPPI, ARPHRD_ASH, ARPHRD_ECONET,
	 ARPHRD_IRDA, ARPHRD_FCPP, ARPHRD_FCAL, ARPHRD_FCPL,
	 ARPHRD_FCFABRIC, ARPHRD_IEEE80211, ARPHRD_IEEE80211_PRISM,
	 ARPHRD_IEEE80211_RADIOTAP, ARPHRD_PHONET, ARPHRD_PHONET_PIPE,
	 ARPHRD_IEEE802154, ARPHRD_VOID, ARPHRD_NONE};

static const char *const netdev_lock_name[] = {
	"_xmit_NETROM", "_xmit_ETHER", "_xmit_EETHER", "_xmit_AX25",
	"_xmit_PRONET", "_xmit_CHAOS", "_xmit_IEEE802", "_xmit_ARCNET",
	"_xmit_APPLETLK", "_xmit_DLCI", "_xmit_ATM", "_xmit_METRICOM",
	"_xmit_IEEE1394", "_xmit_EUI64", "_xmit_INFINIBAND", "_xmit_SLIP",
	"_xmit_CSLIP", "_xmit_SLIP6", "_xmit_CSLIP6", "_xmit_RSRVD",
	"_xmit_ADAPT", "_xmit_ROSE", "_xmit_X25", "_xmit_HWX25",
	"_xmit_PPP", "_xmit_CISCO", "_xmit_LAPB", "_xmit_DDCMP",
	"_xmit_RAWHDLC", "_xmit_TUNNEL", "_xmit_TUNNEL6", "_xmit_FRAD",
	"_xmit_SKIP", "_xmit_LOOPBACK", "_xmit_LOCALTLK", "_xmit_FDDI",
	"_xmit_BIF", "_xmit_SIT", "_xmit_IPDDP", "_xmit_IPGRE",
	"_xmit_PIMREG", "_xmit_HIPPI", "_xmit_ASH", "_xmit_ECONET",
	"_xmit_IRDA", "_xmit_FCPP", "_xmit_FCAL", "_xmit_FCPL",
	"_xmit_FCFABRIC", "_xmit_IEEE80211", "_xmit_IEEE80211_PRISM",
	"_xmit_IEEE80211_RADIOTAP", "_xmit_PHONET", "_xmit_PHONET_PIPE",
	"_xmit_IEEE802154", "_xmit_VOID", "_xmit_NONE"};

static struct lock_class_key netdev_xmit_lock_key[ARRAY_SIZE(netdev_lock_type)];
static struct lock_class_key netdev_addr_lock_key[ARRAY_SIZE(netdev_lock_type)];

static inline unsigned short netdev_lock_pos(unsigned short dev_type)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(netdev_lock_type); i++)
		if (netdev_lock_type[i] == dev_type)
			return i;
	/* the last key is used by default */
	return ARRAY_SIZE(netdev_lock_type) - 1;
}

static inline void netdev_set_xmit_lockdep_class(spinlock_t *lock,
						 unsigned short dev_type)
{
	int i;

	i = netdev_lock_pos(dev_type);
	lockdep_set_class_and_name(lock, &netdev_xmit_lock_key[i],
				   netdev_lock_name[i]);
}

static inline void netdev_set_addr_lockdep_class(struct net_device *dev)
{
	int i;

	i = netdev_lock_pos(dev->type);
	lockdep_set_class_and_name(&dev->addr_list_lock,
				   &netdev_addr_lock_key[i],
				   netdev_lock_name[i]);
}
#else
static inline void netdev_set_xmit_lockdep_class(spinlock_t *lock,
						 unsigned short dev_type)
{
}
static inline void netdev_set_addr_lockdep_class(struct net_device *dev)
{
}
#endif

/*******************************************************************************
|
2017-02-09 06:56:06 +00:00
|
|
|
*
|
|
|
|
* Protocol management and registration routines
|
|
|
|
*
|
|
|
|
*******************************************************************************/
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Add a protocol ID to the list. Now that the input handler is
|
|
|
|
* smarter we can dispense with all the messy stuff that used to be
|
|
|
|
* here.
|
|
|
|
*
|
|
|
|
* BEWARE!!! Protocol handlers, mangling input packets,
|
|
|
|
* MUST BE last in hash buckets and checking protocol handlers
|
|
|
|
* MUST start from promiscuous ptype_all chain in net_bh.
|
|
|
|
* It is true now, do not change it.
|
|
|
|
* Explanation follows: if protocol handler, mangling packet, will
|
|
|
|
* be the first on list, it is not able to sense, that packet
|
|
|
|
* is cloned and should be copied-on-write, so that it will
|
|
|
|
* change it and subsequent readers will get broken packet.
|
|
|
|
* --ANK (980803)
|
|
|
|
*/
|
|
|
|
|
2010-09-02 03:53:46 +00:00
|
|
|
static inline struct list_head *ptype_head(const struct packet_type *pt)
|
|
|
|
{
|
|
|
|
if (pt->type == htons(ETH_P_ALL))
|
2015-01-27 19:35:48 +00:00
|
|
|
return pt->dev ? &pt->dev->ptype_all : &ptype_all;
|
2010-09-02 03:53:46 +00:00
|
|
|
else
|
2015-01-27 19:35:48 +00:00
|
|
|
return pt->dev ? &pt->dev->ptype_specific :
|
|
|
|
&ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];
|
2010-09-02 03:53:46 +00:00
|
|
|
}
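/*
 * Illustrative sketch, not part of dev.c: how the list selection in
 * ptype_head() could be modelled in user space. Per-device lists are left
 * out, and the bucket count of 16 is an assumption for this example; all
 * demo_ and DEMO_ names are hypothetical.
 */

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

#define DEMO_ETH_P_ALL	0x0003	/* same value as ETH_P_ALL */
#define DEMO_HASH_SIZE	16	/* assumed bucket count for this sketch */
#define DEMO_HASH_MASK	(DEMO_HASH_SIZE - 1)
#define DEMO_BUCKET_ALL	(-1)	/* stands in for the ptype_all list */

/* type_be is the protocol ID in network byte order, like packet_type.type. */
static int demo_ptype_bucket(uint16_t type_be)
{
	int bucket;

	if (type_be == htons(DEMO_ETH_P_ALL))
		return DEMO_BUCKET_ALL;

	/* hash on the host-order value so the result is endian-independent */
	bucket = ntohs(type_be) & DEMO_HASH_MASK;
	assert(bucket >= 0 && bucket < DEMO_HASH_SIZE);
	return bucket;
}
```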
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/**
|
|
|
|
* dev_add_pack - add packet handler
|
|
|
|
* @pt: packet type declaration
|
|
|
|
*
|
|
|
|
* Add a protocol handler to the networking stack. The passed &packet_type
|
|
|
|
* is linked into kernel lists and may not be freed until it has been
|
|
|
|
* removed from the kernel lists.
|
|
|
|
*
|
2007-02-09 14:24:36 +00:00
|
|
|
* This call does not sleep, therefore it cannot
|
2005-04-16 22:20:36 +00:00
|
|
|
* guarantee that all CPUs that are in the middle of receiving packets
|
|
|
|
* will see the new packet type (until the next received packet).
|
|
|
|
*/
|
|
|
|
|
|
|
|
void dev_add_pack(struct packet_type *pt)
|
|
|
|
{
|
2010-09-02 03:53:46 +00:00
|
|
|
struct list_head *head = ptype_head(pt);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2010-09-02 03:53:46 +00:00
|
|
|
spin_lock(&ptype_lock);
|
|
|
|
list_add_rcu(&pt->list, head);
|
|
|
|
spin_unlock(&ptype_lock);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(dev_add_pack);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/**
|
|
|
|
* __dev_remove_pack - remove packet handler
|
|
|
|
* @pt: packet type declaration
|
|
|
|
*
|
|
|
|
* Remove a protocol handler that was previously added to the kernel
|
|
|
|
* protocol handlers by dev_add_pack(). The passed &packet_type is removed
|
|
|
|
* from the kernel lists and can be freed or reused once this function
|
2007-02-09 14:24:36 +00:00
|
|
|
* returns.
|
2005-04-16 22:20:36 +00:00
|
|
|
*
|
|
|
|
* The packet type might still be in use by receivers
|
|
|
|
* and must not be freed until after all CPUs have gone
|
|
|
|
* through a quiescent state.
|
|
|
|
*/
|
|
|
|
void __dev_remove_pack(struct packet_type *pt)
|
|
|
|
{
|
2010-09-02 03:53:46 +00:00
|
|
|
struct list_head *head = ptype_head(pt);
|
2005-04-16 22:20:36 +00:00
|
|
|
struct packet_type *pt1;
|
|
|
|
|
2010-09-02 03:53:46 +00:00
|
|
|
spin_lock(&ptype_lock);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
list_for_each_entry(pt1, head, list) {
|
|
|
|
if (pt == pt1) {
|
|
|
|
list_del_rcu(&pt->list);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2012-02-01 10:54:43 +00:00
|
|
|
pr_warn("dev_remove_pack: %p not found\n", pt);
|
2005-04-16 22:20:36 +00:00
|
|
|
out:
|
2010-09-02 03:53:46 +00:00
|
|
|
spin_unlock(&ptype_lock);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(__dev_remove_pack);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/**
|
|
|
|
* dev_remove_pack - remove packet handler
|
|
|
|
* @pt: packet type declaration
|
|
|
|
*
|
|
|
|
* Remove a protocol handler that was previously added to the kernel
|
|
|
|
* protocol handlers by dev_add_pack(). The passed &packet_type is removed
|
|
|
|
* from the kernel lists and can be freed or reused once this function
|
|
|
|
* returns.
|
|
|
|
*
|
|
|
|
* This call sleeps to guarantee that no CPU is looking at the packet
|
|
|
|
* type after return.
|
|
|
|
*/
|
|
|
|
void dev_remove_pack(struct packet_type *pt)
|
|
|
|
{
|
|
|
|
__dev_remove_pack(pt);
|
2007-02-09 14:24:36 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
synchronize_net();
|
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(dev_remove_pack);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2012-11-15 08:49:10 +00:00
|
|
|
|
|
|
|
/**
|
|
|
|
* dev_add_offload - register offload handlers
|
|
|
|
* @po: protocol offload declaration
|
|
|
|
*
|
|
|
|
* Add protocol offload handlers to the networking stack. The passed
|
|
|
|
* &packet_offload is linked into kernel lists and may not be freed until
|
|
|
|
* it has been removed from the kernel lists.
|
|
|
|
*
|
|
|
|
* This call does not sleep, therefore it cannot
|
|
|
|
* guarantee that all CPUs that are in the middle of receiving packets
|
|
|
|
* will see the new offload handlers (until the next received packet).
|
|
|
|
*/
|
|
|
|
void dev_add_offload(struct packet_offload *po)
|
|
|
|
{
|
2015-06-01 21:56:09 +00:00
|
|
|
struct packet_offload *elem;
|
2012-11-15 08:49:10 +00:00
|
|
|
|
|
|
|
spin_lock(&offload_lock);
|
2015-06-01 21:56:09 +00:00
|
|
|
list_for_each_entry(elem, &offload_base, list) {
|
|
|
|
if (po->priority < elem->priority)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
list_add_rcu(&po->list, elem->list.prev);
|
2012-11-15 08:49:10 +00:00
|
|
|
spin_unlock(&offload_lock);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(dev_add_offload);
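/*
 * Illustrative sketch, not part of dev.c: the priority-ordered insertion
 * used by dev_add_offload(), modelled on a plain circular doubly linked
 * list. Locking and RCU publication are deliberately left out, and all
 * demo_ names are hypothetical.
 */

```c
#include <assert.h>
#include <stddef.h>

struct demo_list {
	struct demo_list *prev, *next;
};

struct demo_offload {
	struct demo_list list;	/* kept first so a node pointer is also an entry pointer */
	int priority;
};

static void demo_list_init(struct demo_list *head)
{
	head->prev = head->next = head;
}

/* Link new_node directly after pos. */
static void demo_list_add(struct demo_list *new_node, struct demo_list *pos)
{
	new_node->next = pos->next;
	new_node->prev = pos;
	pos->next->prev = new_node;
	pos->next = new_node;
}

/* Insert po before the first entry whose priority value is strictly larger. */
static void demo_add_offload(struct demo_list *head, struct demo_offload *po)
{
	struct demo_list *pos;

	assert(head && po);
	for (pos = head->next; pos != head; pos = pos->next) {
		struct demo_offload *elem = (struct demo_offload *)pos;

		if (po->priority < elem->priority)
			break;
	}
	/* pos is the first larger entry, or the head itself (append at tail) */
	demo_list_add(&po->list, pos->prev);
}
```

/*
 * Entries with equal priority keep their registration order, because a new
 * entry only moves ahead of entries with a strictly larger priority value.
 */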
|
|
|
|
|
|
|
|
/**
|
|
|
|
* __dev_remove_offload - remove offload handler
|
|
|
|
* @po: packet offload declaration
|
|
|
|
*
|
|
|
|
* Remove a protocol offload handler that was previously added to the
|
|
|
|
* kernel offload handlers by dev_add_offload(). The passed &packet_offload
|
|
|
|
* is removed from the kernel lists and can be freed or reused once this
|
|
|
|
* function returns.
|
|
|
|
*
|
|
|
|
* The offload handler might still be in use by receivers
|
|
|
|
* and must not be freed until after all CPUs have gone
|
|
|
|
* through a quiescent state.
|
|
|
|
*/
|
2013-12-29 22:01:29 +00:00
|
|
|
static void __dev_remove_offload(struct packet_offload *po)
|
2012-11-15 08:49:10 +00:00
|
|
|
{
|
|
|
|
struct list_head *head = &offload_base;
|
|
|
|
struct packet_offload *po1;
|
|
|
|
|
2012-11-16 08:08:23 +00:00
|
|
|
spin_lock(&offload_lock);
|
2012-11-15 08:49:10 +00:00
|
|
|
|
|
|
|
list_for_each_entry(po1, head, list) {
|
|
|
|
if (po == po1) {
|
|
|
|
list_del_rcu(&po->list);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
pr_warn("dev_remove_offload: %p not found\n", po);
|
|
|
|
out:
|
2012-11-16 08:08:23 +00:00
|
|
|
spin_unlock(&offload_lock);
|
2012-11-15 08:49:10 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* dev_remove_offload - remove packet offload handler
|
|
|
|
* @po: packet offload declaration
|
|
|
|
*
|
|
|
|
* Remove a packet offload handler that was previously added to the kernel
|
|
|
|
* offload handlers by dev_add_offload(). The passed &packet_offload is
|
|
|
|
* removed from the kernel lists and can be freed or reused once this
|
|
|
|
* function returns.
|
|
|
|
*
|
|
|
|
* This call sleeps to guarantee that no CPU is looking at the packet
|
|
|
|
* type after return.
|
|
|
|
*/
|
|
|
|
void dev_remove_offload(struct packet_offload *po)
|
|
|
|
{
|
|
|
|
__dev_remove_offload(po);
|
|
|
|
|
|
|
|
synchronize_net();
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(dev_remove_offload);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/******************************************************************************
|
2017-02-09 06:56:06 +00:00
|
|
|
*
|
|
|
|
* Device Boot-time Settings Routines
|
|
|
|
*
|
|
|
|
******************************************************************************/
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/* Boot time configuration table */
|
|
|
|
static struct netdev_boot_setup dev_boot_setup[NETDEV_BOOT_SETUP_MAX];
|
|
|
|
|
|
|
|
/**
|
|
|
|
* netdev_boot_setup_add - add new setup entry
|
|
|
|
* @name: name of the device
|
|
|
|
* @map: configured settings for the device
|
|
|
|
*
|
|
|
|
* Adds new setup entry to the dev_boot_setup list. The function
|
|
|
|
* returns 0 on error and 1 on success. This is a generic routine for
|
|
|
|
* all netdevices.
|
|
|
|
*/
|
|
|
|
static int netdev_boot_setup_add(char *name, struct ifmap *map)
|
|
|
|
{
|
|
|
|
struct netdev_boot_setup *s;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
s = dev_boot_setup;
|
|
|
|
for (i = 0; i < NETDEV_BOOT_SETUP_MAX; i++) {
|
|
|
|
if (s[i].name[0] == '\0' || s[i].name[0] == ' ') {
|
|
|
|
memset(s[i].name, 0, sizeof(s[i].name));
|
2008-07-02 02:57:19 +00:00
|
|
|
strlcpy(s[i].name, name, IFNAMSIZ);
|
2005-04-16 22:20:36 +00:00
|
|
|
memcpy(&s[i].map, map, sizeof(s[i].map));
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return i >= NETDEV_BOOT_SETUP_MAX ? 0 : 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
2017-02-09 06:56:04 +00:00
|
|
|
* netdev_boot_setup_check - check boot time settings
|
|
|
|
* @dev: the netdevice
|
2005-04-16 22:20:36 +00:00
|
|
|
*
|
2017-02-09 06:56:04 +00:00
|
|
|
* Check boot time settings for the device.
|
|
|
|
* The found settings are set for the device to be used
|
|
|
|
* later in the device probing.
|
|
|
|
* Returns 0 if no settings are found, 1 if they are.
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
|
|
|
int netdev_boot_setup_check(struct net_device *dev)
|
|
|
|
{
|
|
|
|
struct netdev_boot_setup *s = dev_boot_setup;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < NETDEV_BOOT_SETUP_MAX; i++) {
|
|
|
|
if (s[i].name[0] != '\0' && s[i].name[0] != ' ' &&
|
2008-07-02 02:57:19 +00:00
|
|
|
!strcmp(dev->name, s[i].name)) {
|
2017-02-09 06:56:04 +00:00
|
|
|
dev->irq = s[i].map.irq;
|
|
|
|
dev->base_addr = s[i].map.base_addr;
|
|
|
|
dev->mem_start = s[i].map.mem_start;
|
|
|
|
dev->mem_end = s[i].map.mem_end;
|
2005-04-16 22:20:36 +00:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(netdev_boot_setup_check);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
|
|
|
|
/**
|
2017-02-09 06:56:04 +00:00
|
|
|
* netdev_boot_base - get address from boot time settings
|
|
|
|
* @prefix: prefix for network device
|
|
|
|
* @unit: id for network device
|
|
|
|
*
|
|
|
|
* Check boot time settings for the base address of device.
|
|
|
|
* The found settings are set for the device to be used
|
|
|
|
* later in the device probing.
|
|
|
|
* Returns 0 if no settings found.
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
|
|
|
unsigned long netdev_boot_base(const char *prefix, int unit)
|
|
|
|
{
|
|
|
|
const struct netdev_boot_setup *s = dev_boot_setup;
|
|
|
|
char name[IFNAMSIZ];
|
|
|
|
int i;
|
|
|
|
|
|
|
|
snprintf(name, sizeof(name), "%s%d", prefix, unit);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the device is already registered, return a base of 1
|
|
|
|
* to indicate not to probe for this interface
|
|
|
|
*/
|
2007-09-17 18:56:21 +00:00
|
|
|
if (__dev_get_by_name(&init_net, name))
|
2005-04-16 22:20:36 +00:00
|
|
|
return 1;
|
|
|
|
|
|
|
|
for (i = 0; i < NETDEV_BOOT_SETUP_MAX; i++)
|
|
|
|
if (!strcmp(name, s[i].name))
|
|
|
|
return s[i].map.base_addr;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Saves settings configured at boot time for any netdevice.
|
|
|
|
*/
|
|
|
|
int __init netdev_boot_setup(char *str)
|
|
|
|
{
|
|
|
|
int ints[5];
|
|
|
|
struct ifmap map;
|
|
|
|
|
|
|
|
str = get_options(str, ARRAY_SIZE(ints), ints);
|
|
|
|
if (!str || !*str)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
/* Save settings */
|
|
|
|
memset(&map, 0, sizeof(map));
|
|
|
|
if (ints[0] > 0)
|
|
|
|
map.irq = ints[1];
|
|
|
|
if (ints[0] > 1)
|
|
|
|
map.base_addr = ints[2];
|
|
|
|
if (ints[0] > 2)
|
|
|
|
map.mem_start = ints[3];
|
|
|
|
if (ints[0] > 3)
|
|
|
|
map.mem_end = ints[4];
|
|
|
|
|
|
|
|
/* Add new entry to the list */
|
|
|
|
return netdev_boot_setup_add(str, &map);
|
|
|
|
}
|
|
|
|
|
|
|
|
__setup("netdev=", netdev_boot_setup);
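/*
 * Illustrative sketch, not part of dev.c: a simplified user-space parser
 * for a "netdev=" style argument of the form
 * <irq>,<io>,<mem_start>,<mem_end>,<name>. It uses strtol() in place of
 * the kernel's get_options(), so corner cases may differ; struct demo_setup
 * and demo_parse_netdev are hypothetical names.
 */

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_IFNAMSIZ 16

struct demo_setup {
	char name[DEMO_IFNAMSIZ];
	long irq, base_addr, mem_start, mem_end;
};

/*
 * Parse up to four comma-separated integers followed by a device name.
 * Returns 1 on success and 0 when no name is left after the integers.
 */
static int demo_parse_netdev(const char *str, struct demo_setup *out)
{
	long vals[4] = { 0, 0, 0, 0 };
	int n = 0;

	assert(str && out);
	while (n < 4) {
		char *end;
		long v = strtol(str, &end, 0);

		if (end == str)		/* no digits: the name starts here */
			break;
		vals[n++] = v;
		str = end;
		if (*str != ',')
			break;
		str++;			/* skip the separator */
	}
	if (*str == '\0')
		return 0;

	/* fields that were not given stay zero, as after memset() above */
	memset(out, 0, sizeof(*out));
	out->irq = vals[0];
	out->base_addr = vals[1];
	out->mem_start = vals[2];
	out->mem_end = vals[3];
	snprintf(out->name, sizeof(out->name), "%s", str);
	return 1;
}
```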
|
|
|
|
|
|
|
|
/*******************************************************************************
|
2017-02-09 06:56:06 +00:00
|
|
|
*
|
|
|
|
* Device Interface Subroutines
|
|
|
|
*
|
|
|
|
*******************************************************************************/
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2015-04-02 15:07:00 +00:00
|
|
|
/**
|
|
|
|
* dev_get_iflink - get 'iflink' value of an interface
|
|
|
|
* @dev: targeted interface
|
|
|
|
*
|
|
|
|
* Indicates the ifindex the interface is linked to.
|
|
|
|
* Physical interfaces have the same 'ifindex' and 'iflink' values.
|
|
|
|
*/
|
|
|
|
|
|
|
|
int dev_get_iflink(const struct net_device *dev)
|
|
|
|
{
|
|
|
|
if (dev->netdev_ops && dev->netdev_ops->ndo_get_iflink)
|
|
|
|
return dev->netdev_ops->ndo_get_iflink(dev);
|
|
|
|
|
2015-04-02 15:07:09 +00:00
|
|
|
return dev->ifindex;
|
2015-04-02 15:07:00 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(dev_get_iflink);
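/*
 * Illustrative sketch, not part of dev.c: the "optional callback with a
 * default" pattern used by dev_get_iflink(), reduced to user-space types.
 * struct demo_netdev, struct demo_ops and demo_vlan_iflink are hypothetical
 * stand-ins, not kernel definitions.
 */

```c
#include <assert.h>
#include <stddef.h>

struct demo_netdev;

struct demo_ops {
	/* optional: may be NULL when the device has no lower link */
	int (*get_iflink)(const struct demo_netdev *dev);
};

struct demo_netdev {
	int ifindex;
	const struct demo_ops *ops;
};

/* Ask the driver if it provides a callback, otherwise report the own index. */
static int demo_get_iflink(const struct demo_netdev *dev)
{
	assert(dev != NULL);
	if (dev->ops && dev->ops->get_iflink)
		return dev->ops->get_iflink(dev);

	return dev->ifindex;
}

/* Hypothetical stacked device that always reports lower device index 2. */
static int demo_vlan_iflink(const struct demo_netdev *dev)
{
	(void)dev;
	return 2;
}
```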
|
|
|
|
|
2015-10-23 01:17:16 +00:00
|
|
|
/**
|
|
|
|
* dev_fill_metadata_dst - Retrieve tunnel egress information.
|
|
|
|
* @dev: targeted interface
|
|
|
|
* @skb: The packet.
|
|
|
|
*
|
|
|
|
* For better visibility of tunnel traffic OVS needs to retrieve
|
|
|
|
* egress tunnel information for a packet. The following API allows
|
|
|
|
* the user to get this info.
|
|
|
|
*/
|
|
|
|
int dev_fill_metadata_dst(struct net_device *dev, struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
struct ip_tunnel_info *info;
|
|
|
|
|
|
|
|
if (!dev->netdev_ops || !dev->netdev_ops->ndo_fill_metadata_dst)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
info = skb_tunnel_info_unclone(skb);
|
|
|
|
if (!info)
|
|
|
|
return -ENOMEM;
|
|
|
|
if (unlikely(!(info->mode & IP_TUNNEL_INFO_TX)))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
return dev->netdev_ops->ndo_fill_metadata_dst(dev, skb);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(dev_fill_metadata_dst);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/**
|
|
|
|
* __dev_get_by_name - find a device by its name
|
2007-10-13 04:17:49 +00:00
|
|
|
* @net: the applicable net namespace
|
2005-04-16 22:20:36 +00:00
|
|
|
* @name: name to find
|
|
|
|
*
|
|
|
|
* Find an interface by name. Must be called under the RTNL semaphore
|
|
|
|
* or @dev_base_lock. If the name is found a pointer to the device
|
|
|
|
* is returned. If the name is not found then %NULL is returned. The
|
|
|
|
* reference counters are not incremented so the caller must be
|
|
|
|
* careful with locks.
|
|
|
|
*/
|
|
|
|
|
2007-09-17 18:56:21 +00:00
|
|
|
struct net_device *__dev_get_by_name(struct net *net, const char *name)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2009-10-30 08:40:11 +00:00
|
|
|
struct net_device *dev;
|
|
|
|
struct hlist_head *head = dev_name_hash(net, name);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
hlist: drop the node parameter from iterators
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-28 01:06:00 +00:00
|
|
|
hlist_for_each_entry(dev, head, name_hlist)
|
2005-04-16 22:20:36 +00:00
|
|
|
if (!strncmp(dev->name, name, IFNAMSIZ))
|
|
|
|
return dev;
|
2009-10-30 08:40:11 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
return NULL;
|
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(__dev_get_by_name);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2009-10-30 07:11:27 +00:00
|
|
|
/**
|
2017-02-09 06:56:04 +00:00
|
|
|
* dev_get_by_name_rcu - find a device by its name
|
|
|
|
* @net: the applicable net namespace
|
|
|
|
* @name: name to find
|
|
|
|
*
|
|
|
|
* Find an interface by name.
|
|
|
|
* If the name is found a pointer to the device is returned.
|
|
|
|
* If the name is not found then %NULL is returned.
|
|
|
|
* The reference counters are not incremented so the caller must be
|
|
|
|
* careful with locks. The caller must hold RCU lock.
|
2009-10-30 07:11:27 +00:00
|
|
|
*/
|
|
|
|
|
|
|
|
struct net_device *dev_get_by_name_rcu(struct net *net, const char *name)
|
|
|
|
{
|
|
|
|
struct net_device *dev;
|
|
|
|
struct hlist_head *head = dev_name_hash(net, name);
|
|
|
|
|
2013-02-28 01:06:00 +00:00
|
|
|
hlist_for_each_entry_rcu(dev, head, name_hlist)
|
2009-10-30 07:11:27 +00:00
|
|
|
if (!strncmp(dev->name, name, IFNAMSIZ))
|
|
|
|
return dev;
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(dev_get_by_name_rcu);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/**
|
|
|
|
* dev_get_by_name - find a device by its name
|
2007-10-13 04:17:49 +00:00
|
|
|
* @net: the applicable net namespace
|
2005-04-16 22:20:36 +00:00
|
|
|
* @name: name to find
|
|
|
|
*
|
|
|
|
* Find an interface by name. This can be called from any
|
|
|
|
* context and does its own locking. The returned handle has
|
|
|
|
* the usage count incremented and the caller must use dev_put() to
|
|
|
|
* release it when it is no longer needed. %NULL is returned if no
|
|
|
|
* matching device is found.
|
|
|
|
*/
|
|
|
|
|
2007-09-17 18:56:21 +00:00
|
|
|
struct net_device *dev_get_by_name(struct net *net, const char *name)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
struct net_device *dev;
|
|
|
|
|
2009-10-30 07:11:27 +00:00
|
|
|
rcu_read_lock();
|
|
|
|
dev = dev_get_by_name_rcu(net, name);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (dev)
|
|
|
|
dev_hold(dev);
|
2009-10-30 07:11:27 +00:00
|
|
|
rcu_read_unlock();
|
2005-04-16 22:20:36 +00:00
|
|
|
return dev;
|
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(dev_get_by_name);
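/*
 * Illustrative sketch, not part of dev.c: the difference between a
 * borrowing lookup (as in __dev_get_by_name() and dev_get_by_name_rcu())
 * and an owning lookup (as in dev_get_by_name()). A singly linked list and
 * a plain integer counter stand in for the name hash and the real reference
 * count, and no locking is modelled; all demo_ names are hypothetical.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_IFNAMSIZ 16

struct demo_dev {
	char name[DEMO_IFNAMSIZ];
	int refcnt;
	struct demo_dev *next;
};

/* Borrowing lookup: no reference is taken, the caller relies on a lock. */
static struct demo_dev *demo_find(struct demo_dev *head, const char *name)
{
	struct demo_dev *dev;

	assert(name != NULL);
	for (dev = head; dev; dev = dev->next)
		if (!strncmp(dev->name, name, DEMO_IFNAMSIZ))
			return dev;

	return NULL;
}

/* Owning lookup: on success the caller must later call demo_put(). */
static struct demo_dev *demo_get(struct demo_dev *head, const char *name)
{
	struct demo_dev *dev;

	/* a real implementation would enter a read-side critical section here */
	dev = demo_find(head, name);
	if (dev)
		dev->refcnt++;
	/* ... and leave it here, after the reference has been taken */
	return dev;
}

static void demo_put(struct demo_dev *dev)
{
	dev->refcnt--;
}
```

/*
 * Taking the reference before leaving the protected region is what keeps
 * the returned pointer valid after the lookup function returns.
 */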
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/**
|
|
|
|
* __dev_get_by_index - find a device by its ifindex
|
2007-10-13 04:17:49 +00:00
|
|
|
* @net: the applicable net namespace
|
2005-04-16 22:20:36 +00:00
|
|
|
* @ifindex: index of device
|
|
|
|
*
|
|
|
|
* Search for an interface by index. Returns %NULL if the device
|
|
|
|
* is not found or a pointer to the device. The device has not
|
|
|
|
* had its reference counter increased so the caller must be careful
|
|
|
|
* about locking. The caller must hold either the RTNL semaphore
|
|
|
|
* or @dev_base_lock.
|
|
|
|
*/
|
|
|
|
|
2007-09-17 18:56:21 +00:00
|
|
|
struct net_device *__dev_get_by_index(struct net *net, int ifindex)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2009-10-30 08:40:11 +00:00
|
|
|
struct net_device *dev;
|
|
|
|
struct hlist_head *head = dev_index_hash(net, ifindex);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2013-02-28 01:06:00 +00:00
|
|
|
hlist_for_each_entry(dev, head, index_hlist)
|
2005-04-16 22:20:36 +00:00
|
|
|
if (dev->ifindex == ifindex)
|
|
|
|
return dev;
|
2009-10-30 08:40:11 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
return NULL;
|
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(__dev_get_by_index);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2009-10-19 19:18:49 +00:00
|
|
|
/**
|
|
|
|
* dev_get_by_index_rcu - find a device by its ifindex
|
|
|
|
* @net: the applicable net namespace
|
|
|
|
* @ifindex: index of device
|
|
|
|
*
|
|
|
|
* Search for an interface by index. Returns %NULL if the device
|
|
|
|
* is not found or a pointer to the device. The device has not
|
|
|
|
* had its reference counter increased so the caller must be careful
|
|
|
|
* about locking. The caller must hold RCU lock.
|
|
|
|
*/
|
|
|
|
|
|
|
|
struct net_device *dev_get_by_index_rcu(struct net *net, int ifindex)
|
|
|
|
{
|
|
|
|
struct net_device *dev;
|
|
|
|
struct hlist_head *head = dev_index_hash(net, ifindex);
|
|
|
|
|
hlist: drop the node parameter from iterators
I'm not sure why, but the hlist for each entry iterators were conceived
list_for_each_entry(pos, head, member)
The hlist ones were greedy and wanted an extra parameter:
hlist_for_each_entry(tpos, pos, head, member)
Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.
The semantic patch which is mostly the work of Peter Senna Tschudin is here:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
<+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
...+>
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foundation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-28 01:06:00 +00:00
|
|
|
hlist_for_each_entry_rcu(dev, head, index_hlist)
|
2009-10-19 19:18:49 +00:00
|
|
|
if (dev->ifindex == ifindex)
|
|
|
|
return dev;
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(dev_get_by_index_rcu);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/**
|
|
|
|
* dev_get_by_index - find a device by its ifindex
|
2007-10-13 04:17:49 +00:00
|
|
|
* @net: the applicable net namespace
|
2005-04-16 22:20:36 +00:00
|
|
|
* @ifindex: index of device
|
|
|
|
*
|
|
|
|
* Search for an interface by index. Returns NULL if the device
|
|
|
|
* is not found or a pointer to the device. The device returned has
|
|
|
|
* had a reference added and the pointer is safe until the user calls
|
|
|
|
* dev_put to indicate they have finished with it.
|
|
|
|
*/
|
|
|
|
|
2007-09-17 18:56:21 +00:00
|
|
|
struct net_device *dev_get_by_index(struct net *net, int ifindex)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
struct net_device *dev;
|
|
|
|
|
2009-10-19 19:18:49 +00:00
|
|
|
rcu_read_lock();
|
|
|
|
dev = dev_get_by_index_rcu(net, ifindex);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (dev)
|
|
|
|
dev_hold(dev);
|
2009-10-19 19:18:49 +00:00
|
|
|
rcu_read_unlock();
|
2005-04-16 22:20:36 +00:00
|
|
|
return dev;
|
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(dev_get_by_index);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2017-05-19 15:52:37 +00:00
|
|
|
/**
|
|
|
|
* dev_get_by_napi_id - find a device by napi_id
|
|
|
|
* @napi_id: ID of the NAPI struct
|
|
|
|
*
|
|
|
|
* Search for an interface by NAPI ID. Returns %NULL if the device
|
|
|
|
* is not found or a pointer to the device. The device has not had
|
|
|
|
* its reference counter increased so the caller must be careful
|
|
|
|
* about locking. The caller must hold RCU lock.
|
|
|
|
*/
|
|
|
|
|
|
|
|
struct net_device *dev_get_by_napi_id(unsigned int napi_id)
|
|
|
|
{
|
|
|
|
struct napi_struct *napi;
|
|
|
|
|
|
|
|
WARN_ON_ONCE(!rcu_read_lock_held());
|
|
|
|
|
|
|
|
if (napi_id < MIN_NAPI_ID)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
napi = napi_by_id(napi_id);
|
|
|
|
|
|
|
|
return napi ? napi->dev : NULL;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(dev_get_by_napi_id);
|
|
|
|
|
2013-06-26 15:23:42 +00:00
|
|
|
/**
|
|
|
|
* netdev_get_name - get a netdevice name, knowing its ifindex.
|
|
|
|
* @net: network namespace
|
|
|
|
* @name: a pointer to the buffer where the name will be stored.
|
|
|
|
* @ifindex: the ifindex of the interface to get the name from.
|
|
|
|
*
|
|
|
|
* The use of raw_seqcount_begin() and cond_resched() before
|
|
|
|
* retrying is required as we want to give the writers a chance
|
|
|
|
* to complete when CONFIG_PREEMPT is not set.
|
|
|
|
*/
|
|
|
|
int netdev_get_name(struct net *net, char *name, int ifindex)
|
|
|
|
{
|
|
|
|
struct net_device *dev;
|
|
|
|
unsigned int seq;
|
|
|
|
|
|
|
|
retry:
|
|
|
|
seq = raw_seqcount_begin(&devnet_rename_seq);
|
|
|
|
rcu_read_lock();
|
|
|
|
dev = dev_get_by_index_rcu(net, ifindex);
|
|
|
|
if (!dev) {
|
|
|
|
rcu_read_unlock();
|
|
|
|
return -ENODEV;
|
|
|
|
}
|
|
|
|
|
|
|
|
strcpy(name, dev->name);
|
|
|
|
rcu_read_unlock();
|
|
|
|
if (read_seqcount_retry(&devnet_rename_seq, seq)) {
|
|
|
|
cond_resched();
|
|
|
|
goto retry;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
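The retry loop in netdev_get_name() follows the usual sequence-counter reader pattern. The sketch below shows that pattern in userspace with C11 atomics. It is an illustration under assumptions, not the kernel's seqcount implementation: `struct named_dev`, `rename_dev` and `read_name` are hypothetical names, and the plain `memcpy` of the payload is only safe in this single-threaded demonstration.

```c
#include <stdatomic.h>
#include <string.h>

#define IFNAMSIZ 16 /* same value as the kernel constant, redefined here */

/* Hypothetical object guarded by a sequence counter. */
struct named_dev {
	atomic_uint seq;      /* even: stable, odd: rename in progress */
	char name[IFNAMSIZ];
};

/* Writer side: bump the counter before and after changing the name. */
static void rename_dev(struct named_dev *d, const char *newname)
{
	atomic_fetch_add(&d->seq, 1);            /* counter becomes odd */
	strncpy(d->name, newname, IFNAMSIZ - 1);
	d->name[IFNAMSIZ - 1] = '\0';
	atomic_fetch_add(&d->seq, 1);            /* counter is even again */
}

/* Reader side: copy the name, retry if a writer ran in the meantime. */
static void read_name(struct named_dev *d, char *out)
{
	unsigned int seq;

	do {
		/* Clearing the low bit forces a retry if a write is in flight. */
		seq = atomic_load(&d->seq) & ~1u;
		memcpy(out, d->name, IFNAMSIZ);
	} while (atomic_load(&d->seq) != seq);
}
```

The reader never blocks the writer; it simply discards a copy that may have been torn by a concurrent rename and tries again, which is why the kernel version calls cond_resched() before retrying.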
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/**
|
2010-12-05 01:23:53 +00:00
|
|
|
* dev_getbyhwaddr_rcu - find a device by its hardware address
|
2007-10-13 04:17:49 +00:00
|
|
|
* @net: the applicable net namespace
|
2005-04-16 22:20:36 +00:00
|
|
|
* @type: media type of device
|
|
|
|
* @ha: hardware address
|
|
|
|
*
|
|
|
|
* Search for an interface by MAC address. Returns NULL if the device
|
2011-01-24 21:16:16 +00:00
|
|
|
* is not found or a pointer to the device.
|
|
|
|
* The caller must hold RCU or RTNL.
|
2010-12-05 01:23:53 +00:00
|
|
|
* The returned device has not had its ref count increased
|
2005-04-16 22:20:36 +00:00
|
|
|
* and the caller must therefore be careful about locking
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
|
2010-12-05 01:23:53 +00:00
|
|
|
struct net_device *dev_getbyhwaddr_rcu(struct net *net, unsigned short type,
|
|
|
|
const char *ha)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
struct net_device *dev;
|
|
|
|
|
2010-12-05 01:23:53 +00:00
|
|
|
for_each_netdev_rcu(net, dev)
|
2005-04-16 22:20:36 +00:00
|
|
|
if (dev->type == type &&
|
|
|
|
!memcmp(dev->dev_addr, ha, dev->addr_len))
|
2007-05-03 22:13:45 +00:00
|
|
|
return dev;
|
|
|
|
|
|
|
|
return NULL;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2010-12-05 01:23:53 +00:00
|
|
|
EXPORT_SYMBOL(dev_getbyhwaddr_rcu);
|
2005-09-22 07:44:55 +00:00
|
|
|
|
2007-09-17 18:56:21 +00:00
|
|
|
struct net_device *__dev_getfirstbyhwtype(struct net *net, unsigned short type)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
struct net_device *dev;
|
|
|
|
|
2007-05-03 10:28:13 +00:00
|
|
|
ASSERT_RTNL();
|
2007-09-17 18:56:21 +00:00
|
|
|
for_each_netdev(net, dev)
|
2007-05-03 10:28:13 +00:00
|
|
|
if (dev->type == type)
|
2007-05-03 22:13:45 +00:00
|
|
|
return dev;
|
|
|
|
|
|
|
|
return NULL;
|
2007-05-03 10:28:13 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(__dev_getfirstbyhwtype);
|
|
|
|
|
2007-09-17 18:56:21 +00:00
|
|
|
struct net_device *dev_getfirstbyhwtype(struct net *net, unsigned short type)
|
2007-05-03 10:28:13 +00:00
|
|
|
{
|
2010-03-18 11:27:25 +00:00
|
|
|
struct net_device *dev, *ret = NULL;
|
2007-05-03 10:28:13 +00:00
|
|
|
|
2010-03-18 11:27:25 +00:00
|
|
|
rcu_read_lock();
|
|
|
|
for_each_netdev_rcu(net, dev)
|
|
|
|
if (dev->type == type) {
|
|
|
|
dev_hold(dev);
|
|
|
|
ret = dev;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
return ret;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(dev_getfirstbyhwtype);
|
|
|
|
|
|
|
|
/**
|
2014-09-11 22:35:09 +00:00
|
|
|
* __dev_get_by_flags - find any device with given flags
|
2007-10-13 04:17:49 +00:00
|
|
|
* @net: the applicable net namespace
|
2005-04-16 22:20:36 +00:00
|
|
|
* @if_flags: IFF_* values
|
|
|
|
* @mask: bitmask of bits in if_flags to check
|
|
|
|
*
|
|
|
|
* Search for any interface with the given flags. Returns NULL if a device
|
2010-06-07 11:42:13 +00:00
|
|
|
* is not found or a pointer to the device. Must be called inside
|
2014-09-11 22:35:09 +00:00
|
|
|
* rtnl_lock(), and result refcount is unchanged.
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
|
|
|
|
2014-09-11 22:35:09 +00:00
|
|
|
struct net_device *__dev_get_by_flags(struct net *net, unsigned short if_flags,
|
|
|
|
unsigned short mask)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2007-05-03 22:13:45 +00:00
|
|
|
struct net_device *dev, *ret;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2014-09-11 22:35:09 +00:00
|
|
|
ASSERT_RTNL();
|
|
|
|
|
2007-05-03 22:13:45 +00:00
|
|
|
ret = NULL;
|
2014-09-11 22:35:09 +00:00
|
|
|
for_each_netdev(net, dev) {
|
2005-04-16 22:20:36 +00:00
|
|
|
if (((dev->flags ^ if_flags) & mask) == 0) {
|
2007-05-03 22:13:45 +00:00
|
|
|
ret = dev;
|
2005-04-16 22:20:36 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2007-05-03 22:13:45 +00:00
|
|
|
return ret;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2014-09-11 22:35:09 +00:00
|
|
|
EXPORT_SYMBOL(__dev_get_by_flags);
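The test `((dev->flags ^ if_flags) & mask) == 0` in __dev_get_by_flags() is compact enough to be worth spelling out. A small standalone sketch follows; the helper name `flags_match` is made up for illustration.

```c
#include <stdbool.h>

/*
 * XOR leaves a 1 in every bit position where the two values differ.
 * Masking keeps only the positions the caller cares about, so the result
 * is zero exactly when flags and if_flags agree on every bit set in mask.
 */
static bool flags_match(unsigned short flags, unsigned short if_flags,
			unsigned short mask)
{
	return ((flags ^ if_flags) & mask) == 0;
}
```

With `mask` set to a single bit and `if_flags` set to zero, the same expression selects devices that have that bit cleared, so one function covers both "flag set" and "flag clear" queries.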
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/**
|
|
|
|
* dev_valid_name - check if name is okay for network device
|
|
|
|
* @name: name string
|
|
|
|
*
|
|
|
|
* Network device names need to be valid file names
|
2006-08-15 23:34:13 +00:00
|
|
|
* to allow sysfs to work. We also disallow any kind of
|
|
|
|
* whitespace.
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
2012-03-06 21:12:15 +00:00
|
|
|
bool dev_valid_name(const char *name)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2006-08-15 23:34:13 +00:00
|
|
|
if (*name == '\0')
|
2012-03-06 21:12:15 +00:00
|
|
|
return false;
|
2006-08-30 00:06:13 +00:00
|
|
|
if (strlen(name) >= IFNAMSIZ)
|
2012-03-06 21:12:15 +00:00
|
|
|
return false;
|
2006-08-15 23:34:13 +00:00
|
|
|
if (!strcmp(name, ".") || !strcmp(name, ".."))
|
2012-03-06 21:12:15 +00:00
|
|
|
return false;
|
2006-08-15 23:34:13 +00:00
|
|
|
|
|
|
|
while (*name) {
|
2015-02-18 00:31:57 +00:00
|
|
|
if (*name == '/' || *name == ':' || isspace(*name))
|
2012-03-06 21:12:15 +00:00
|
|
|
return false;
|
2006-08-15 23:34:13 +00:00
|
|
|
name++;
|
|
|
|
}
|
2012-03-06 21:12:15 +00:00
|
|
|
return true;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(dev_valid_name);
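The same checks can be exercised outside the kernel. The following userspace sketch mirrors the logic of dev_valid_name(); `valid_ifname` is a name chosen for this example, and `IFNAMSIZ` is redefined locally with the value 16 rather than pulled from a system header.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

#define IFNAMSIZ 16 /* includes the terminating NUL */

/* Userspace mirror of the validity rules shown above. */
static bool valid_ifname(const char *name)
{
	if (*name == '\0')
		return false;                    /* empty name */
	if (strlen(name) >= IFNAMSIZ)
		return false;                    /* no room for the NUL */
	if (!strcmp(name, ".") || !strcmp(name, ".."))
		return false;                    /* reserved directory entries */

	for (; *name; name++) {
		/* '/' breaks sysfs paths, ':' is used for alias syntax. */
		if (*name == '/' || *name == ':' ||
		    isspace((unsigned char)*name))
			return false;
	}
	return true;
}
```

The cast to `unsigned char` is needed in userspace because passing a negative `char` to `isspace()` is undefined behavior in standard C.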
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/**
|
2007-09-12 11:48:45 +00:00
|
|
|
* __dev_alloc_name - allocate a name for a device
|
|
|
|
* @net: network namespace to allocate the device name in
|
2005-04-16 22:20:36 +00:00
|
|
|
* @name: name format string
|
2007-09-12 11:48:45 +00:00
|
|
|
* @buf: scratch buffer and result name string
|
2005-04-16 22:20:36 +00:00
|
|
|
*
|
|
|
|
* Passed a format string - eg "lt%d" it will try and find a suitable
|
2006-05-26 20:25:24 +00:00
|
|
|
* id. It scans list of devices to build up a free map, then chooses
|
|
|
|
* the first empty slot. The caller must hold the dev_base or rtnl lock
|
|
|
|
* while allocating the name and adding the device in order to avoid
|
|
|
|
* duplicates.
|
|
|
|
* Limited to bits_per_byte * page size devices (ie 32K on most platforms).
|
|
|
|
* Returns the number of the unit assigned or a negative errno code.
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
|
|
|
|
2007-09-12 11:48:45 +00:00
|
|
|
static int __dev_alloc_name(struct net *net, const char *name, char *buf)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
int i = 0;
|
|
|
|
const char *p;
|
|
|
|
const int max_netdevices = 8*PAGE_SIZE;
|
2007-10-09 08:59:42 +00:00
|
|
|
unsigned long *inuse;
|
2005-04-16 22:20:36 +00:00
|
|
|
struct net_device *d;
|
|
|
|
|
|
|
|
p = strnchr(name, IFNAMSIZ-1, '%');
|
|
|
|
if (p) {
|
|
|
|
/*
|
|
|
|
* Verify the string as this thing may have come from
|
|
|
|
* the user. There must be exactly one "%d" and no other "%"
|
|
|
|
* characters.
|
|
|
|
*/
|
|
|
|
if (p[1] != 'd' || strchr(p + 2, '%'))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
/* Use one page as a bit array of possible slots */
|
2007-10-09 08:59:42 +00:00
|
|
|
inuse = (unsigned long *) get_zeroed_page(GFP_ATOMIC);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (!inuse)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2007-09-17 18:56:21 +00:00
|
|
|
for_each_netdev(net, d) {
|
2005-04-16 22:20:36 +00:00
|
|
|
if (!sscanf(d->name, name, &i))
|
|
|
|
continue;
|
|
|
|
if (i < 0 || i >= max_netdevices)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* avoid cases where sscanf is not exact inverse of printf */
|
2007-09-12 11:48:45 +00:00
|
|
|
snprintf(buf, IFNAMSIZ, name, i);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (!strncmp(buf, d->name, IFNAMSIZ))
|
|
|
|
set_bit(i, inuse);
|
|
|
|
}
|
|
|
|
|
|
|
|
i = find_first_zero_bit(inuse, max_netdevices);
|
|
|
|
free_page((unsigned long) inuse);
|
|
|
|
}
|
|
|
|
|
2009-11-18 02:36:59 +00:00
|
|
|
if (buf != name)
|
|
|
|
snprintf(buf, IFNAMSIZ, name, i);
|
2007-09-12 11:48:45 +00:00
|
|
|
if (!__dev_get_by_name(net, buf))
|
2005-04-16 22:20:36 +00:00
|
|
|
return i;
|
|
|
|
|
|
|
|
/* It is possible to run out of possible slots
|
|
|
|
* when the name is long and there isn't enough space left
|
|
|
|
* for the digits, or if all bits are used.
|
|
|
|
*/
|
|
|
|
return -ENFILE;
|
|
|
|
}
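The free-slot search in __dev_alloc_name() can be sketched in userspace as follows. This is a simplified illustration, not the kernel routine: the device list is a plain array of strings, the bitmap is a small byte array capped at a made-up `MAX_UNITS` instead of one page of bits, a template without "%d" is rejected rather than used verbatim, and `alloc_name` is a hypothetical name.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define IFNAMSIZ  16
#define MAX_UNITS 64 /* the kernel uses 8 * PAGE_SIZE slots */

/*
 * Pick the lowest unit number not used by any name in existing[] that
 * matches the "%d" template fmt. Writes the chosen name to buf and
 * returns the unit number, or a negative errno value.
 */
static int alloc_name(const char *const *existing, size_t n,
		      const char *fmt, char *buf)
{
	unsigned char inuse[MAX_UNITS] = { 0 };
	const char *p = strchr(fmt, '%');
	size_t k;
	int i = 0;

	/* Exactly one "%d" and no other '%' characters are allowed. */
	if (!p || p[1] != 'd' || strchr(p + 2, '%'))
		return -EINVAL;

	for (k = 0; k < n; k++) {
		if (sscanf(existing[k], fmt, &i) != 1)
			continue;
		if (i < 0 || i >= MAX_UNITS)
			continue;
		/* sscanf is not an exact inverse of printf ("eth01"). */
		snprintf(buf, IFNAMSIZ, fmt, i);
		if (!strncmp(buf, existing[k], IFNAMSIZ))
			inuse[i] = 1;
	}

	for (i = 0; i < MAX_UNITS; i++)
		if (!inuse[i])
			break;
	if (i == MAX_UNITS)
		return -ENFILE;

	snprintf(buf, IFNAMSIZ, fmt, i);
	return i;
}
```

The round trip through `snprintf` is what keeps a name such as "eth01" from marking slot 1 as taken: it parses as unit 1 but does not print back to the same string.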
|
|
|
|
|
2007-09-12 11:48:45 +00:00
|
|
|
/**
|
|
|
|
* dev_alloc_name - allocate a name for a device
|
|
|
|
* @dev: device
|
|
|
|
* @name: name format string
|
|
|
|
*
|
|
|
|
* Passed a format string - eg "lt%d" it will try and find a suitable
|
|
|
|
* id. It scans list of devices to build up a free map, then chooses
|
|
|
|
* the first empty slot. The caller must hold the dev_base or rtnl lock
|
|
|
|
* while allocating the name and adding the device in order to avoid
|
|
|
|
* duplicates.
|
|
|
|
* Limited to bits_per_byte * page size devices (ie 32K on most platforms).
|
|
|
|
* Returns the number of the unit assigned or a negative errno code.
|
|
|
|
*/
|
|
|
|
|
|
|
|
int dev_alloc_name(struct net_device *dev, const char *name)
|
|
|
|
{
|
|
|
|
char buf[IFNAMSIZ];
|
|
|
|
struct net *net;
|
|
|
|
int ret;
|
|
|
|
|
2008-03-25 12:47:49 +00:00
|
|
|
BUG_ON(!dev_net(dev));
|
|
|
|
net = dev_net(dev);
|
2007-09-12 11:48:45 +00:00
|
|
|
ret = __dev_alloc_name(net, name, buf);
|
|
|
|
if (ret >= 0)
|
|
|
|
strlcpy(dev->name, buf, IFNAMSIZ);
|
|
|
|
return ret;
|
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(dev_alloc_name);
|
2007-09-12 11:48:45 +00:00
|
|
|
|
2012-09-13 20:58:27 +00:00
|
|
|
static int dev_alloc_name_ns(struct net *net,
|
|
|
|
struct net_device *dev,
|
|
|
|
const char *name)
|
2009-11-18 02:36:59 +00:00
|
|
|
{
|
2012-09-13 20:58:27 +00:00
|
|
|
char buf[IFNAMSIZ];
|
|
|
|
int ret;
|
2010-05-19 10:12:19 +00:00
|
|
|
|
2012-09-13 20:58:27 +00:00
|
|
|
ret = __dev_alloc_name(net, name, buf);
|
|
|
|
if (ret >= 0)
|
|
|
|
strlcpy(dev->name, buf, IFNAMSIZ);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int dev_get_valid_name(struct net *net,
|
|
|
|
struct net_device *dev,
|
|
|
|
const char *name)
|
|
|
|
{
|
|
|
|
BUG_ON(!net);
|
2010-05-19 10:12:19 +00:00
|
|
|
|
2009-11-18 02:36:59 +00:00
|
|
|
if (!dev_valid_name(name))
|
|
|
|
return -EINVAL;
|
|
|
|
|
2011-04-30 01:21:32 +00:00
|
|
|
if (strchr(name, '%'))
|
2012-09-13 20:58:27 +00:00
|
|
|
return dev_alloc_name_ns(net, dev, name);
|
2009-11-18 02:36:59 +00:00
|
|
|
else if (__dev_get_by_name(net, name))
|
|
|
|
return -EEXIST;
|
2010-05-19 10:12:19 +00:00
|
|
|
else if (dev->name != name)
|
|
|
|
strlcpy(dev->name, name, IFNAMSIZ);
|
2009-11-18 02:36:59 +00:00
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/**
|
|
|
|
* dev_change_name - change name of a device
|
|
|
|
* @dev: device
|
|
|
|
* @newname: name (or format string) must be at least IFNAMSIZ
|
|
|
|
*
|
|
|
|
* Change name of a device, can pass format strings "eth%d"
|
|
|
|
* for wildcarding.
|
|
|
|
*/
|
2008-09-30 09:22:14 +00:00
|
|
|
int dev_change_name(struct net_device *dev, const char *newname)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2014-07-14 14:37:23 +00:00
|
|
|
unsigned char old_assign_type;
|
2007-07-31 00:03:38 +00:00
|
|
|
char oldname[IFNAMSIZ];
|
2005-04-16 22:20:36 +00:00
|
|
|
int err = 0;
|
2007-07-31 00:03:38 +00:00
|
|
|
int ret;
|
2007-09-17 18:56:21 +00:00
|
|
|
struct net *net;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
ASSERT_RTNL();
|
2008-03-25 12:47:49 +00:00
|
|
|
BUG_ON(!dev_net(dev));
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2008-03-25 12:47:49 +00:00
|
|
|
net = dev_net(dev);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (dev->flags & IFF_UP)
|
|
|
|
return -EBUSY;
|
|
|
|
|
2012-12-20 17:25:08 +00:00
|
|
|
write_seqcount_begin(&devnet_rename_seq);
|
2012-11-26 05:21:08 +00:00
|
|
|
|
|
|
|
if (strncmp(newname, dev->name, IFNAMSIZ) == 0) {
|
2012-12-20 17:25:08 +00:00
|
|
|
write_seqcount_end(&devnet_rename_seq);
|
2007-10-26 10:53:42 +00:00
|
|
|
return 0;
|
2012-11-26 05:21:08 +00:00
|
|
|
}
|
2007-10-26 10:53:42 +00:00
|
|
|
|
2007-07-31 00:03:38 +00:00
|
|
|
memcpy(oldname, dev->name, IFNAMSIZ);
|
|
|
|
|
2012-09-13 20:58:27 +00:00
|
|
|
err = dev_get_valid_name(net, dev, newname);
|
2012-11-26 05:21:08 +00:00
|
|
|
if (err < 0) {
|
2012-12-20 17:25:08 +00:00
|
|
|
write_seqcount_end(&devnet_rename_seq);
|
2009-11-18 02:36:59 +00:00
|
|
|
return err;
|
2012-11-26 05:21:08 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2014-07-17 18:33:32 +00:00
|
|
|
if (oldname[0] && !strchr(oldname, '%'))
|
|
|
|
netdev_info(dev, "renamed from %s\n", oldname);
|
|
|
|
|
2014-07-14 14:37:23 +00:00
|
|
|
old_assign_type = dev->name_assign_type;
|
|
|
|
dev->name_assign_type = NET_NAME_RENAMED;
|
|
|
|
|
2007-07-31 00:03:38 +00:00
|
|
|
rollback:
|
2010-05-05 00:36:49 +00:00
|
|
|
ret = device_rename(&dev->dev, dev->name);
|
|
|
|
if (ret) {
|
|
|
|
memcpy(dev->name, oldname, IFNAMSIZ);
|
2014-07-14 14:37:23 +00:00
|
|
|
dev->name_assign_type = old_assign_type;
|
2012-12-20 17:25:08 +00:00
|
|
|
write_seqcount_end(&devnet_rename_seq);
|
2010-05-05 00:36:49 +00:00
|
|
|
return ret;
|
2008-05-15 05:33:38 +00:00
|
|
|
}
|
2007-07-30 23:35:46 +00:00
|
|
|
|
2012-12-20 17:25:08 +00:00
|
|
|
write_seqcount_end(&devnet_rename_seq);
|
2012-11-26 05:21:08 +00:00
|
|
|
|
2014-01-14 20:58:51 +00:00
|
|
|
netdev_adjacent_rename_links(dev, oldname);
|
|
|
|
|
2007-07-30 23:35:46 +00:00
|
|
|
write_lock_bh(&dev_base_lock);
|
2011-05-17 17:56:59 +00:00
|
|
|
hlist_del_rcu(&dev->name_hlist);
|
2009-10-30 07:11:27 +00:00
|
|
|
write_unlock_bh(&dev_base_lock);
|
|
|
|
|
|
|
|
synchronize_rcu();
|
|
|
|
|
|
|
|
write_lock_bh(&dev_base_lock);
|
|
|
|
hlist_add_head_rcu(&dev->name_hlist, dev_name_hash(net, dev->name));
|
2007-07-30 23:35:46 +00:00
|
|
|
write_unlock_bh(&dev_base_lock);
|
|
|
|
|
2007-09-16 22:42:43 +00:00
|
|
|
ret = call_netdevice_notifiers(NETDEV_CHANGENAME, dev);
|
2007-07-31 00:03:38 +00:00
|
|
|
ret = notifier_to_errno(ret);
|
|
|
|
|
|
|
|
if (ret) {
|
2009-11-15 23:30:24 +00:00
|
|
|
/* err >= 0 after dev_alloc_name() or stores the first errno */
|
|
|
|
if (err >= 0) {
|
2007-07-31 00:03:38 +00:00
|
|
|
err = ret;
|
2012-12-20 17:25:08 +00:00
|
|
|
write_seqcount_begin(&devnet_rename_seq);
|
2007-07-31 00:03:38 +00:00
|
|
|
memcpy(dev->name, oldname, IFNAMSIZ);
|
2014-01-14 20:58:51 +00:00
|
|
|
memcpy(oldname, newname, IFNAMSIZ);
|
2014-07-14 14:37:23 +00:00
|
|
|
dev->name_assign_type = old_assign_type;
|
|
|
|
old_assign_type = NET_NAME_RENAMED;
|
2007-07-31 00:03:38 +00:00
|
|
|
goto rollback;
|
2009-11-15 23:30:24 +00:00
|
|
|
} else {
|
2012-02-01 10:54:43 +00:00
|
|
|
pr_err("%s: name change rollback failed: %d\n",
|
2009-11-15 23:30:24 +00:00
|
|
|
dev->name, ret);
|
2007-07-31 00:03:38 +00:00
|
|
|
}
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2008-09-23 04:28:11 +00:00
|
|
|
/**
|
|
|
|
* dev_set_alias - change ifalias of a device
|
|
|
|
* @dev: device
|
|
|
|
* @alias: name up to IFALIASZ
|
2008-09-30 09:23:58 +00:00
|
|
|
* @len: limit of bytes to copy from info
|
2008-09-23 04:28:11 +00:00
|
|
|
*
|
|
|
|
* Set ifalias for a device.
|
|
|
|
*/
|
|
|
|
int dev_set_alias(struct net_device *dev, const char *alias, size_t len)
|
|
|
|
{
|
2017-10-02 21:50:05 +00:00
|
|
|
struct dev_ifalias *new_alias = NULL;
|
2008-09-23 04:28:11 +00:00
|
|
|
|
|
|
|
if (len >= IFALIASZ)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2017-10-02 21:50:05 +00:00
|
|
|
if (len) {
|
|
|
|
new_alias = kmalloc(sizeof(*new_alias) + len + 1, GFP_KERNEL);
|
|
|
|
if (!new_alias)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
memcpy(new_alias->ifalias, alias, len);
|
|
|
|
new_alias->ifalias[len] = 0;
|
2008-09-24 04:23:19 +00:00
|
|
|
}
|
|
|
|
|
2017-10-02 21:50:05 +00:00
|
|
|
mutex_lock(&ifalias_mutex);
|
|
|
|
rcu_swap_protected(dev->ifalias, new_alias,
|
|
|
|
mutex_is_locked(&ifalias_mutex));
|
|
|
|
mutex_unlock(&ifalias_mutex);
|
|
|
|
|
|
|
|
if (new_alias)
|
|
|
|
kfree_rcu(new_alias, rcuhead);
|
2008-09-23 04:28:11 +00:00
|
|
|
|
|
|
|
return len;
|
|
|
|
}
|
|
|
|
|
2017-10-02 21:50:05 +00:00
|
|
|
/**
|
|
|
|
* dev_get_alias - get ifalias of a device
|
|
|
|
* @dev: device
|
|
|
|
* @name: buffer to store name of ifalias
|
|
|
|
* @len: size of buffer
|
|
|
|
*
|
|
|
|
* Get ifalias for a device. Caller must make sure dev cannot go
|
|
|
|
* away, e.g. rcu read lock or own a reference count to device.
|
|
|
|
*/
|
|
|
|
int dev_get_alias(const struct net_device *dev, char *name, size_t len)
|
|
|
|
{
|
|
|
|
const struct dev_ifalias *alias;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
alias = rcu_dereference(dev->ifalias);
|
|
|
|
if (alias)
|
|
|
|
ret = snprintf(name, len, "%s", alias->ifalias);
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
2008-09-23 04:28:11 +00:00
|
|
|
|
2005-05-29 21:13:47 +00:00
|
|
|
/**
|
2006-05-26 20:25:24 +00:00
|
|
|
* netdev_features_change - device changes features
|
2005-05-29 21:13:47 +00:00
|
|
|
* @dev: device to cause notification
|
|
|
|
*
|
|
|
|
* Called to indicate a device has changed features.
|
|
|
|
*/
|
|
|
|
void netdev_features_change(struct net_device *dev)
|
|
|
|
{
|
2007-09-16 22:42:43 +00:00
|
|
|
call_netdevice_notifiers(NETDEV_FEAT_CHANGE, dev);
|
2005-05-29 21:13:47 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_features_change);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/**
|
|
|
|
* netdev_state_change - device changes state
|
|
|
|
* @dev: device to cause notification
|
|
|
|
*
|
|
|
|
* Called to indicate a device has changed state. This function calls
|
|
|
|
* the notifier chains for netdev_chain and sends a NEWLINK message
|
|
|
|
* to the routing socket.
|
|
|
|
*/
|
|
|
|
void netdev_state_change(struct net_device *dev)
|
|
|
|
{
|
|
|
|
if (dev->flags & IFF_UP) {
|
2014-07-02 04:39:43 +00:00
|
|
|
struct netdev_notifier_change_info change_info;
|
|
|
|
|
|
|
|
change_info.flags_changed = 0;
|
|
|
|
call_netdevice_notifiers_info(NETDEV_CHANGE, dev,
|
|
|
|
&change_info.info);
|
2013-10-23 23:02:42 +00:00
|
|
|
rtmsg_ifinfo(RTM_NEWLINK, dev, 0, GFP_KERNEL);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(netdev_state_change);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2012-08-09 22:14:56 +00:00
|
|
|
/**
|
2017-02-09 06:56:04 +00:00
|
|
|
* netdev_notify_peers - notify network peers about existence of @dev
|
|
|
|
* @dev: network device
|
2012-08-09 22:14:56 +00:00
|
|
|
*
|
|
|
|
* Generate traffic such that interested network peers are aware of
|
|
|
|
* @dev, such as by generating a gratuitous ARP. This may be used when
|
|
|
|
* a device wants to inform the rest of the network about some sort of
|
|
|
|
* reconfiguration such as a failover event or virtual machine
|
|
|
|
* migration.
|
|
|
|
*/
|
|
|
|
void netdev_notify_peers(struct net_device *dev)
|
2008-06-14 01:12:00 +00:00
|
|
|
{
|
2012-08-09 22:14:56 +00:00
|
|
|
rtnl_lock();
|
|
|
|
call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, dev);
|
2017-03-14 12:58:08 +00:00
|
|
|
call_netdevice_notifiers(NETDEV_RESEND_IGMP, dev);
|
2012-08-09 22:14:56 +00:00
|
|
|
rtnl_unlock();
|
2008-06-14 01:12:00 +00:00
|
|
|
}
|
2012-08-09 22:14:56 +00:00
|
|
|
EXPORT_SYMBOL(netdev_notify_peers);
|
2008-06-14 01:12:00 +00:00
|
|
|
|
2010-02-26 06:34:53 +00:00
|
|
|
static int __dev_open(struct net_device *dev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2008-11-20 05:32:24 +00:00
|
|
|
const struct net_device_ops *ops = dev->netdev_ops;
|
2009-05-29 23:39:53 +00:00
|
|
|
int ret;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2008-05-08 09:53:17 +00:00
|
|
|
ASSERT_RTNL();
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
if (!netif_device_present(dev))
|
|
|
|
return -ENODEV;
|
|
|
|
|
netpoll: protect napi_poll and poll_controller during dev_[open|close]
Ivan Vecera was recently backporting commit
9c13cb8bb477a83b9a3c9e5a5478a4e21294a760 to a RHEL kernel, and I noticed that,
while this patch protects the tg3 driver from having its ndo_poll_controller
routine called during device initialization, it does nothing for the driver
during shutdown. I.e. it would be entirely possible to have the
ndo_poll_controller method (or subsequently the ndo_poll routine) called for a
driver in the netpoll path on CPU A while in parallel on CPU B, the ndo_close or
ndo_open routine could be called. Given that the two latter routines tend to
initialize and free many data structures that the former two rely on, the result
can easily be data corruption or various other crashes. Furthermore, it seems
that this is potentially a problem with all net drivers that support netpoll,
and so this should ideally be fixed in a common path.
As Ben H pointed out to me, we can't perform dev_open/dev_close in atomic
context, so I've come up with this solution. We can use a mutex to sleep in
open/close paths and just do a mutex_trylock in the napi poll path and abandon
the poll attempt if we're locked, as we'll just retry the poll on the next send
anyway.
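The "try the lock, skip the poll if it is held" idea can be sketched in userspace. The example below swaps the kernel mutex for a C11 `atomic_flag`, so only the non-blocking poll side is modelled and the sleeping open/close side is not. `struct poll_gate` and all function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical gate shared by the open/close path and the poll path. */
struct poll_gate {
	atomic_flag busy;   /* set while open/close reconfigures the device */
	int polls_done;
	int polls_skipped;
};

/* Non-blocking acquire: true if the gate was free and is now held. */
static bool gate_trylock(struct poll_gate *g)
{
	return !atomic_flag_test_and_set(&g->busy);
}

static void gate_unlock(struct poll_gate *g)
{
	atomic_flag_clear(&g->busy);
}

/* Poll path: never waits; abandons the attempt if the gate is held. */
static bool try_poll(struct poll_gate *g)
{
	if (!gate_trylock(g)) {
		g->polls_skipped++;   /* the next send will retry the poll */
		return false;
	}
	g->polls_done++;          /* device structures are safe to touch */
	gate_unlock(g);
	return true;
}
```

Skipping is acceptable here for the reason the message gives: a missed poll costs nothing permanent because the next transmit attempt polls again.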
I've tested this here by flooding netconsole with messages on a system whose NIC
driver I modified to periodically return NETDEV_TX_BUSY, so that the netpoll tx
workqueue would be forced to send frames and poll the device. While this was
going on I rapidly ifdown/up'ed the interface and watched for any problems.
I've not found any.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Ivan Vecera <ivecera@redhat.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Ben Hutchings <bhutchings@solarflare.com>
CC: Francois Romieu <romieu@fr.zoreil.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-05 08:05:43 +00:00
|
|
|
/* Block netpoll from trying to do any rx path servicing.
|
|
|
|
* If we don't do this there is a chance ndo_poll_controller
|
|
|
|
* or ndo_poll may be running while we open the device
|
|
|
|
*/
|
2014-03-27 22:39:03 +00:00
|
|
|
netpoll_poll_disable(dev);
|
2013-02-05 08:05:43 +00:00
|
|
|
|
2009-05-29 23:39:53 +00:00
|
|
|
ret = call_netdevice_notifiers(NETDEV_PRE_UP, dev);
|
|
|
|
ret = notifier_to_errno(ret);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
set_bit(__LINK_STATE_START, &dev->state);
|
2007-10-24 03:19:37 +00:00
|
|
|
|
2008-11-20 05:32:24 +00:00
|
|
|
if (ops->ndo_validate_addr)
|
|
|
|
ret = ops->ndo_validate_addr(dev);
|
2007-10-24 03:19:37 +00:00
|
|
|
|
2008-11-20 05:32:24 +00:00
|
|
|
if (!ret && ops->ndo_open)
|
|
|
|
ret = ops->ndo_open(dev);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2014-03-27 22:39:03 +00:00
|
|
|
netpoll_poll_enable(dev);
|
2013-02-05 08:05:43 +00:00
|
|
|
|
2007-10-24 03:19:37 +00:00
|
|
|
if (ret)
|
|
|
|
clear_bit(__LINK_STATE_START, &dev->state);
|
|
|
|
else {
|
2005-04-16 22:20:36 +00:00
|
|
|
dev->flags |= IFF_UP;
|
2007-06-27 08:28:10 +00:00
|
|
|
dev_set_rx_mode(dev);
|
2005-04-16 22:20:36 +00:00
|
|
|
dev_activate(dev);
|
2012-07-05 01:23:25 +00:00
|
|
|
add_device_randomness(dev->dev_addr, dev->addr_len);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2007-10-24 03:19:37 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
                           | /**
2010-02-26 06:34:53 +00:00 |  *	dev_open - prepare an interface for use.
                           |  *	@dev: device to open
2005-04-16 22:20:36 +00:00 |  *
2010-02-26 06:34:53 +00:00 |  *	Takes a device from down to up state. The device's private open
                           |  *	function is invoked and then the multicast lists are loaded. Finally
                           |  *	the device is moved into the up state and a %NETDEV_UP message is
                           |  *	sent to the netdev notifier chain.
                           |  *
                           |  *	Calling this function on an active interface is a nop. On a failure
                           |  *	a negative errno code is returned.
2005-04-16 22:20:36 +00:00 |  */
2010-02-26 06:34:53 +00:00 | int dev_open(struct net_device *dev)
                           | {
                           | 	int ret;
                           |
                           | 	if (dev->flags & IFF_UP)
                           | 		return 0;
                           |
                           | 	ret = __dev_open(dev);
                           | 	if (ret < 0)
                           | 		return ret;
                           |
2013-10-23 23:02:42 +00:00 | 	rtmsg_ifinfo(RTM_NEWLINK, dev, IFF_UP|IFF_RUNNING, GFP_KERNEL);
2010-02-26 06:34:53 +00:00 | 	call_netdevice_notifiers(NETDEV_UP, dev);
                           |
                           | 	return ret;
                           | }
                           | EXPORT_SYMBOL(dev_open);
                           |
2017-07-18 22:59:27 +00:00 | static void __dev_close_many(struct list_head *head)
2005-04-16 22:20:36 +00:00 | {
2010-12-13 12:44:07 +00:00 | 	struct net_device *dev;
2008-05-08 09:53:17 +00:00 |
2010-02-26 06:34:53 +00:00 | 	ASSERT_RTNL();
2007-09-12 12:33:25 +00:00 | 	might_sleep();
                           |
2013-10-06 02:26:05 +00:00 | 	list_for_each_entry(dev, head, close_list) {
2014-03-27 22:38:17 +00:00 | 		/* Temporarily disable netpoll until the interface is down */
2014-03-27 22:39:03 +00:00 | 		netpoll_poll_disable(dev);
2014-03-27 22:38:17 +00:00 |
2010-12-13 12:44:07 +00:00 | 		call_netdevice_notifiers(NETDEV_GOING_DOWN, dev);
2005-04-16 22:20:36 +00:00 |
2010-12-13 12:44:07 +00:00 | 		clear_bit(__LINK_STATE_START, &dev->state);
2005-04-16 22:20:36 +00:00 |
2010-12-13 12:44:07 +00:00 | 		/* Synchronize to scheduled poll. We cannot touch poll list, it
                           | 		 * can be even on different cpu. So just clear netif_running().
                           | 		 *
                           | 		 * dev->stop() will invoke napi_disable() on all of its
                           | 		 * napi_struct instances on this device.
                           | 		 */
2014-03-17 17:06:10 +00:00 | 		smp_mb__after_atomic(); /* Commit netif_running(). */
2010-12-13 12:44:07 +00:00 | 	}
2005-04-16 22:20:36 +00:00 |
2010-12-13 12:44:07 +00:00 | 	dev_deactivate_many(head);
2008-02-13 07:10:11 +00:00 |
2013-10-06 02:26:05 +00:00 | 	list_for_each_entry(dev, head, close_list) {
2010-12-13 12:44:07 +00:00 | 		const struct net_device_ops *ops = dev->netdev_ops;
2005-04-16 22:20:36 +00:00 |
2010-12-13 12:44:07 +00:00 | 		/*
                           | 		 *	Call the device specific close. This cannot fail.
                           | 		 *	Only if device is UP
                           | 		 *
                           | 		 *	We allow it to be called even after a DETACH hot-plug
                           | 		 *	event.
                           | 		 */
                           | 		if (ops->ndo_stop)
                           | 			ops->ndo_stop(dev);
                           |
                           | 		dev->flags &= ~IFF_UP;
2014-03-27 22:39:03 +00:00 | 		netpoll_poll_enable(dev);
2010-12-13 12:44:07 +00:00 | 	}
                           | }
                           |
2017-07-18 22:59:27 +00:00 | static void __dev_close(struct net_device *dev)
2010-12-13 12:44:07 +00:00 | {
                           | 	LIST_HEAD(single);
                           |
2013-10-06 02:26:05 +00:00 | 	list_add(&dev->close_list, &single);
2017-07-18 22:59:27 +00:00 | 	__dev_close_many(&single);
2011-02-17 22:54:38 +00:00 | 	list_del(&single);
2010-12-13 12:44:07 +00:00 | }
                           |
2017-07-18 22:59:27 +00:00 | void dev_close_many(struct list_head *head, bool unlink)
2010-12-13 12:44:07 +00:00 | {
                           | 	struct net_device *dev, *tmp;
2005-04-16 22:20:36 +00:00 |
2013-10-06 02:26:05 +00:00 | 	/* Remove the devices that don't need to be closed */
                           | 	list_for_each_entry_safe(dev, tmp, head, close_list)
2010-12-13 12:44:07 +00:00 | 		if (!(dev->flags & IFF_UP))
2013-10-06 02:26:05 +00:00 | 			list_del_init(&dev->close_list);
2010-12-13 12:44:07 +00:00 |
                           | 	__dev_close_many(head);
2005-04-16 22:20:36 +00:00 |
2013-10-06 02:26:05 +00:00 | 	list_for_each_entry_safe(dev, tmp, head, close_list) {
2013-10-23 23:02:42 +00:00 | 		rtmsg_ifinfo(RTM_NEWLINK, dev, IFF_UP|IFF_RUNNING, GFP_KERNEL);
2010-12-13 12:44:07 +00:00 | 		call_netdevice_notifiers(NETDEV_DOWN, dev);
2015-03-19 02:52:33 +00:00 | 		if (unlink)
                           | 			list_del_init(&dev->close_list);
2010-12-13 12:44:07 +00:00 | 	}
2010-02-26 06:34:53 +00:00 | }
2015-03-19 02:52:33 +00:00 | EXPORT_SYMBOL(dev_close_many);
2010-02-26 06:34:53 +00:00 |
                           | /**
                           |  *	dev_close - shutdown an interface.
                           |  *	@dev: device to shutdown
                           |  *
                           |  *	This function moves an active device into down state. A
                           |  *	%NETDEV_GOING_DOWN is sent to the netdev notifier chain. The device
                           |  *	is then deactivated and finally a %NETDEV_DOWN is sent to the notifier
                           |  *	chain.
                           |  */
2017-07-18 22:59:27 +00:00 | void dev_close(struct net_device *dev)
2010-02-26 06:34:53 +00:00 | {
2011-05-10 19:26:06 +00:00 | 	if (dev->flags & IFF_UP) {
                           | 		LIST_HEAD(single);
2005-04-16 22:20:36 +00:00 |
2013-10-06 02:26:05 +00:00 | 		list_add(&dev->close_list, &single);
2015-03-19 02:52:33 +00:00 | 		dev_close_many(&single, true);
2011-05-10 19:26:06 +00:00 | 		list_del(&single);
                           | 	}
2005-04-16 22:20:36 +00:00 | }
2009-09-03 08:29:39 +00:00 | EXPORT_SYMBOL(dev_close);
2005-04-16 22:20:36 +00:00 |
                           |
2008-06-19 23:15:47 +00:00 | /**
                           |  *	dev_disable_lro - disable Large Receive Offload on a device
                           |  *	@dev: device
                           |  *
                           |  *	Disable Large Receive Offload (LRO) on a net device. Must be
                           |  *	called under RTNL. This is needed if received packets may be
                           |  *	forwarded to another interface.
                           |  */
                           | void dev_disable_lro(struct net_device *dev)
                           | {
2014-11-13 06:54:50 +00:00 | 	struct net_device *lower_dev;
                           | 	struct list_head *iter;
2013-11-15 05:18:50 +00:00 |
2011-11-15 15:29:55 +00:00 | 	dev->wanted_features &= ~NETIF_F_LRO;
                           | 	netdev_update_features(dev);
2011-03-18 16:56:34 +00:00 |
2011-04-21 12:42:15 +00:00 | 	if (unlikely(dev->features & NETIF_F_LRO))
                           | 		netdev_WARN(dev, "failed to disable LRO!\n");
2014-11-13 06:54:50 +00:00 |
                           | 	netdev_for_each_lower_dev(dev, lower_dev, iter)
                           | 		dev_disable_lro(lower_dev);
2008-06-19 23:15:47 +00:00 | }
                           | EXPORT_SYMBOL(dev_disable_lro);
                           |
2013-05-28 01:30:21 +00:00 | static int call_netdevice_notifier(struct notifier_block *nb, unsigned long val,
                           | 				   struct net_device *dev)
                           | {
                           | 	struct netdev_notifier_info info;
                           |
                           | 	netdev_notifier_info_init(&info, dev);
                           | 	return nb->notifier_call(nb, val, &info);
                           | }
2008-06-19 23:15:47 +00:00 |
2007-09-17 18:56:21 +00:00 | static int dev_boot_phase = 1;
                           |
2005-04-16 22:20:36 +00:00 | /**
2017-02-09 06:56:04 +00:00 |  * register_netdevice_notifier - register a network notifier block
                           |  * @nb: notifier
2005-04-16 22:20:36 +00:00 |  *
2017-02-09 06:56:04 +00:00 |  * Register a notifier to be called when network device events occur.
                           |  * The notifier passed is linked into the kernel structures and must
                           |  * not be reused until it has been unregistered. A negative errno code
                           |  * is returned on a failure.
2005-04-16 22:20:36 +00:00 |  *
2017-02-09 06:56:04 +00:00 |  * When registered all registration and up events are replayed
                           |  * to the new notifier to allow device to have a race free
                           |  * view of the network device list.
2005-04-16 22:20:36 +00:00 |  */
                           |
                           | int register_netdevice_notifier(struct notifier_block *nb)
                           | {
                           | 	struct net_device *dev;
2007-07-31 00:03:38 +00:00 | 	struct net_device *last;
2007-09-17 18:56:21 +00:00 | 	struct net *net;
2005-04-16 22:20:36 +00:00 | 	int err;
                           |
                           | 	rtnl_lock();
2006-05-09 22:23:03 +00:00 | 	err = raw_notifier_chain_register(&netdev_chain, nb);
2007-07-31 00:03:38 +00:00 | 	if (err)
                           | 		goto unlock;
2007-09-17 18:56:21 +00:00 | 	if (dev_boot_phase)
                           | 		goto unlock;
                           | 	for_each_net(net) {
                           | 		for_each_netdev(net, dev) {
2013-05-28 01:30:21 +00:00 | 			err = call_netdevice_notifier(nb, NETDEV_REGISTER, dev);
2007-09-17 18:56:21 +00:00 | 			err = notifier_to_errno(err);
                           | 			if (err)
                           | 				goto rollback;
                           |
                           | 			if (!(dev->flags & IFF_UP))
                           | 				continue;
2005-04-16 22:20:36 +00:00 |
2013-05-28 01:30:21 +00:00 | 			call_netdevice_notifier(nb, NETDEV_UP, dev);
2007-09-17 18:56:21 +00:00 | 		}
2005-04-16 22:20:36 +00:00 | 	}
2007-07-31 00:03:38 +00:00 |
                           | unlock:
2005-04-16 22:20:36 +00:00 | 	rtnl_unlock();
                           | 	return err;
2007-07-31 00:03:38 +00:00 |
                           | rollback:
                           | 	last = dev;
2007-09-17 18:56:21 +00:00 | 	for_each_net(net) {
                           | 		for_each_netdev(net, dev) {
                           | 			if (dev == last)
2011-12-01 04:43:07 +00:00 | 				goto outroll;
2007-07-31 00:03:38 +00:00 |
2007-09-17 18:56:21 +00:00 | 			if (dev->flags & IFF_UP) {
2013-05-28 01:30:21 +00:00 | 				call_netdevice_notifier(nb, NETDEV_GOING_DOWN,
                           | 							dev);
                           | 				call_netdevice_notifier(nb, NETDEV_DOWN, dev);
2007-09-17 18:56:21 +00:00 | 			}
2013-05-28 01:30:21 +00:00 | 			call_netdevice_notifier(nb, NETDEV_UNREGISTER, dev);
2007-07-31 00:03:38 +00:00 | 		}
                           | 	}
2007-11-14 23:53:16 +00:00 |
2011-12-01 04:43:07 +00:00 | outroll:
2007-11-14 23:53:16 +00:00 | 	raw_notifier_chain_unregister(&netdev_chain, nb);
2007-07-31 00:03:38 +00:00 | 	goto unlock;
2005-04-16 22:20:36 +00:00 | }
2009-09-03 08:29:39 +00:00 | EXPORT_SYMBOL(register_netdevice_notifier);
2005-04-16 22:20:36 +00:00 |
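The replay-and-rollback behaviour of register_netdevice_notifier() above can be tried outside the kernel with a small model. The following is only a userspace sketch: every `model_*` name, the `EV_*` constants and the fixed-size event log are assumptions made for illustration, and locking, network namespaces and notifier_to_errno() are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event codes standing in for the NETDEV_* constants. */
enum { EV_REGISTER, EV_UP, EV_GOING_DOWN, EV_DOWN, EV_UNREGISTER };

struct model_dev {
	const char *name;
	int up;			/* non-zero models IFF_UP */
};

typedef int (*model_notifier_fn)(int event, const struct model_dev *dev);

/* Event log filled by the sample notifier below. */
int model_log[32];
size_t model_log_len;
/* Device whose EV_REGISTER the sample notifier rejects (NULL: none). */
const struct model_dev *model_reject;

/* Sample notifier: records each event and may veto one registration. */
int model_sample_notifier(int event, const struct model_dev *dev)
{
	if (model_log_len < 32)
		model_log[model_log_len++] = event;
	if (event == EV_REGISTER && dev == model_reject)
		return -1;
	return 0;
}

/*
 * Replay REGISTER (and UP for devices that are up) to a new notifier.
 * If the notifier rejects a device, undo the replay for every device
 * seen before the failing one, then return the error.
 */
int model_replay(model_notifier_fn fn, const struct model_dev *devs, size_t n)
{
	size_t i, j;
	int err;

	assert(fn != NULL);

	for (i = 0; i < n; i++) {
		err = fn(EV_REGISTER, &devs[i]);
		if (err)
			goto rollback;
		if (devs[i].up)
			fn(EV_UP, &devs[i]);
	}
	return 0;

rollback:
	/* devs[i] is the failing device and is deliberately excluded. */
	for (j = 0; j < i; j++) {
		if (devs[j].up) {
			fn(EV_GOING_DOWN, &devs[j]);
			fn(EV_DOWN, &devs[j]);
		}
		fn(EV_UNREGISTER, &devs[j]);
	}
	return err;
}
```

As in the kernel function, the device that caused the failure receives no teardown events; only the devices already announced to the notifier are rolled back.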
                           | /**
2017-02-09 06:56:04 +00:00 |  * unregister_netdevice_notifier - unregister a network notifier block
                           |  * @nb: notifier
2005-04-16 22:20:36 +00:00 |  *
2017-02-09 06:56:04 +00:00 |  * Unregister a notifier previously registered by
                           |  * register_netdevice_notifier(). The notifier is unlinked from the
                           |  * kernel structures and may then be reused. A negative errno code
                           |  * is returned on a failure.
2012-04-06 15:33:35 +00:00 |  *
2017-02-09 06:56:04 +00:00 |  * After unregistering, unregister and down device events are synthesized
                           |  * for all devices on the device list to the removed notifier to remove
                           |  * the need for special case cleanup code.
2005-04-16 22:20:36 +00:00 |  */
                           |
                           | int unregister_netdevice_notifier(struct notifier_block *nb)
                           | {
2012-04-06 15:33:35 +00:00 | 	struct net_device *dev;
                           | 	struct net *net;
2006-03-25 09:24:25 +00:00 | 	int err;
                           |
                           | 	rtnl_lock();
2006-05-09 22:23:03 +00:00 | 	err = raw_notifier_chain_unregister(&netdev_chain, nb);
2012-04-06 15:33:35 +00:00 | 	if (err)
                           | 		goto unlock;
                           |
                           | 	for_each_net(net) {
                           | 		for_each_netdev(net, dev) {
                           | 			if (dev->flags & IFF_UP) {
2013-05-28 01:30:21 +00:00 | 				call_netdevice_notifier(nb, NETDEV_GOING_DOWN,
                           | 							dev);
                           | 				call_netdevice_notifier(nb, NETDEV_DOWN, dev);
2012-04-06 15:33:35 +00:00 | 			}
2013-05-28 01:30:21 +00:00 | 			call_netdevice_notifier(nb, NETDEV_UNREGISTER, dev);
2012-04-06 15:33:35 +00:00 | 		}
                           | 	}
                           | unlock:
2006-03-25 09:24:25 +00:00 | 	rtnl_unlock();
                           | 	return err;
2005-04-16 22:20:36 +00:00 | }
2009-09-03 08:29:39 +00:00 | EXPORT_SYMBOL(unregister_netdevice_notifier);
2005-04-16 22:20:36 +00:00 |
2013-05-28 01:30:21 +00:00 | /**
                           |  *	call_netdevice_notifiers_info - call all network notifier blocks
                           |  *	@val: value passed unmodified to notifier function
                           |  *	@dev: net_device pointer passed unmodified to notifier function
                           |  *	@info: notifier information data
                           |  *
                           |  *	Call all network notifier blocks. Parameters and return value
                           |  *	are as for raw_notifier_call_chain().
                           |  */
                           |
2013-12-29 22:01:29 +00:00 | static int call_netdevice_notifiers_info(unsigned long val,
                           | 					 struct net_device *dev,
                           | 					 struct netdev_notifier_info *info)
2013-05-28 01:30:21 +00:00 | {
                           | 	ASSERT_RTNL();
                           | 	netdev_notifier_info_init(info, dev);
                           | 	return raw_notifier_call_chain(&netdev_chain, val, info);
                           | }
                           |
2005-04-16 22:20:36 +00:00 | /**
                           |  *	call_netdevice_notifiers - call all network notifier blocks
                           |  *	@val: value passed unmodified to notifier function
2007-10-13 04:17:49 +00:00 |  *	@dev: net_device pointer passed unmodified to notifier function
2005-04-16 22:20:36 +00:00 |  *
                           |  *	Call all network notifier blocks. Parameters and return value
2006-05-09 22:23:03 +00:00 |  *	are as for raw_notifier_call_chain().
2005-04-16 22:20:36 +00:00 |  */
                           |
2007-09-16 22:33:32 +00:00 | int call_netdevice_notifiers(unsigned long val, struct net_device *dev)
2005-04-16 22:20:36 +00:00 | {
2013-05-28 01:30:21 +00:00 | 	struct netdev_notifier_info info;
                           |
                           | 	return call_netdevice_notifiers_info(val, dev, &info);
2005-04-16 22:20:36 +00:00 | }
2011-03-24 13:24:01 +00:00 | EXPORT_SYMBOL(call_netdevice_notifiers);
2005-04-16 22:20:36 +00:00 |
2015-05-13 16:19:37 +00:00 | #ifdef CONFIG_NET_INGRESS
net: use jump label patching for ingress qdisc in __netif_receive_skb_core
Even if we make use of classifier and actions from the egress
path, we're going into handle_ing() executing additional code
on a per-packet cost for ingress qdisc, just to realize that
nothing is attached on ingress.
Instead, this can just be blinded out as a no-op entirely with
the use of a static key. On input fast-path, we already make
use of static keys in various places, e.g. skb time stamping,
in RPS, etc. It makes sense to not waste time when we're assured
that no ingress qdisc is attached anywhere.
Enabling/disabling of that code path is being done via two
helpers, namely net_{inc,dec}_ingress_queue(), that are being
invoked under RTNL mutex when an ingress qdisc is being either
initialized or destructed.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
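To make the reference counting described in this commit message concrete, here is a minimal userspace sketch in C. It is not kernel code: a real static key patches the branch at run time, whereas this model still performs an atomic load per packet. All `model_*` names are hypothetical. It only illustrates that the ingress handler is skipped while no ingress qdisc holds a reference.

```c
#include <assert.h>
#include <stdatomic.h>

/* Userspace stand-in for the static key: a count of interested users. */
static atomic_int model_ingress_needed;

/* Number of times the (pretend) ingress handler has run. */
static int model_ingress_runs;

/* Models net_inc_ingress_queue(): an ingress qdisc was initialized. */
void model_inc_ingress_queue(void)
{
	atomic_fetch_add(&model_ingress_needed, 1);
}

/* Models net_dec_ingress_queue(): an ingress qdisc was destroyed. */
void model_dec_ingress_queue(void)
{
	int prev = atomic_fetch_sub(&model_ingress_needed, 1);

	/* Dropping a reference that was never taken is a caller bug. */
	assert(prev > 0);
	(void)prev;
}

/*
 * Hypothetical per-packet receive path. The ingress handler only runs
 * while at least one user holds a reference. Returns the number of
 * handler runs so far.
 */
int model_receive(void)
{
	if (atomic_load(&model_ingress_needed) > 0)
		model_ingress_runs++;
	return model_ingress_runs;
}
```

With no reference held, model_receive() leaves the run count untouched; after one model_inc_ingress_queue() call each received packet increments it, until the last reference is dropped again.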
2015-04-10 21:07:54 +00:00 | static struct static_key ingress_needed __read_mostly;
                           |
                           | void net_inc_ingress_queue(void)
                           | {
                           | 	static_key_slow_inc(&ingress_needed);
                           | }
                           | EXPORT_SYMBOL_GPL(net_inc_ingress_queue);
                           |
                           | void net_dec_ingress_queue(void)
                           | {
                           | 	static_key_slow_dec(&ingress_needed);
                           | }
                           | EXPORT_SYMBOL_GPL(net_dec_ingress_queue);
                           | #endif
                           |
net, sched: add clsact qdisc
This work adds a generalization of the ingress qdisc as a qdisc holding
only classifiers. The clsact qdisc works on ingress, but also on egress.
In both cases, its execution happens without taking the qdisc lock, and
the main difference for the egress part compared to prior version of [1]
is that this can be applied with _any_ underlying real egress qdisc (also
classless ones).
Besides solving the use-case of [1], that is, allowing for more programmability
on assigning skb->priority for the mqprio case that is supported by most
popular 10G+ NICs, it also opens up a lot more flexibility for other tc
applications. The main work on classification can already be done at clsact
egress time if the use-case allows and state stored for later retrieval
f.e. again in skb->priority with major/minors (which is checked by most
classful qdiscs before consulting tc_classify()) and/or in other skb fields
like skb->tc_index for some light-weight post-processing to get to the
eventual classid in case of a classful qdisc. Another use case is that
the clsact egress part allows to have a central egress counterpart to
the ingress classifiers, so that classifiers can easily share state (e.g.
in cls_bpf via eBPF maps) for ingress and egress.
Currently, default setups like mq + pfifo_fast would require for this to
use, for example, prio qdisc instead (to get a tc_classify() run) and to
duplicate the egress classifier for each queue. With clsact, it allows
for leaving the setup as is, it can additionally assign skb->priority to
put the skb in one of pfifo_fast's bands and it can share state with maps.
Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
w/o the need to perform a skb_dst_force() to hold on to it any longer. In
lwt case, we can also use this facility to setup dst metadata via cls_bpf
(bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
that (case of IFF_NO_QUEUE devices, for example).
The realization can be done without any changes to the scheduler core
framework. All it takes is that we have two a-priori defined minors/child
classes, where we can mux between ingress and egress classifier list
(dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
dev->_tx to avoid extra cacheline miss for moderate loads). The egress
part is modelled somewhat similarly to handle_ing() and patched to a noop in
case the functionality is not used. Both handlers are now called
sch_handle_ingress() and sch_handle_egress(), code sharing among the two
doesn't seem practical as there are various minor differences in both
paths, so that making them conditional in a single handler would rather
slow things down.
Full compatibility to ingress qdisc is provided as well. Since both
piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
per netdevice, and thus ingress qdisc specific behaviour can be retained
for user space. This means, either a user does 'tc qdisc add dev foo ingress'
and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
alternative, where both, ingress and egress classifier can be configured
as in the below example. ingress qdisc supports attaching classifier to any
minor number whereas clsact has two fixed minors for muxing between the
lists, therefore to not break user space setups, they are better done as
two separate qdiscs.
I decided to extend the sch_ingress module with clsact functionality so
that commonly used code can be reused, the module is being aliased with
sch_clsact so that it can be auto-loaded properly. Alternative would have been
to add a flag when initializing ingress to alter its behaviour plus aliasing
to a different name (as it's more than just ingress). However, the first would
end up, based on the flag, choosing the new/old behaviour by calling different
function implementations to handle each anyway, the latter would require to
register ingress qdisc once again under different alias. So, this really begs
to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
by its own that share callbacks used by both.

Example, adding qdisc:

   # tc qdisc add dev foo clsact
   # tc qdisc show dev foo
   qdisc mq 0: root
   qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc clsact ffff: parent ffff:fff1

Adding filters (deleting, etc. works analogously by specifying ingress/egress):

   # tc filter add dev foo ingress bpf da obj bar.o sec ingress
   # tc filter add dev foo egress bpf da obj bar.o sec egress
   # tc filter show dev foo ingress
   filter protocol all pref 49152 bpf
   filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
   # tc filter show dev foo egress
   filter protocol all pref 49152 bpf
   filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action

A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
show an empty list for clsact. Either using the parent names (ingress/egress)
or specifying the full major/minor will then show the related filter lists.

Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.

  [1] http://patchwork.ozlabs.org/patch/512949/

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-07 21:29:47 +00:00 | #ifdef CONFIG_NET_EGRESS
                           | static struct static_key egress_needed __read_mostly;
                           |
                           | void net_inc_egress_queue(void)
                           | {
                           | 	static_key_slow_inc(&egress_needed);
                           | }
                           | EXPORT_SYMBOL_GPL(net_inc_egress_queue);
                           |
                           | void net_dec_egress_queue(void)
                           | {
                           | 	static_key_slow_dec(&egress_needed);
                           | }
                           | EXPORT_SYMBOL_GPL(net_dec_egress_queue);
                           | #endif
                           |
2012-02-24 07:31:31 +00:00 | static struct static_key netstamp_needed __read_mostly;
2011-11-28 11:16:50 +00:00 | #ifdef HAVE_JUMP_LABEL
                           | static atomic_t netstamp_needed_deferred;
2017-03-01 22:28:39 +00:00 | static atomic_t netstamp_wanted;
net: use a work queue to defer net_disable_timestamp() work
Dmitry reported a warning [1] showing that we were calling
net_disable_timestamp() -> static_key_slow_dec() from a non
process context.
Grabbing a mutex while holding a spinlock or rcu_read_lock()
is not allowed.
As Cong suggested, we now use a work queue.
It is possible netstamp_clear() exits while netstamp_needed_deferred
is not zero, but it is probably not worth trying to do better than that.
netstamp_needed_deferred atomic tracks the exact number of deferred
decrements.
[1]
[ INFO: suspicious RCU usage. ]
4.10.0-rc5+ #192 Not tainted
-------------------------------
./include/linux/rcupdate.h:561 Illegal context switch in RCU read-side
critical section!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 0
2 locks held by syz-executor14/23111:
#0: (sk_lock-AF_INET6){+.+.+.}, at: [<ffffffff83a35c35>] lock_sock
include/net/sock.h:1454 [inline]
#0: (sk_lock-AF_INET6){+.+.+.}, at: [<ffffffff83a35c35>]
rawv6_sendmsg+0x1e65/0x3ec0 net/ipv6/raw.c:919
#1: (rcu_read_lock){......}, at: [<ffffffff83ae2678>] nf_hook
include/linux/netfilter.h:201 [inline]
#1: (rcu_read_lock){......}, at: [<ffffffff83ae2678>]
__ip6_local_out+0x258/0x840 net/ipv6/output_core.c:160
stack backtrace:
CPU: 2 PID: 23111 Comm: syz-executor14 Not tainted 4.10.0-rc5+ #192
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:15 [inline]
dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
lockdep_rcu_suspicious+0x139/0x180 kernel/locking/lockdep.c:4452
rcu_preempt_sleep_check include/linux/rcupdate.h:560 [inline]
___might_sleep+0x560/0x650 kernel/sched/core.c:7748
__might_sleep+0x95/0x1a0 kernel/sched/core.c:7739
mutex_lock_nested+0x24f/0x1730 kernel/locking/mutex.c:752
atomic_dec_and_mutex_lock+0x119/0x160 kernel/locking/mutex.c:1060
__static_key_slow_dec+0x7a/0x1e0 kernel/jump_label.c:149
static_key_slow_dec+0x51/0x90 kernel/jump_label.c:174
net_disable_timestamp+0x3b/0x50 net/core/dev.c:1728
sock_disable_timestamp+0x98/0xc0 net/core/sock.c:403
__sk_destruct+0x27d/0x6b0 net/core/sock.c:1441
sk_destruct+0x47/0x80 net/core/sock.c:1460
__sk_free+0x57/0x230 net/core/sock.c:1468
sock_wfree+0xae/0x120 net/core/sock.c:1645
skb_release_head_state+0xfc/0x200 net/core/skbuff.c:655
skb_release_all+0x15/0x60 net/core/skbuff.c:668
__kfree_skb+0x15/0x20 net/core/skbuff.c:684
kfree_skb+0x16e/0x4c0 net/core/skbuff.c:705
inet_frag_destroy+0x121/0x290 net/ipv4/inet_fragment.c:304
inet_frag_put include/net/inet_frag.h:133 [inline]
nf_ct_frag6_gather+0x1106/0x3840
net/ipv6/netfilter/nf_conntrack_reasm.c:617
ipv6_defrag+0x1be/0x2b0 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c:68
nf_hook_entry_hookfn include/linux/netfilter.h:102 [inline]
nf_hook_slow+0xc3/0x290 net/netfilter/core.c:310
nf_hook include/linux/netfilter.h:212 [inline]
__ip6_local_out+0x489/0x840 net/ipv6/output_core.c:160
ip6_local_out+0x2d/0x170 net/ipv6/output_core.c:170
ip6_send_skb+0xa1/0x340 net/ipv6/ip6_output.c:1722
ip6_push_pending_frames+0xb3/0xe0 net/ipv6/ip6_output.c:1742
rawv6_push_pending_frames net/ipv6/raw.c:613 [inline]
rawv6_sendmsg+0x2d1a/0x3ec0 net/ipv6/raw.c:927
inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:744
sock_sendmsg_nosec net/socket.c:635 [inline]
sock_sendmsg+0xca/0x110 net/socket.c:645
sock_write_iter+0x326/0x600 net/socket.c:848
do_iter_readv_writev+0x2e3/0x5b0 fs/read_write.c:695
do_readv_writev+0x42c/0x9b0 fs/read_write.c:872
vfs_writev+0x87/0xc0 fs/read_write.c:911
do_writev+0x110/0x2c0 fs/read_write.c:944
SYSC_writev fs/read_write.c:1017 [inline]
SyS_writev+0x27/0x30 fs/read_write.c:1014
entry_SYSCALL_64_fastpath+0x1f/0xc2
RIP: 0033:0x445559
RSP: 002b:00007f6f46fceb58 EFLAGS: 00000292 ORIG_RAX: 0000000000000014
RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000445559
RDX: 0000000000000001 RSI: 0000000020f1eff0 RDI: 0000000000000005
RBP: 00000000006e19c0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000292 R12: 0000000000700000
R13: 0000000020f59000 R14: 0000000000000015 R15: 0000000000020400
BUG: sleeping function called from invalid context at
kernel/locking/mutex.c:752
in_atomic(): 1, irqs_disabled(): 0, pid: 23111, name: syz-executor14
INFO: lockdep is turned off.
CPU: 2 PID: 23111 Comm: syz-executor14 Not tainted 4.10.0-rc5+ #192
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:15 [inline]
dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
___might_sleep+0x47e/0x650 kernel/sched/core.c:7780
__might_sleep+0x95/0x1a0 kernel/sched/core.c:7739
mutex_lock_nested+0x24f/0x1730 kernel/locking/mutex.c:752
atomic_dec_and_mutex_lock+0x119/0x160 kernel/locking/mutex.c:1060
__static_key_slow_dec+0x7a/0x1e0 kernel/jump_label.c:149
static_key_slow_dec+0x51/0x90 kernel/jump_label.c:174
net_disable_timestamp+0x3b/0x50 net/core/dev.c:1728
sock_disable_timestamp+0x98/0xc0 net/core/sock.c:403
__sk_destruct+0x27d/0x6b0 net/core/sock.c:1441
sk_destruct+0x47/0x80 net/core/sock.c:1460
__sk_free+0x57/0x230 net/core/sock.c:1468
sock_wfree+0xae/0x120 net/core/sock.c:1645
skb_release_head_state+0xfc/0x200 net/core/skbuff.c:655
skb_release_all+0x15/0x60 net/core/skbuff.c:668
__kfree_skb+0x15/0x20 net/core/skbuff.c:684
kfree_skb+0x16e/0x4c0 net/core/skbuff.c:705
inet_frag_destroy+0x121/0x290 net/ipv4/inet_fragment.c:304
inet_frag_put include/net/inet_frag.h:133 [inline]
nf_ct_frag6_gather+0x1106/0x3840
net/ipv6/netfilter/nf_conntrack_reasm.c:617
ipv6_defrag+0x1be/0x2b0 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c:68
nf_hook_entry_hookfn include/linux/netfilter.h:102 [inline]
nf_hook_slow+0xc3/0x290 net/netfilter/core.c:310
nf_hook include/linux/netfilter.h:212 [inline]
__ip6_local_out+0x489/0x840 net/ipv6/output_core.c:160
ip6_local_out+0x2d/0x170 net/ipv6/output_core.c:170
ip6_send_skb+0xa1/0x340 net/ipv6/ip6_output.c:1722
ip6_push_pending_frames+0xb3/0xe0 net/ipv6/ip6_output.c:1742
rawv6_push_pending_frames net/ipv6/raw.c:613 [inline]
rawv6_sendmsg+0x2d1a/0x3ec0 net/ipv6/raw.c:927
inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:744
sock_sendmsg_nosec net/socket.c:635 [inline]
sock_sendmsg+0xca/0x110 net/socket.c:645
sock_write_iter+0x326/0x600 net/socket.c:848
do_iter_readv_writev+0x2e3/0x5b0 fs/read_write.c:695
do_readv_writev+0x42c/0x9b0 fs/read_write.c:872
vfs_writev+0x87/0xc0 fs/read_write.c:911
do_writev+0x110/0x2c0 fs/read_write.c:944
SYSC_writev fs/read_write.c:1017 [inline]
SyS_writev+0x27/0x30 fs/read_write.c:1014
entry_SYSCALL_64_fastpath+0x1f/0xc2
RIP: 0033:0x445559
Fixes: b90e5794c5bd ("net: dont call jump_label_dec from irq context")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
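The deferral idea from this commit message, combined with the netstamp_clear() body that follows it, can be modelled in userspace. This is a simplified sketch under stated assumptions: the `model_*` names are hypothetical, the work queue is replaced by calling the worker function directly, and both enable and disable requests are always deferred here, which need not match the fast paths of the real functions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int model_wanted;		/* models netstamp_wanted */
static atomic_int model_deferred;	/* models netstamp_needed_deferred */
static bool model_key_enabled;		/* models the netstamp_needed key */

/* Safe in any context: only records the request, never sleeps. */
void model_enable_timestamp(void)
{
	atomic_fetch_add(&model_deferred, 1);
}

/* Safe in any context: only records the request, never sleeps. */
void model_disable_timestamp(void)
{
	atomic_fetch_sub(&model_deferred, 1);
}

/*
 * Worker, assumed to run in process context where sleeping is allowed.
 * It folds all pending requests into the wanted count in one step and
 * then sets the key state accordingly. Returns the new key state.
 */
bool model_netstamp_clear(void)
{
	int deferred = atomic_exchange(&model_deferred, 0);
	int wanted = atomic_fetch_add(&model_wanted, deferred) + deferred;

	/* More disables than enables would be a caller bug. */
	assert(wanted >= 0);

	model_key_enabled = wanted > 0;
	return model_key_enabled;
}
```

Requests that arrive after the exchange simply stay in the deferred counter until the next worker run, which corresponds to the commit message's remark that the worker may exit while the deferred count is not zero.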
2017-02-02 18:31:35 +00:00 | static void netstamp_clear(struct work_struct *work)
2005-04-16 22:20:36 +00:00 | {
2011-11-28 11:16:50 +00:00 | 	int deferred = atomic_xchg(&netstamp_needed_deferred, 0);
2017-03-01 22:28:39 +00:00 | 	int wanted;
2011-11-28 11:16:50 +00:00 |
2017-03-01 22:28:39 +00:00 | 	wanted = atomic_add_return(deferred, &netstamp_wanted);
                           | 	if (wanted > 0)
                           | 		static_key_enable(&netstamp_needed);
                           | 	else
                           | 		static_key_disable(&netstamp_needed);
do_readv_writev+0x42c/0x9b0 fs/read_write.c:872
vfs_writev+0x87/0xc0 fs/read_write.c:911
do_writev+0x110/0x2c0 fs/read_write.c:944
SYSC_writev fs/read_write.c:1017 [inline]
SyS_writev+0x27/0x30 fs/read_write.c:1014
entry_SYSCALL_64_fastpath+0x1f/0xc2
RIP: 0033:0x445559
Fixes: b90e5794c5bd ("net: dont call jump_label_dec from irq context")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02 18:31:35 +00:00
|
|
|
}
|
|
|
|
static DECLARE_WORK(netstamp_work, netstamp_clear);
|
2011-11-28 11:16:50 +00:00
|
|
|
#endif
|
net: use a work queue to defer net_disable_timestamp() work
Dmitry reported a warning [1] showing that we were calling
net_disable_timestamp() -> static_key_slow_dec() from a non
process context.
Grabbing a mutex while holding a spinlock or rcu_read_lock()
is not allowed.
As Cong suggested, we now use a work queue.
It is possible netstamp_clear() exits while netstamp_needed_deferred
is not zero, but it is probably not worth trying to do better than that.
netstamp_needed_deferred atomic tracks the exact number of deferred
decrements.
[1]
[ INFO: suspicious RCU usage. ]
4.10.0-rc5+ #192 Not tainted
-------------------------------
./include/linux/rcupdate.h:561 Illegal context switch in RCU read-side
critical section!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 0
2 locks held by syz-executor14/23111:
#0: (sk_lock-AF_INET6){+.+.+.}, at: [<ffffffff83a35c35>] lock_sock
include/net/sock.h:1454 [inline]
#0: (sk_lock-AF_INET6){+.+.+.}, at: [<ffffffff83a35c35>]
rawv6_sendmsg+0x1e65/0x3ec0 net/ipv6/raw.c:919
#1: (rcu_read_lock){......}, at: [<ffffffff83ae2678>] nf_hook
include/linux/netfilter.h:201 [inline]
#1: (rcu_read_lock){......}, at: [<ffffffff83ae2678>]
__ip6_local_out+0x258/0x840 net/ipv6/output_core.c:160
stack backtrace:
CPU: 2 PID: 23111 Comm: syz-executor14 Not tainted 4.10.0-rc5+ #192
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:15 [inline]
dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
lockdep_rcu_suspicious+0x139/0x180 kernel/locking/lockdep.c:4452
rcu_preempt_sleep_check include/linux/rcupdate.h:560 [inline]
___might_sleep+0x560/0x650 kernel/sched/core.c:7748
__might_sleep+0x95/0x1a0 kernel/sched/core.c:7739
mutex_lock_nested+0x24f/0x1730 kernel/locking/mutex.c:752
atomic_dec_and_mutex_lock+0x119/0x160 kernel/locking/mutex.c:1060
__static_key_slow_dec+0x7a/0x1e0 kernel/jump_label.c:149
static_key_slow_dec+0x51/0x90 kernel/jump_label.c:174
net_disable_timestamp+0x3b/0x50 net/core/dev.c:1728
sock_disable_timestamp+0x98/0xc0 net/core/sock.c:403
__sk_destruct+0x27d/0x6b0 net/core/sock.c:1441
sk_destruct+0x47/0x80 net/core/sock.c:1460
__sk_free+0x57/0x230 net/core/sock.c:1468
sock_wfree+0xae/0x120 net/core/sock.c:1645
skb_release_head_state+0xfc/0x200 net/core/skbuff.c:655
skb_release_all+0x15/0x60 net/core/skbuff.c:668
__kfree_skb+0x15/0x20 net/core/skbuff.c:684
kfree_skb+0x16e/0x4c0 net/core/skbuff.c:705
inet_frag_destroy+0x121/0x290 net/ipv4/inet_fragment.c:304
inet_frag_put include/net/inet_frag.h:133 [inline]
nf_ct_frag6_gather+0x1106/0x3840
net/ipv6/netfilter/nf_conntrack_reasm.c:617
ipv6_defrag+0x1be/0x2b0 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c:68
nf_hook_entry_hookfn include/linux/netfilter.h:102 [inline]
nf_hook_slow+0xc3/0x290 net/netfilter/core.c:310
nf_hook include/linux/netfilter.h:212 [inline]
__ip6_local_out+0x489/0x840 net/ipv6/output_core.c:160
ip6_local_out+0x2d/0x170 net/ipv6/output_core.c:170
ip6_send_skb+0xa1/0x340 net/ipv6/ip6_output.c:1722
ip6_push_pending_frames+0xb3/0xe0 net/ipv6/ip6_output.c:1742
rawv6_push_pending_frames net/ipv6/raw.c:613 [inline]
rawv6_sendmsg+0x2d1a/0x3ec0 net/ipv6/raw.c:927
inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:744
sock_sendmsg_nosec net/socket.c:635 [inline]
sock_sendmsg+0xca/0x110 net/socket.c:645
sock_write_iter+0x326/0x600 net/socket.c:848
do_iter_readv_writev+0x2e3/0x5b0 fs/read_write.c:695
do_readv_writev+0x42c/0x9b0 fs/read_write.c:872
vfs_writev+0x87/0xc0 fs/read_write.c:911
do_writev+0x110/0x2c0 fs/read_write.c:944
SYSC_writev fs/read_write.c:1017 [inline]
SyS_writev+0x27/0x30 fs/read_write.c:1014
entry_SYSCALL_64_fastpath+0x1f/0xc2
RIP: 0033:0x445559
RSP: 002b:00007f6f46fceb58 EFLAGS: 00000292 ORIG_RAX: 0000000000000014
RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000445559
RDX: 0000000000000001 RSI: 0000000020f1eff0 RDI: 0000000000000005
RBP: 00000000006e19c0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000292 R12: 0000000000700000
R13: 0000000020f59000 R14: 0000000000000015 R15: 0000000000020400
BUG: sleeping function called from invalid context at
kernel/locking/mutex.c:752
in_atomic(): 1, irqs_disabled(): 0, pid: 23111, name: syz-executor14
INFO: lockdep is turned off.
CPU: 2 PID: 23111 Comm: syz-executor14 Not tainted 4.10.0-rc5+ #192
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:15 [inline]
dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
___might_sleep+0x47e/0x650 kernel/sched/core.c:7780
__might_sleep+0x95/0x1a0 kernel/sched/core.c:7739
mutex_lock_nested+0x24f/0x1730 kernel/locking/mutex.c:752
atomic_dec_and_mutex_lock+0x119/0x160 kernel/locking/mutex.c:1060
__static_key_slow_dec+0x7a/0x1e0 kernel/jump_label.c:149
static_key_slow_dec+0x51/0x90 kernel/jump_label.c:174
net_disable_timestamp+0x3b/0x50 net/core/dev.c:1728
sock_disable_timestamp+0x98/0xc0 net/core/sock.c:403
__sk_destruct+0x27d/0x6b0 net/core/sock.c:1441
sk_destruct+0x47/0x80 net/core/sock.c:1460
__sk_free+0x57/0x230 net/core/sock.c:1468
sock_wfree+0xae/0x120 net/core/sock.c:1645
skb_release_head_state+0xfc/0x200 net/core/skbuff.c:655
skb_release_all+0x15/0x60 net/core/skbuff.c:668
__kfree_skb+0x15/0x20 net/core/skbuff.c:684
kfree_skb+0x16e/0x4c0 net/core/skbuff.c:705
inet_frag_destroy+0x121/0x290 net/ipv4/inet_fragment.c:304
inet_frag_put include/net/inet_frag.h:133 [inline]
nf_ct_frag6_gather+0x1106/0x3840
net/ipv6/netfilter/nf_conntrack_reasm.c:617
ipv6_defrag+0x1be/0x2b0 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c:68
nf_hook_entry_hookfn include/linux/netfilter.h:102 [inline]
nf_hook_slow+0xc3/0x290 net/netfilter/core.c:310
nf_hook include/linux/netfilter.h:212 [inline]
__ip6_local_out+0x489/0x840 net/ipv6/output_core.c:160
ip6_local_out+0x2d/0x170 net/ipv6/output_core.c:170
ip6_send_skb+0xa1/0x340 net/ipv6/ip6_output.c:1722
ip6_push_pending_frames+0xb3/0xe0 net/ipv6/ip6_output.c:1742
rawv6_push_pending_frames net/ipv6/raw.c:613 [inline]
rawv6_sendmsg+0x2d1a/0x3ec0 net/ipv6/raw.c:927
inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:744
sock_sendmsg_nosec net/socket.c:635 [inline]
sock_sendmsg+0xca/0x110 net/socket.c:645
sock_write_iter+0x326/0x600 net/socket.c:848
do_iter_readv_writev+0x2e3/0x5b0 fs/read_write.c:695
do_readv_writev+0x42c/0x9b0 fs/read_write.c:872
vfs_writev+0x87/0xc0 fs/read_write.c:911
do_writev+0x110/0x2c0 fs/read_write.c:944
SYSC_writev fs/read_write.c:1017 [inline]
SyS_writev+0x27/0x30 fs/read_write.c:1014
entry_SYSCALL_64_fastpath+0x1f/0xc2
RIP: 0033:0x445559
Fixes: b90e5794c5bd ("net: dont call jump_label_dec from irq context")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-02 18:31:35 +00:00
|
|
|
|
|
|
|
void net_enable_timestamp(void)
|
|
|
|
{
|
2017-03-01 22:28:39 +00:00
|
|
|
#ifdef HAVE_JUMP_LABEL
|
|
|
|
int wanted;
|
|
|
|
|
|
|
|
while (1) {
|
|
|
|
wanted = atomic_read(&netstamp_wanted);
|
|
|
|
if (wanted <= 0)
|
|
|
|
break;
|
|
|
|
if (atomic_cmpxchg(&netstamp_wanted, wanted, wanted + 1) == wanted)
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
atomic_inc(&netstamp_needed_deferred);
|
|
|
|
schedule_work(&netstamp_work);
|
|
|
|
#else
|
2012-02-24 07:31:31 +00:00
|
|
|
static_key_slow_inc(&netstamp_needed);
|
2017-03-01 22:28:39 +00:00
|
|
|
#endif
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(net_enable_timestamp);
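
The loop in net_enable_timestamp() above takes a reference on the fast path only when the counter is already positive, and otherwise falls through to the deferred path. A minimal user-space sketch of that "increment if positive" idiom follows, using C11 atomics instead of the kernel's atomic_cmpxchg(). The names wanted_count and inc_if_positive are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for netstamp_wanted. */
static atomic_int wanted_count;

/*
 * Try to take one more reference without leaving the fast path.
 * Returns true if the counter was positive and has been incremented,
 * false if the caller must fall back to the deferred (slow) path.
 */
static bool inc_if_positive(atomic_int *v)
{
	int old;

	assert(v != NULL);
	old = atomic_load(v);
	while (old > 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}
```

When the counter is zero or negative, the sketch leaves it untouched and reports failure. That corresponds to the break out of the while (1) loop before the deferred counter is updated.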
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
void net_disable_timestamp(void)
|
|
|
|
{
|
2011-11-28 11:16:50 +00:00
|
|
|
#ifdef HAVE_JUMP_LABEL
|
2017-03-01 22:28:39 +00:00
|
|
|
int wanted;
|
|
|
|
|
|
|
|
while (1) {
|
|
|
|
wanted = atomic_read(&netstamp_wanted);
|
|
|
|
if (wanted <= 1)
|
|
|
|
break;
|
|
|
|
if (atomic_cmpxchg(&netstamp_wanted, wanted, wanted - 1) == wanted)
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
atomic_dec(&netstamp_needed_deferred);
|
2017-02-02 18:31:35 +00:00
|
|
|
schedule_work(&netstamp_work);
|
|
|
|
#else
|
2012-02-24 07:31:31 +00:00
|
|
|
static_key_slow_dec(&netstamp_needed);
|
2017-02-02 18:31:35 +00:00
|
|
|
#endif
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(net_disable_timestamp);
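
On the slow path, both functions above only adjust netstamp_needed_deferred and call schedule_work(). The operation that may sleep is left to the worker. The worker body is not shown in this section, so the sketch below is an assumption about the general pattern rather than the kernel's netstamp_clear(). It is a user-space model in which callers record requests atomically and a later drain step applies them all at once. All names in it are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins; none of these names exist in the kernel. */
static atomic_int deferred;	/* pending +1 / -1 requests */
static int key_count;		/* state that only the worker modifies */

/* Callable from a context that must not sleep: only record the request. */
static void request_change(int delta)
{
	assert(delta == 1 || delta == -1);
	atomic_fetch_add(&deferred, delta);
	/* The kernel code calls schedule_work() at this point. */
}

/* Worker body: runs later, in a context that is allowed to sleep. */
static void drain_deferred(void)
{
	/* Claim everything queued so far in one atomic step. */
	int pending = atomic_exchange(&deferred, 0);

	key_count += pending;
}
```

Requests that arrive after the atomic_exchange() are not lost. They stay in the counter until the next drain.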
|
2005-04-16 22:20:36 +00:00
|
|
|
|
net: Consistent skb timestamping
With RPS inclusion, skb timestamping is not consistent in the RX path.
If netif_receive_skb() is used, it is deferred until after RPS dispatch.
If netif_rx() is used, it is done before RPS dispatch.
This can give strange tcpdump timestamp results.
I think timestamping should be done as soon as possible in the receive
path, to get meaningful values (i.e. timestamps taken at the time the packet
was delivered by the NIC driver to our stack), even if NAPI can already
defer timestamping a bit (RPS can help to reduce the gap).
Tom Herbert prefers to sample timestamps after RPS dispatch. In case
sampling is expensive (HPET/acpi_pm on x86), this makes sense.
Let admins switch from one mode to another, using a new
sysctl, /proc/sys/net/core/netdev_tstamp_prequeue.
Its default value (1) means timestamps are taken as soon as possible,
before backlog queueing, giving accurate timestamps.
Setting it to 0 samples timestamps when processing the backlog,
after RPS dispatch, to lower the load on the pre-RPS CPU.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-16 06:57:10 +00:00
|
|
|
static inline void net_timestamp_set(struct sk_buff *skb)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2016-12-25 10:38:40 +00:00
|
|
|
skb->tstamp = 0;
|
2012-02-24 07:31:31 +00:00
|
|
|
if (static_key_false(&netstamp_needed))
|
2005-08-15 00:24:31 +00:00
|
|
|
__net_timestamp(skb);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2011-11-15 04:12:55 +00:00
|
|
|
#define net_timestamp_check(COND, SKB) \
|
2012-02-24 07:31:31 +00:00
|
|
|
if (static_key_false(&netstamp_needed)) { \
|
2016-12-25 10:38:40 +00:00
|
|
|
if ((COND) && !(SKB)->tstamp) \
|
2011-11-15 04:12:55 +00:00
|
|
|
__net_timestamp(SKB); \
|
|
|
|
} \
|
2010-05-16 06:57:10 +00:00
|
|
|
|
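
The net_timestamp_check() macro above stamps a packet only when its condition holds and no timestamp has been set yet. The user-space sketch below shows how one such check before queueing and one during backlog processing could implement the netdev_tstamp_prequeue switch described in the commit message. The two call sites are an assumption, since they are not shown in this section, and the static key test is left out. struct pkt, timestamp_check and receive are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical packet; a tstamp of 0 means "not stamped yet". */
struct pkt {
	long long tstamp;
};

/* Stamp the packet only if cond holds and no timestamp is set yet. */
static void timestamp_check(bool cond, struct pkt *p, long long now)
{
	assert(p != NULL);
	if (cond && !p->tstamp)
		p->tstamp = now;
}

/*
 * Two-stage receive path: the first check runs before queueing,
 * the second when the backlog is processed.
 */
static void receive(struct pkt *p, bool prequeue,
		    long long t_enqueue, long long t_backlog)
{
	timestamp_check(prequeue, p, t_enqueue);
	timestamp_check(!prequeue, p, t_backlog);
}
```

With prequeue set, the packet carries the earlier time. With it cleared, the cost of reading the clock moves to the stage that processes the backlog.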
2016-04-28 15:59:28 +00:00
|
|
|
bool is_skb_forwardable(const struct net_device *dev, const struct sk_buff *skb)
|
2011-03-30 09:42:17 +00:00
|
|
|
{
|
|
|
|
unsigned int len;
|
|
|
|
|
|
|
|
if (!(dev->flags & IFF_UP))
|
|
|
|
return false;
|
|
|
|
|
|
|
|
len = dev->mtu + dev->hard_header_len + VLAN_HLEN;
|
|
|
|
if (skb->len <= len)
|
|
|
|
return true;
|
|
|
|
|
|
|
|
/* if TSO is enabled, we don't care about the length as the packet
|
|
|
|
* could be forwarded without being segmented before
|
|
|
|
*/
|
|
|
|
if (skb_is_gso(skb))
|
|
|
|
return true;
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
2014-03-27 21:32:29 +00:00
|
|
|
EXPORT_SYMBOL_GPL(is_skb_forwardable);
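
As a worked example of the length test in is_skb_forwardable(): for an Ethernet device with an MTU of 1500 and a 14-byte link-layer header, the limit is 1500 + 14 + 4 = 1518 bytes, where 4 is VLAN_HLEN. The sketch below reproduces the same decision in user space. The struct and function names are hypothetical simplifications, not kernel types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_VLAN_HLEN 4	/* assumed to match the kernel's VLAN_HLEN */

/* Hypothetical, reduced views of struct net_device and struct sk_buff. */
struct dev_sketch {
	bool up;
	unsigned int mtu;
	unsigned int hard_header_len;
};

struct pkt_sketch {
	unsigned int len;
	bool gso;
};

static bool forwardable(const struct dev_sketch *dev,
			const struct pkt_sketch *pkt)
{
	unsigned int limit;

	assert(dev != NULL && pkt != NULL);
	if (!dev->up)
		return false;

	limit = dev->mtu + dev->hard_header_len + SKETCH_VLAN_HLEN;
	if (pkt->len <= limit)
		return true;

	/* Oversized packets pass only if they are GSO packets. */
	return pkt->gso;
}
```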
|
2011-03-30 09:42:17 +00:00
|
|
|
|
2014-04-17 05:45:03 +00:00
|
|
|
int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
|
|
|
|
{
|
2016-11-09 23:36:33 +00:00
|
|
|
int ret = ____dev_forward_skb(dev, skb);
|
2014-04-17 05:45:03 +00:00
|
|
|
|
2016-11-09 23:36:33 +00:00
|
|
|
if (likely(!ret)) {
|
|
|
|
skb->protocol = eth_type_trans(skb, dev);
|
|
|
|
skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
|
|
|
|
}
|
2014-04-17 05:45:03 +00:00
|
|
|
|
2016-11-09 23:36:33 +00:00
|
|
|
return ret;
|
2014-04-17 05:45:03 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(__dev_forward_skb);
|
|
|
|
|
2009-11-26 06:07:08 +00:00
|
|
|
/**
|
|
|
|
* dev_forward_skb - loopback an skb to another netif
|
|
|
|
*
|
|
|
|
* @dev: destination network device
|
|
|
|
* @skb: buffer to forward
|
|
|
|
*
|
|
|
|
* return values:
|
|
|
|
* NET_RX_SUCCESS (no congestion)
|
2010-05-06 07:53:53 +00:00
|
|
|
* NET_RX_DROP (packet was dropped, but freed)
|
2009-11-26 06:07:08 +00:00
|
|
|
*
|
|
|
|
* dev_forward_skb can be used for injecting an skb from the
|
|
|
|
* start_xmit function of one device into the receive queue
|
|
|
|
* of another device.
|
|
|
|
*
|
|
|
|
* The receiving device may be in another namespace, so
|
|
|
|
* we have to clear all information in the skb that could
|
|
|
|
* impact namespace isolation.
|
|
|
|
*/
|
|
|
|
int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
|
|
|
|
{
|
2014-04-17 05:45:03 +00:00
|
|
|
return __dev_forward_skb(dev, skb) ?: netif_rx_internal(skb);
|
2009-11-26 06:07:08 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(dev_forward_skb);
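
The body of dev_forward_skb() uses the GNU C conditional with an omitted middle operand, a ?: b. This yields a when a is nonzero and evaluates b only otherwise. The sketch below shows the same control flow in portable C. prepare, deliver and forward are hypothetical stand-ins for the two stages, and the status values are chosen only for illustration.

```c
#include <assert.h>

/* Counts how often the second stage actually ran. */
static int calls_to_deliver;

/* Hypothetical first stage: 0 on success, nonzero status on drop. */
static int prepare(int ok)
{
	return ok ? 0 : 1;
}

/* Hypothetical second stage: only reached if prepare() succeeded. */
static int deliver(void)
{
	calls_to_deliver++;
	return 0;
}

static int forward(int ok)
{
	int ret = prepare(ok);

	assert(ret == 0 || ret == 1);
	/* Equivalent to `prepare(ok) ?: deliver()` in GNU C. */
	return ret ? ret : deliver();
}
```

A nonzero status from the first stage is returned unchanged and the second stage is never called.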
|
|
|
|
|
2010-12-15 19:57:25 +00:00
|
|
|
static inline int deliver_skb(struct sk_buff *skb,
|
|
|
|
struct packet_type *pt_prev,
|
|
|
|
struct net_device *orig_dev)
|
|
|
|
{
|
sock: enable MSG_ZEROCOPY
Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
skb_zerocopy_clone() wherever needed due to skb split, merge, resize
or clone.
Split skb_orphan_frags into two variants. The split, merge, .. paths
support reference counted zerocopy buffers, so do not do a deep copy.
Add skb_orphan_frags_rx for paths that may loop packets to receive
sockets. That is not allowed, as it may cause unbounded latency.
Deep copy all zerocopy copy buffers, ref-counted or not, in this path.
The exact locations to modify were chosen by exhaustively searching
through all code that might modify skb_frag references and/or the
the SKBTX_DEV_ZEROCOPY tx_flags bit.
The changes err on the safe side, in two ways.
(1) legacy ubuf_info paths virtio and tap are not modified. They keep
a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
still call skb_copy_ubufs and thus copy frags in this case.
(2) not all copies deep in the stack are addressed yet. skb_shift,
skb_split and skb_try_coalesce can be refined to avoid copying.
These are not in the hot path and this patch is hairy enough as
is, so that is left for future refinement.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 20:29:41 +00:00
|
|
|
if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
|
2012-07-20 09:23:17 +00:00
|
|
|
return -ENOMEM;
|
2017-06-30 10:07:58 +00:00
|
|
|
refcount_inc(&skb->users);
|
2010-12-15 19:57:25 +00:00
|
|
|
return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
|
|
|
|
}
|
|
|
|
|
2015-01-27 19:35:48 +00:00
|
|
|
static inline void deliver_ptype_list_skb(struct sk_buff *skb,
|
|
|
|
struct packet_type **pt,
|
2015-03-30 14:56:01 +00:00
|
|
|
struct net_device *orig_dev,
|
|
|
|
__be16 type,
|
2015-01-27 19:35:48 +00:00
|
|
|
struct list_head *ptype_list)
|
|
|
|
{
|
|
|
|
struct packet_type *ptype, *pt_prev = *pt;
|
|
|
|
|
|
|
|
list_for_each_entry_rcu(ptype, ptype_list, list) {
|
|
|
|
if (ptype->type != type)
|
|
|
|
continue;
|
|
|
|
if (pt_prev)
|
2015-03-30 14:56:01 +00:00
|
|
|
deliver_skb(skb, pt_prev, orig_dev);
|
2015-01-27 19:35:48 +00:00
|
|
|
pt_prev = ptype;
|
|
|
|
}
|
|
|
|
*pt = pt_prev;
|
|
|
|
}
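The two helpers above share one idea: delivery to a matching handler is deferred until the next match is found, so that the last handler can receive the buffer without an extra reference being taken. A minimal user-space sketch of that pattern follows. The `struct handler` and `struct packet` types and all function names are hypothetical stand-ins for illustration, not kernel APIs, and no references are ever dropped here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct packet_type and struct sk_buff. */
struct handler {
	int type;      /* protocol this handler accepts */
	int delivered; /* number of packets it has seen */
};

struct packet {
	int refs;      /* reference count, 1 while only the caller holds it */
};

/* Shared delivery: take an extra reference first, as deliver_skb() does. */
static void deliver_shared(struct packet *p, struct handler *h)
{
	assert(p->refs > 0);
	p->refs++;
	h->delivered++;
}

/* Final delivery: hand over the caller's own reference, no increment. */
static void deliver_final(struct packet *p, struct handler *h)
{
	assert(p->refs > 0);
	h->delivered++;
}

/*
 * Walk the handlers and deliver to every match except the most recent
 * one. The most recent match is returned so the caller can either keep
 * walking another list or finish with deliver_final().
 */
static struct handler *deliver_list(struct packet *p, struct handler *hs,
				    size_t n, int type, struct handler *prev)
{
	for (size_t i = 0; i < n; i++) {
		if (hs[i].type != type)
			continue;
		if (prev)
			deliver_shared(p, prev);
		prev = &hs[i];
	}
	return prev;
}
```

With two matching handlers, only the first one costs an extra reference; the second receives the buffer through the final call.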
|
|
|
|
|
2012-08-16 22:02:58 +00:00
|
|
|
static inline bool skb_loop_sk(struct packet_type *ptype, struct sk_buff *skb)
|
|
|
|
{
|
2012-11-06 02:10:10 +00:00
|
|
|
if (!ptype->af_packet_priv || !skb->sk)
|
2012-08-16 22:02:58 +00:00
|
|
|
return false;
|
|
|
|
|
|
|
|
if (ptype->id_match)
|
|
|
|
return ptype->id_match(ptype, skb->sk);
|
|
|
|
else if ((struct sock *)ptype->af_packet_priv == skb->sk)
|
|
|
|
return true;
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* Support routine. Sends outgoing frames to any network
|
|
|
|
* taps currently in use.
|
|
|
|
*/
|
|
|
|
|
2016-05-10 18:19:50 +00:00
|
|
|
void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
struct packet_type *ptype;
|
2010-12-15 19:57:25 +00:00
|
|
|
struct sk_buff *skb2 = NULL;
|
|
|
|
struct packet_type *pt_prev = NULL;
|
2015-01-27 19:35:48 +00:00
|
|
|
struct list_head *ptype_list = &ptype_all;
|
2005-08-15 00:24:31 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
rcu_read_lock();
|
2015-01-27 19:35:48 +00:00
|
|
|
again:
|
|
|
|
list_for_each_entry_rcu(ptype, ptype_list, list) {
|
2005-04-16 22:20:36 +00:00
|
|
|
/* Never send packets back to the socket
|
|
|
|
* they originated from - MvS (miquels@drinkel.ow.org)
|
|
|
|
*/
|
2015-01-27 19:35:48 +00:00
|
|
|
if (skb_loop_sk(ptype, skb))
|
|
|
|
continue;
|
2010-12-15 19:57:25 +00:00
|
|
|
|
2015-01-27 19:35:48 +00:00
|
|
|
if (pt_prev) {
|
|
|
|
deliver_skb(skb2, pt_prev, skb->dev);
|
|
|
|
pt_prev = ptype;
|
|
|
|
continue;
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2015-01-27 19:35:48 +00:00
|
|
|
/* need to clone skb, done only once */
|
|
|
|
skb2 = skb_clone(skb, GFP_ATOMIC);
|
|
|
|
if (!skb2)
|
|
|
|
goto out_unlock;
|
2010-12-20 21:22:51 +00:00
|
|
|
|
2015-01-27 19:35:48 +00:00
|
|
|
net_timestamp_set(skb2);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2015-01-27 19:35:48 +00:00
|
|
|
/* skb->nh should be correctly
|
|
|
|
* set by sender, so that the second statement is
|
|
|
|
* just protection against buggy protocols.
|
|
|
|
*/
|
|
|
|
skb_reset_mac_header(skb2);
|
|
|
|
|
|
|
|
if (skb_network_header(skb2) < skb2->data ||
|
|
|
|
skb_network_header(skb2) > skb_tail_pointer(skb2)) {
|
|
|
|
net_crit_ratelimited("protocol %04x is buggy, dev %s\n",
|
|
|
|
ntohs(skb2->protocol),
|
|
|
|
dev->name);
|
|
|
|
skb_reset_network_header(skb2);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2015-01-27 19:35:48 +00:00
|
|
|
|
|
|
|
skb2->transport_header = skb2->network_header;
|
|
|
|
skb2->pkt_type = PACKET_OUTGOING;
|
|
|
|
pt_prev = ptype;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (ptype_list == &ptype_all) {
|
|
|
|
ptype_list = &dev->ptype_all;
|
|
|
|
goto again;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2015-01-27 19:35:48 +00:00
|
|
|
out_unlock:
|
2017-09-22 23:42:37 +00:00
|
|
|
if (pt_prev) {
|
|
|
|
if (!skb_orphan_frags_rx(skb2, GFP_ATOMIC))
|
|
|
|
pt_prev->func(skb2, skb->dev, pt_prev, skb->dev);
|
|
|
|
else
|
|
|
|
kfree_skb(skb2);
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
2016-05-10 18:19:50 +00:00
|
|
|
EXPORT_SYMBOL_GPL(dev_queue_xmit_nit);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2012-07-10 10:55:09 +00:00
|
|
|
/**
|
|
|
|
* netif_setup_tc - Handle tc mappings on real_num_tx_queues change
|
2011-01-17 08:06:04 +00:00
|
|
|
* @dev: Network device
|
|
|
|
* @txq: number of queues available
|
|
|
|
*
|
|
|
|
* If real_num_tx_queues is changed the tc mappings may no longer be
|
|
|
|
* valid. To resolve this verify the tc mapping remains valid and if
|
|
|
|
* not NULL the mapping. With no priorities mapping to this
|
|
|
|
* offset/count pair it will no longer be used. In the worst case TC0
|
|
|
|
* is invalid, nothing can be done, so priority mappings are disabled. It is
|
|
|
|
* expected that drivers will fix this mapping if they can before
|
|
|
|
* calling netif_set_real_num_tx_queues.
|
|
|
|
*/
|
2011-01-20 19:18:08 +00:00
|
|
|
static void netif_setup_tc(struct net_device *dev, unsigned int txq)
|
2011-01-17 08:06:04 +00:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
struct netdev_tc_txq *tc = &dev->tc_to_txq[0];
|
|
|
|
|
|
|
|
/* If TC0 is invalidated disable TC mapping */
|
|
|
|
if (tc->offset + tc->count > txq) {
|
2012-02-01 10:54:43 +00:00
|
|
|
pr_warn("Number of in use tx queues changed invalidating tc mappings. Priority traffic classification disabled!\n");
|
2011-01-17 08:06:04 +00:00
|
|
|
dev->num_tc = 0;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Invalidated prio to tc mappings set to TC0 */
|
|
|
|
for (i = 1; i < TC_BITMASK + 1; i++) {
|
|
|
|
int q = netdev_get_prio_tc_map(dev, i);
|
|
|
|
|
|
|
|
tc = &dev->tc_to_txq[q];
|
|
|
|
if (tc->offset + tc->count > txq) {
|
2012-02-01 10:54:43 +00:00
|
|
|
pr_warn("Number of in use tx queues changed. Priority %i to tc mapping %i is no longer valid. Setting map to 0\n",
|
|
|
|
i, q);
|
2011-01-17 08:06:04 +00:00
|
|
|
netdev_set_prio_tc_map(dev, i, 0);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
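The revalidation logic of netif_setup_tc() can be sketched outside the kernel as follows. This is an illustration under stated assumptions: `struct tc_config`, `range_fits()` and `revalidate_tc()` are hypothetical names, and the constants `NUM_PRIO` and `MAX_TC` are assumed to mirror `TC_BITMASK + 1` and `TC_MAX_QUEUE`.

```c
#include <assert.h>

#define NUM_PRIO 16 /* assumption: mirrors TC_BITMASK + 1 */
#define MAX_TC   16 /* assumption: mirrors TC_MAX_QUEUE */

struct tc_range {
	unsigned int offset;
	unsigned int count;
};

/* Hypothetical container for the fields netif_setup_tc() touches. */
struct tc_config {
	unsigned int num_tc;
	struct tc_range tc_to_txq[MAX_TC];
	unsigned char prio_tc_map[NUM_PRIO];
};

/* A range is valid while it ends at or before the number of queues. */
static int range_fits(const struct tc_range *r, unsigned int txq)
{
	return r->offset + r->count <= txq;
}

/*
 * Returns -1 if TC0 no longer fits and TC mapping was disabled,
 * otherwise the number of priorities that were remapped to TC0.
 */
static int revalidate_tc(struct tc_config *cfg, unsigned int txq)
{
	int remapped = 0;

	if (!range_fits(&cfg->tc_to_txq[0], txq)) {
		cfg->num_tc = 0;
		return -1;
	}

	for (int prio = 1; prio < NUM_PRIO; prio++) {
		unsigned int tc = cfg->prio_tc_map[prio];

		assert(tc < MAX_TC);
		if (!range_fits(&cfg->tc_to_txq[tc], txq)) {
			cfg->prio_tc_map[prio] = 0;
			remapped++;
		}
	}
	return remapped;
}
```

Shrinking the queue count below the end of a non-zero class only remaps the affected priorities; shrinking it below the end of TC0 disables classification altogether.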
|
|
|
|
|
2016-10-28 15:43:49 +00:00
|
|
|
int netdev_txq_to_tc(struct net_device *dev, unsigned int txq)
|
|
|
|
{
|
|
|
|
if (dev->num_tc) {
|
|
|
|
struct netdev_tc_txq *tc = &dev->tc_to_txq[0];
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < TC_MAX_QUEUE; i++, tc++) {
|
|
|
|
if ((txq - tc->offset) < tc->count)
|
|
|
|
return i;
|
|
|
|
}
|
|
|
|
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
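The test `(txq - tc->offset) < tc->count` in netdev_txq_to_tc() checks both bounds of a range with one unsigned comparison. A small sketch of the idiom, using a hypothetical helper name `in_range()`:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * True when offset <= x < offset + count.
 * If x is below offset, the unsigned subtraction wraps around to a very
 * large value, which can never be smaller than count, so the lower bound
 * is checked implicitly.
 */
static bool in_range(unsigned int x, unsigned int offset, unsigned int count)
{
	return (x - offset) < count;
}
```

The idiom relies on the subtraction being performed in an unsigned type; with signed operands the wrap-around would not be well defined.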
|
|
|
|
|
2013-01-10 08:57:02 +00:00
|
|
|
#ifdef CONFIG_XPS
|
|
|
|
static DEFINE_MUTEX(xps_map_mutex);
|
|
|
|
#define xmap_dereference(P) \
|
|
|
|
rcu_dereference_protected((P), lockdep_is_held(&xps_map_mutex))
|
|
|
|
|
2016-10-28 15:46:49 +00:00
|
|
|
static bool remove_xps_queue(struct xps_dev_maps *dev_maps,
|
|
|
|
int tci, u16 index)
|
2013-01-10 08:57:02 +00:00
|
|
|
{
|
2013-01-10 08:57:17 +00:00
|
|
|
struct xps_map *map = NULL;
|
|
|
|
int pos;
|
2013-01-10 08:57:02 +00:00
|
|
|
|
2013-01-10 08:57:17 +00:00
|
|
|
if (dev_maps)
|
2016-10-28 15:46:49 +00:00
|
|
|
map = xmap_dereference(dev_maps->cpu_map[tci]);
|
|
|
|
if (!map)
|
|
|
|
return false;
|
2013-01-10 08:57:02 +00:00
|
|
|
|
2016-10-28 15:46:49 +00:00
|
|
|
for (pos = map->len; pos--;) {
|
|
|
|
if (map->queues[pos] != index)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (map->len > 1) {
|
|
|
|
map->queues[pos] = map->queues[--map->len];
|
2013-01-10 08:57:17 +00:00
|
|
|
break;
|
2013-01-10 08:57:02 +00:00
|
|
|
}
|
2016-10-28 15:46:49 +00:00
|
|
|
|
|
|
|
RCU_INIT_POINTER(dev_maps->cpu_map[tci], NULL);
|
|
|
|
kfree_rcu(map, rcu);
|
|
|
|
return false;
|
2013-01-10 08:57:02 +00:00
|
|
|
}
|
|
|
|
|
2016-10-28 15:46:49 +00:00
|
|
|
return true;
|
2013-01-10 08:57:17 +00:00
|
|
|
}
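remove_xps_queue() deletes an entry from an unordered array by overwriting it with the last element and shrinking the length. The sketch below shows just that step on a plain array; `swap_remove()` is a hypothetical name. Unlike the kernel function, it does not free the container when the last entry goes away, and its return value means "found" rather than "map still in use".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Remove one occurrence of value from items[0..*len), scanning from the
 * end. Element order is not preserved. Returns true if value was found.
 */
static bool swap_remove(unsigned short *items, size_t *len,
			unsigned short value)
{
	assert(items != NULL && len != NULL);

	for (size_t pos = *len; pos--;) {
		if (items[pos] != value)
			continue;
		/* Move the last element into the hole and drop the tail. */
		items[pos] = items[--*len];
		return true;
	}
	return false;
}
```

Removal is constant time once the entry is located, which is why the approach suits a set of queue indices where order carries no meaning.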
|
|
|
|
|
2016-10-28 15:46:49 +00:00
|
|
|
static bool remove_xps_queue_cpu(struct net_device *dev,
|
|
|
|
struct xps_dev_maps *dev_maps,
|
|
|
|
int cpu, u16 offset, u16 count)
|
|
|
|
{
|
2016-10-28 15:50:13 +00:00
|
|
|
int num_tc = dev->num_tc ? : 1;
|
|
|
|
bool active = false;
|
|
|
|
int tci;
|
2016-10-28 15:46:49 +00:00
|
|
|
|
2016-10-28 15:50:13 +00:00
|
|
|
for (tci = cpu * num_tc; num_tc--; tci++) {
|
|
|
|
int i, j;
|
|
|
|
|
|
|
|
for (i = count, j = offset; i--; j++) {
|
|
|
|
if (!remove_xps_queue(dev_maps, tci, j))
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
active |= i < 0;
|
2016-10-28 15:46:49 +00:00
|
|
|
}
|
|
|
|
|
2016-10-28 15:50:13 +00:00
|
|
|
return active;
|
2016-10-28 15:46:49 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
|
|
|
|
u16 count)
|
2013-01-10 08:57:17 +00:00
|
|
|
{
|
|
|
|
struct xps_dev_maps *dev_maps;
|
2013-01-10 08:57:46 +00:00
|
|
|
int cpu, i;
|
2013-01-10 08:57:17 +00:00
|
|
|
bool active = false;
|
|
|
|
|
|
|
|
mutex_lock(&xps_map_mutex);
|
|
|
|
dev_maps = xmap_dereference(dev->xps_maps);
|
|
|
|
|
|
|
|
if (!dev_maps)
|
|
|
|
goto out_no_maps;
|
|
|
|
|
2016-10-28 15:46:49 +00:00
|
|
|
for_each_possible_cpu(cpu)
|
|
|
|
active |= remove_xps_queue_cpu(dev, dev_maps, cpu,
|
|
|
|
offset, count);
|
2013-01-10 08:57:17 +00:00
|
|
|
|
|
|
|
if (!active) {
|
2013-01-10 08:57:02 +00:00
|
|
|
RCU_INIT_POINTER(dev->xps_maps, NULL);
|
|
|
|
kfree_rcu(dev_maps, rcu);
|
|
|
|
}
|
|
|
|
|
2016-10-28 15:46:49 +00:00
|
|
|
for (i = offset + (count - 1); count--; i--)
|
2013-01-10 08:57:46 +00:00
|
|
|
netdev_queue_numa_node_write(netdev_get_tx_queue(dev, i),
|
|
|
|
NUMA_NO_NODE);
|
|
|
|
|
2013-01-10 08:57:02 +00:00
|
|
|
out_no_maps:
|
|
|
|
mutex_unlock(&xps_map_mutex);
|
|
|
|
}
|
|
|
|
|
2016-10-28 15:46:49 +00:00
|
|
|
static void netif_reset_xps_queues_gt(struct net_device *dev, u16 index)
|
|
|
|
{
|
|
|
|
netif_reset_xps_queues(dev, index, dev->num_tx_queues - index);
|
|
|
|
}
|
|
|
|
|
2013-01-10 08:57:35 +00:00
|
|
|
static struct xps_map *expand_xps_map(struct xps_map *map,
|
|
|
|
int cpu, u16 index)
|
|
|
|
{
|
|
|
|
struct xps_map *new_map;
|
|
|
|
int alloc_len = XPS_MIN_MAP_ALLOC;
|
|
|
|
int i, pos;
|
|
|
|
|
|
|
|
for (pos = 0; map && pos < map->len; pos++) {
|
|
|
|
if (map->queues[pos] != index)
|
|
|
|
continue;
|
|
|
|
return map;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Need to add queue to this CPU's existing map */
|
|
|
|
if (map) {
|
|
|
|
if (pos < map->alloc_len)
|
|
|
|
return map;
|
|
|
|
|
|
|
|
alloc_len = map->alloc_len * 2;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Need to allocate new map to store queue on this CPU's map */
|
|
|
|
new_map = kzalloc_node(XPS_MAP_SIZE(alloc_len), GFP_KERNEL,
|
|
|
|
cpu_to_node(cpu));
|
|
|
|
if (!new_map)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
for (i = 0; i < pos; i++)
|
|
|
|
new_map->queues[i] = map->queues[i];
|
|
|
|
new_map->alloc_len = alloc_len;
|
|
|
|
new_map->len = pos;
|
|
|
|
|
|
|
|
return new_map;
|
|
|
|
}
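expand_xps_map() follows a reserve-then-append scheme: it returns the existing map if the queue is already present or a free slot remains, and otherwise allocates a map of twice the capacity and copies the entries across. A user-space sketch, assuming hypothetical names (`struct qmap`, `qmap_reserve()`, `qmap_add()`) and `calloc()` in place of `kzalloc_node()`:

```c
#include <assert.h>
#include <stdlib.h>

#define MIN_ALLOC 4 /* hypothetical; plays the role of XPS_MIN_MAP_ALLOC */

struct qmap {
	unsigned int len;        /* entries in use */
	unsigned int alloc_len;  /* entries allocated */
	unsigned short queues[]; /* flexible array member */
};

/*
 * Ensure there is room to store value. Returns map itself when value is
 * already present or a free slot exists, a newly allocated larger map
 * otherwise, or NULL if allocation fails. The old map is not freed here,
 * so the caller decides when it is safe to release it.
 */
static struct qmap *qmap_reserve(struct qmap *map, unsigned short value)
{
	unsigned int alloc_len = MIN_ALLOC;
	unsigned int pos;
	struct qmap *bigger;

	for (pos = 0; map && pos < map->len; pos++)
		if (map->queues[pos] == value)
			return map;

	if (map) {
		if (pos < map->alloc_len)
			return map;
		alloc_len = map->alloc_len * 2;
	}

	bigger = calloc(1, sizeof(*bigger) +
			   alloc_len * sizeof(bigger->queues[0]));
	if (!bigger)
		return NULL;

	for (unsigned int i = 0; i < pos; i++)
		bigger->queues[i] = map->queues[i];
	bigger->alloc_len = alloc_len;
	bigger->len = pos;
	return bigger;
}

/* Append value; the caller must have reserved space beforehand. */
static void qmap_add(struct qmap *map, unsigned short value)
{
	assert(map->len < map->alloc_len);
	map->queues[map->len++] = value;
}
```

Keeping the old map alive after a reallocation matters in the kernel because readers may still hold it under RCU; the sketch leaves the release to the caller for the same reason.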
|
|
|
|
|
2013-10-02 06:14:06 +00:00
|
|
|
int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
|
|
|
|
u16 index)
|
2013-01-10 08:57:02 +00:00
|
|
|
{
|
2013-01-10 08:57:35 +00:00
|
|
|
struct xps_dev_maps *dev_maps, *new_dev_maps = NULL;
|
2016-10-28 15:50:13 +00:00
|
|
|
int i, cpu, tci, numa_node_id = -2;
|
|
|
|
int maps_sz, num_tc = 1, tc = 0;
|
2013-01-10 08:57:02 +00:00
|
|
|
struct xps_map *map, *new_map;
|
2013-01-10 08:57:35 +00:00
|
|
|
bool active = false;
|
2013-01-10 08:57:02 +00:00
|
|
|
|
2016-10-28 15:50:13 +00:00
|
|
|
if (dev->num_tc) {
|
|
|
|
num_tc = dev->num_tc;
|
|
|
|
tc = netdev_txq_to_tc(dev, index);
|
|
|
|
if (tc < 0)
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
maps_sz = XPS_DEV_MAPS_SIZE(num_tc);
|
|
|
|
if (maps_sz < L1_CACHE_BYTES)
|
|
|
|
maps_sz = L1_CACHE_BYTES;
|
|
|
|
|
2013-01-10 08:57:02 +00:00
|
|
|
mutex_lock(&xps_map_mutex);
|
|
|
|
|
|
|
|
dev_maps = xmap_dereference(dev->xps_maps);
|
|
|
|
|
2013-01-10 08:57:35 +00:00
|
|
|
/* allocate memory for queue storage */
|
2016-10-28 15:50:13 +00:00
|
|
|
for_each_cpu_and(cpu, cpu_online_mask, mask) {
|
2013-01-10 08:57:35 +00:00
|
|
|
if (!new_dev_maps)
|
|
|
|
new_dev_maps = kzalloc(maps_sz, GFP_KERNEL);
|
2013-02-22 06:38:44 +00:00
|
|
|
if (!new_dev_maps) {
|
|
|
|
mutex_unlock(&xps_map_mutex);
|
2013-01-10 08:57:35 +00:00
|
|
|
return -ENOMEM;
|
2013-02-22 06:38:44 +00:00
|
|
|
}
|
2013-01-10 08:57:35 +00:00
|
|
|
|
2016-10-28 15:50:13 +00:00
|
|
|
tci = cpu * num_tc + tc;
|
|
|
|
map = dev_maps ? xmap_dereference(dev_maps->cpu_map[tci]) :
|
2013-01-10 08:57:35 +00:00
|
|
|
NULL;
|
|
|
|
|
|
|
|
map = expand_xps_map(map, cpu, index);
|
|
|
|
if (!map)
|
|
|
|
goto error;
|
|
|
|
|
2016-10-28 15:50:13 +00:00
|
|
|
RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
|
2013-01-10 08:57:35 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (!new_dev_maps)
|
|
|
|
goto out_no_new_maps;
|
|
|
|
|
2013-01-10 08:57:02 +00:00
|
|
|
for_each_possible_cpu(cpu) {
|
2016-10-28 15:50:13 +00:00
|
|
|
/* copy maps belonging to foreign traffic classes */
|
|
|
|
for (i = tc, tci = cpu * num_tc; dev_maps && i--; tci++) {
|
|
|
|
/* fill in the new device map from the old device map */
|
|
|
|
map = xmap_dereference(dev_maps->cpu_map[tci]);
|
|
|
|
RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* We need to explicitly update tci as previous loop
|
|
|
|
* could break out early if dev_maps is NULL.
|
|
|
|
*/
|
|
|
|
tci = cpu * num_tc + tc;
|
|
|
|
|
2013-01-10 08:57:35 +00:00
|
|
|
if (cpumask_test_cpu(cpu, mask) && cpu_online(cpu)) {
|
|
|
|
/* add queue to CPU maps */
|
|
|
|
int pos = 0;
|
|
|
|
|
2016-10-28 15:50:13 +00:00
|
|
|
map = xmap_dereference(new_dev_maps->cpu_map[tci]);
|
2013-01-10 08:57:35 +00:00
|
|
|
while ((pos < map->len) && (map->queues[pos] != index))
|
|
|
|
pos++;
|
|
|
|
|
|
|
|
if (pos == map->len)
|
|
|
|
map->queues[map->len++] = index;
|
2013-01-10 08:57:02 +00:00
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
if (numa_node_id == -2)
|
|
|
|
numa_node_id = cpu_to_node(cpu);
|
|
|
|
else if (numa_node_id != cpu_to_node(cpu))
|
|
|
|
numa_node_id = -1;
|
|
|
|
#endif
|
2013-01-10 08:57:35 +00:00
|
|
|
} else if (dev_maps) {
|
|
|
|
/* fill in the new device map from the old device map */
|
2016-10-28 15:50:13 +00:00
|
|
|
map = xmap_dereference(dev_maps->cpu_map[tci]);
|
|
|
|
RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
|
2013-01-10 08:57:02 +00:00
|
|
|
}
|
2013-01-10 08:57:35 +00:00
|
|
|
|
2016-10-28 15:50:13 +00:00
|
|
|
/* copy maps belonging to foreign traffic classes */
|
|
|
|
for (i = num_tc - tc, tci++; dev_maps && --i; tci++) {
|
|
|
|
/* fill in the new device map from the old device map */
|
|
|
|
map = xmap_dereference(dev_maps->cpu_map[tci]);
|
|
|
|
RCU_INIT_POINTER(new_dev_maps->cpu_map[tci], map);
|
|
|
|
}
|
2013-01-10 08:57:02 +00:00
|
|
|
}
|
|
|
|
|
2013-01-10 08:57:35 +00:00
|
|
|
rcu_assign_pointer(dev->xps_maps, new_dev_maps);
|
|
|
|
|
2013-01-10 08:57:02 +00:00
|
|
|
/* Cleanup old maps */
|
2016-10-28 15:50:13 +00:00
|
|
|
if (!dev_maps)
|
|
|
|
goto out_no_old_maps;
|
|
|
|
|
|
|
|
for_each_possible_cpu(cpu) {
|
|
|
|
for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
|
|
|
|
new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
|
|
|
|
map = xmap_dereference(dev_maps->cpu_map[tci]);
|
2013-01-10 08:57:35 +00:00
|
|
|
if (map && map != new_map)
|
|
|
|
kfree_rcu(map, rcu);
|
|
|
|
}
|
2013-01-10 08:57:02 +00:00
|
|
|
}
|
|
|
|
|
2016-10-28 15:50:13 +00:00
|
|
|
kfree_rcu(dev_maps, rcu);
|
|
|
|
|
|
|
|
out_no_old_maps:
|
2013-01-10 08:57:35 +00:00
|
|
|
dev_maps = new_dev_maps;
|
|
|
|
active = true;
|
2013-01-10 08:57:02 +00:00
|
|
|
|
2013-01-10 08:57:35 +00:00
|
|
|
out_no_new_maps:
|
|
|
|
/* update Tx queue numa node */
|
2013-01-10 08:57:02 +00:00
|
|
|
netdev_queue_numa_node_write(netdev_get_tx_queue(dev, index),
|
|
|
|
(numa_node_id >= 0) ? numa_node_id :
|
|
|
|
NUMA_NO_NODE);
|
|
|
|
|
2013-01-10 08:57:35 +00:00
|
|
|
if (!dev_maps)
|
|
|
|
goto out_no_maps;
|
|
|
|
|
|
|
|
/* removes queue from unused CPUs */
|
|
|
|
for_each_possible_cpu(cpu) {
|
2016-10-28 15:50:13 +00:00
|
|
|
for (i = tc, tci = cpu * num_tc; i--; tci++)
|
|
|
|
active |= remove_xps_queue(dev_maps, tci, index);
|
|
|
|
if (!cpumask_test_cpu(cpu, mask) || !cpu_online(cpu))
|
|
|
|
active |= remove_xps_queue(dev_maps, tci, index);
|
|
|
|
for (i = num_tc - tc, tci++; --i; tci++)
|
|
|
|
active |= remove_xps_queue(dev_maps, tci, index);
|
2013-01-10 08:57:35 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/* free map if not active */
|
|
|
|
if (!active) {
|
|
|
|
RCU_INIT_POINTER(dev->xps_maps, NULL);
|
|
|
|
kfree_rcu(dev_maps, rcu);
|
|
|
|
}
|
|
|
|
|
|
|
|
out_no_maps:
|
2013-01-10 08:57:02 +00:00
|
|
|
mutex_unlock(&xps_map_mutex);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
error:
|
2013-01-10 08:57:35 +00:00
|
|
|
/* remove any maps that we added */
|
|
|
|
for_each_possible_cpu(cpu) {
|
2016-10-28 15:50:13 +00:00
|
|
|
for (i = num_tc, tci = cpu * num_tc; i--; tci++) {
|
|
|
|
new_map = xmap_dereference(new_dev_maps->cpu_map[tci]);
|
|
|
|
map = dev_maps ?
|
|
|
|
xmap_dereference(dev_maps->cpu_map[tci]) :
|
|
|
|
NULL;
|
|
|
|
if (new_map && new_map != map)
|
|
|
|
kfree(new_map);
|
|
|
|
}
|
2013-01-10 08:57:35 +00:00
|
|
|
}
|
|
|
|
|
2013-01-10 08:57:02 +00:00
|
|
|
mutex_unlock(&xps_map_mutex);
|
|
|
|
|
|
|
|
kfree(new_dev_maps);
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netif_set_xps_queue);
|
|
|
|
|
|
|
|
#endif
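netif_set_xps_queue() treats cpu_map as a flattened two-dimensional table indexed by `cpu * num_tc + tc`, and walks each CPU's row in three parts: the classes below tc, tc itself, and the classes above tc. The sketch below reproduces only the loop bounds to show that every slot of the row is visited exactly once; `tc_index()` and `walk_cpu_row()` are hypothetical helpers.

```c
#include <assert.h>

/* Index of (cpu, tc) in a table laid out as num_tc slots per CPU. */
static int tc_index(int cpu, int num_tc, int tc)
{
	assert(tc >= 0 && tc < num_tc);
	return cpu * num_tc + tc;
}

/* Visit every slot of one CPU's row using the same loop shapes. */
static void walk_cpu_row(int cpu, int num_tc, int tc, int *visits)
{
	int i, tci;

	/* Slots of the traffic classes below tc: tc iterations. */
	for (i = tc, tci = cpu * num_tc; i--; tci++)
		visits[tci]++;

	/* The slot of tc itself. */
	assert(tci == tc_index(cpu, num_tc, tc));
	visits[tci]++;

	/* Slots above tc: num_tc - tc - 1 iterations. */
	for (i = num_tc - tc, tci++; --i; tci++)
		visits[tci]++;
}
```

In the kernel function the first loop also stops when dev_maps is NULL, which leaves tci short of its target; that is why tci is recomputed explicitly before the middle step.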
|
2016-10-28 15:43:20 +00:00
|
|
|
void netdev_reset_tc(struct net_device *dev)
|
|
|
|
{
|
2016-10-28 15:46:49 +00:00
|
|
|
#ifdef CONFIG_XPS
|
|
|
|
netif_reset_xps_queues_gt(dev, 0);
|
|
|
|
#endif
|
2016-10-28 15:43:20 +00:00
|
|
|
dev->num_tc = 0;
|
|
|
|
memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq));
|
|
|
|
memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map));
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_reset_tc);
|
|
|
|
|
|
|
|
int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset)
|
|
|
|
{
|
|
|
|
if (tc >= dev->num_tc)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2016-10-28 15:46:49 +00:00
|
|
|
#ifdef CONFIG_XPS
|
|
|
|
netif_reset_xps_queues(dev, offset, count);
|
|
|
|
#endif
|
2016-10-28 15:43:20 +00:00
|
|
|
dev->tc_to_txq[tc].count = count;
|
|
|
|
dev->tc_to_txq[tc].offset = offset;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_set_tc_queue);
|
|
|
|
|
|
|
|
int netdev_set_num_tc(struct net_device *dev, u8 num_tc)
|
|
|
|
{
|
|
|
|
if (num_tc > TC_MAX_QUEUE)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2016-10-28 15:46:49 +00:00
|
|
|
#ifdef CONFIG_XPS
|
|
|
|
netif_reset_xps_queues_gt(dev, 0);
|
|
|
|
#endif
|
2016-10-28 15:43:20 +00:00
|
|
|
dev->num_tc = num_tc;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_set_num_tc);
|
|
|
|
|
2010-07-01 13:21:57 +00:00
|
|
|
/*
|
|
|
|
* Routine to help set real_num_tx_queues. To avoid skbs mapped to queues
|
|
|
|
* greater than real_num_tx_queues, stale skbs on the qdisc must be flushed.
|
|
|
|
*/
|
2010-10-18 18:04:39 +00:00
|
|
|
int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq)
|
2010-07-01 13:21:57 +00:00
|
|
|
{
|
2010-11-21 13:17:27 +00:00
|
|
|
int rc;
|
|
|
|
|
2010-10-18 18:04:39 +00:00
|
|
|
if (txq < 1 || txq > dev->num_tx_queues)
|
|
|
|
return -EINVAL;
|
2010-07-01 13:21:57 +00:00
|
|
|
|
2011-02-15 19:39:21 +00:00
|
|
|
if (dev->reg_state == NETREG_REGISTERED ||
|
|
|
|
dev->reg_state == NETREG_UNREGISTERING) {
|
2010-10-18 18:04:39 +00:00
|
|
|
ASSERT_RTNL();
|
|
|
|
|
2010-11-21 13:17:27 +00:00
|
|
|
rc = netdev_queue_update_kobjects(dev, dev->real_num_tx_queues,
|
|
|
|
txq);
|
2010-11-26 08:36:09 +00:00
|
|
|
if (rc)
|
|
|
|
return rc;
|
|
|
|
|
2011-01-17 08:06:04 +00:00
|
|
|
if (dev->num_tc)
|
|
|
|
netif_setup_tc(dev, txq);
|
|
|
|
|
2013-01-10 08:57:46 +00:00
|
|
|
if (txq < dev->real_num_tx_queues) {
|
2010-10-18 18:04:39 +00:00
|
|
|
qdisc_reset_all_tx_gt(dev, txq);
|
2013-01-10 08:57:46 +00:00
|
|
|
#ifdef CONFIG_XPS
|
|
|
|
netif_reset_xps_queues_gt(dev, txq);
|
|
|
|
#endif
|
|
|
|
}
|
2010-07-01 13:21:57 +00:00
|
|
|
}
|
2010-10-18 18:04:39 +00:00
|
|
|
|
|
|
|
dev->real_num_tx_queues = txq;
|
|
|
|
return 0;
|
2010-07-01 13:21:57 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netif_set_real_num_tx_queues);
|
2006-03-29 23:57:29 +00:00
|
|
|
|
2014-01-17 06:23:28 +00:00
|
|
|
#ifdef CONFIG_SYSFS
|
2010-09-27 08:24:33 +00:00
|
|
|
/**
|
|
|
|
* netif_set_real_num_rx_queues - set actual number of RX queues used
|
|
|
|
* @dev: Network device
|
|
|
|
* @rxq: Actual number of RX queues
|
|
|
|
*
|
|
|
|
* This must be called either with the rtnl_lock held or before
|
|
|
|
* registration of the net device. Returns 0 on success, or a
|
2010-10-08 17:33:39 +00:00
|
|
|
* negative error code. If called before registration, it always
|
|
|
|
* succeeds.
|
2010-09-27 08:24:33 +00:00
|
|
|
*/
|
|
|
|
int netif_set_real_num_rx_queues(struct net_device *dev, unsigned int rxq)
|
|
|
|
{
|
|
|
|
int rc;
|
|
|
|
|
2010-10-18 18:00:16 +00:00
|
|
|
if (rxq < 1 || rxq > dev->num_rx_queues)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2010-09-27 08:24:33 +00:00
|
|
|
if (dev->reg_state == NETREG_REGISTERED) {
|
|
|
|
ASSERT_RTNL();
|
|
|
|
|
|
|
|
rc = net_rx_queue_update_kobjects(dev, dev->real_num_rx_queues,
|
|
|
|
rxq);
|
|
|
|
if (rc)
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
|
|
|
dev->real_num_rx_queues = rxq;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netif_set_real_num_rx_queues);
|
|
|
|
#endif
|
|
|
|
|
2012-07-10 10:55:09 +00:00
|
|
|
/**
|
|
|
|
* netif_get_num_default_rss_queues - default number of RSS queues
|
2012-07-01 03:18:50 +00:00
|
|
|
*
|
|
|
|
* This routine should set an upper limit on the number of RSS queues
|
|
|
|
* used by default by multiqueue devices.
|
|
|
|
*/
|
2012-07-10 10:54:38 +00:00
|
|
|
int netif_get_num_default_rss_queues(void)
|
2012-07-01 03:18:50 +00:00
|
|
|
{
|
2016-06-08 12:39:08 +00:00
|
|
|
return is_kdump_kernel() ?
|
|
|
|
1 : min_t(int, DEFAULT_MAX_NUM_RSS_QUEUES, num_online_cpus());
|
2012-07-01 03:18:50 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netif_get_num_default_rss_queues);
|
|
|
|
|
net: get rid of spin_trylock() in net_tx_action()
Note: Tom Herbert posted almost same patch 3 months back, but for
different reasons.
The reasons we want to get rid of this spin_trylock() are :
1) Under high qdisc pressure, the spin_trylock() has almost no
chance to succeed.
2) We loop multiple times in softirq handler, eventually reaching
the max retry count (10), and we schedule ksoftirqd.
Since we want to adhere more strictly to ksoftirqd being waked up in
the future (https://lwn.net/Articles/687617/), better avoid spurious
wakeups.
3) calls to __netif_reschedule() dirty the cache line containing
q->next_sched, slowing down the owner of qdisc.
4) RT kernels can not use the spin_trylock() here.
With help of busylock, we get the qdisc spinlock fast enough, and
the trylock trick brings only performance penalty.
Depending on qdisc setup, I observed a gain of up to 19 % in qdisc
performance (1016600 pps instead of 853400 pps, using prio+tbf+fq_codel)
("mpstat -I SCPU 1" is much happier now)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-05 03:02:28 +00:00
|
|
|
static void __netif_reschedule(struct Qdisc *q)
|
2006-03-29 23:57:29 +00:00
|
|
|
{
|
2008-08-18 04:54:43 +00:00
|
|
|
struct softnet_data *sd;
|
|
|
|
unsigned long flags;
|
2006-03-29 23:57:29 +00:00
|
|
|
|
2008-08-18 04:54:43 +00:00
|
|
|
local_irq_save(flags);
|
2014-08-17 17:30:35 +00:00
|
|
|
sd = this_cpu_ptr(&softnet_data);
|
2010-04-26 23:06:24 +00:00
|
|
|
q->next_sched = NULL;
|
|
|
|
*sd->output_queue_tailp = q;
|
|
|
|
sd->output_queue_tailp = &q->next_sched;
|
2008-08-18 04:54:43 +00:00
|
|
|
raise_softirq_irqoff(NET_TX_SOFTIRQ);
|
|
|
|
local_irq_restore(flags);
|
|
|
|
}
|
|
|
|
|
|
|
|
void __netif_schedule(struct Qdisc *q)
|
|
|
|
{
|
|
|
|
if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state))
|
|
|
|
__netif_reschedule(q);
|
2006-03-29 23:57:29 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(__netif_schedule);
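__netif_reschedule() appends to a singly linked list through a pointer to the tail's next field, which makes the append constant time and removes the empty-list special case. A single-threaded sketch follows; the type and function names are hypothetical, and a plain bool replaces the atomic `test_and_set_bit()` on `__QDISC_STATE_SCHED`, so it is not safe for concurrent use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
	int id;
	bool scheduled;    /* stand-in for the __QDISC_STATE_SCHED bit */
	struct node *next;
};

struct queue {
	struct node *head;
	struct node **tailp; /* address of the pointer to fill on append */
};

static void queue_init(struct queue *q)
{
	q->head = NULL;
	q->tailp = &q->head; /* empty list: the next append sets head */
}

static void queue_append(struct queue *q, struct node *n)
{
	n->next = NULL;
	*q->tailp = n;       /* link after the current tail, or as head */
	q->tailp = &n->next; /* the new node is now the tail */
}

/* Queue the node only if it is not already queued. */
static bool schedule_once(struct queue *q, struct node *n)
{
	if (n->scheduled)
		return false;
	n->scheduled = true;
	queue_append(q, n);
	return true;
}
```

The guard in `schedule_once()` is what keeps a node from being linked twice, which would corrupt the list.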
|
|
|
|
|
2013-12-05 12:45:08 +00:00
|
|
|
struct dev_kfree_skb_cb {
|
|
|
|
enum skb_free_reason reason;
|
|
|
|
};
|
|
|
|
|
|
|
|
static struct dev_kfree_skb_cb *get_kfree_skb_cb(const struct sk_buff *skb)
|
2006-03-29 23:57:29 +00:00
|
|
|
{
|
2013-12-05 12:45:08 +00:00
|
|
|
return (struct dev_kfree_skb_cb *)skb->cb;
|
|
|
|
}
|
|
|
|
|
2014-09-13 03:04:52 +00:00
|
|
|
void netif_schedule_queue(struct netdev_queue *txq)
|
|
|
|
{
|
|
|
|
rcu_read_lock();
|
|
|
|
if (!(txq->state & QUEUE_STATE_ANY_XOFF)) {
|
|
|
|
struct Qdisc *q = rcu_dereference(txq->qdisc);
|
|
|
|
|
|
|
|
__netif_schedule(q);
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netif_schedule_queue);
|
|
|
|
|
|
|
|
void netif_tx_wake_queue(struct netdev_queue *dev_queue)
|
|
|
|
{
|
|
|
|
if (test_and_clear_bit(__QUEUE_STATE_DRV_XOFF, &dev_queue->state)) {
|
|
|
|
struct Qdisc *q;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
q = rcu_dereference(dev_queue->qdisc);
|
|
|
|
__netif_schedule(q);
|
|
|
|
rcu_read_unlock();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netif_tx_wake_queue);
|
|
|
|
|
2013-12-05 12:45:08 +00:00
|
|
|
void __dev_kfree_skb_irq(struct sk_buff *skb, enum skb_free_reason reason)
|
2006-03-29 23:57:29 +00:00
|
|
|
{
|
2013-12-05 12:45:08 +00:00
|
|
|
unsigned long flags;
|
2006-03-29 23:57:29 +00:00
|
|
|
|
2017-04-25 18:58:15 +00:00
|
|
|
if (unlikely(!skb))
|
|
|
|
return;
|
|
|
|
|
2017-06-30 10:07:58 +00:00
|
|
|
if (likely(refcount_read(&skb->users) == 1)) {
|
2013-12-05 12:45:08 +00:00
|
|
|
smp_rmb();
|
2017-06-30 10:07:58 +00:00
|
|
|
refcount_set(&skb->users, 0);
|
|
|
|
} else if (likely(!refcount_dec_and_test(&skb->users))) {
|
2013-12-05 12:45:08 +00:00
|
|
|
return;
|
[NET]: Make NAPI polling independent of struct net_device objects.
Several devices have multiple independent RX queues per net
device, and some have a single interrupt doorbell for several
queues.
In either case, it's easier to support layouts like that if the
structure representing the poll is independent from the net
device itself.
The signature of the ->poll() call back goes from:
int foo_poll(struct net_device *dev, int *budget)
to
int foo_poll(struct napi_struct *napi, int budget)
The caller is returned the number of RX packets processed (or
the number of "NAPI credits" consumed if you want to get
abstract). The callee no longer messes around bumping
dev->quota, *budget, etc. because that is all handled in the
caller upon return.
The napi_struct is to be embedded in the device driver private data
structures.
Furthermore, it is the driver's responsibility to disable all NAPI
instances in its ->stop() device close handler. Since the
napi_struct is privatized into the driver's private data structures,
only the driver knows how to get at all of the napi_struct instances
it may have per-device.
With lots of help and suggestions from Rusty Russell, Roland Dreier,
Michael Chan, Jeff Garzik, and Jamal Hadi Salim.
Bug fixes from Thomas Graf, Roland Dreier, Peter Zijlstra,
Joseph Fannin, Scott Wood, Hans J. Koch, and Michael Chan.
[ Ported to current tree and all drivers converted. Integrated
Stephen's follow-on kerneldoc additions, and restored poll_list
handling to the old style to fix mutual exclusion issues. -DaveM ]
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-03 23:41:36 +00:00
|
|
|
}
|
2013-12-05 12:45:08 +00:00
|
|
|
get_kfree_skb_cb(skb)->reason = reason;
|
|
|
|
local_irq_save(flags);
|
|
|
|
skb->next = __this_cpu_read(softnet_data.completion_queue);
|
|
|
|
__this_cpu_write(softnet_data.completion_queue, skb);
|
|
|
|
raise_softirq_irqoff(NET_TX_SOFTIRQ);
|
|
|
|
local_irq_restore(flags);
|
2006-03-29 23:57:29 +00:00
|
|
|
}
|
2013-12-05 12:45:08 +00:00
|
|
|
EXPORT_SYMBOL(__dev_kfree_skb_irq);
|
2006-03-29 23:57:29 +00:00
|
|
|
|
2013-12-05 12:45:08 +00:00
|
|
|
void __dev_kfree_skb_any(struct sk_buff *skb, enum skb_free_reason reason)
|
2006-03-29 23:57:29 +00:00
|
|
|
{
|
|
|
|
if (in_irq() || irqs_disabled())
|
2013-12-05 12:45:08 +00:00
|
|
|
__dev_kfree_skb_irq(skb, reason);
|
2006-03-29 23:57:29 +00:00
|
|
|
else
|
|
|
|
dev_kfree_skb(skb);
|
|
|
|
}
|
2013-12-05 12:45:08 +00:00
|
|
|
EXPORT_SYMBOL(__dev_kfree_skb_any);
|
2006-03-29 23:57:29 +00:00
|
|
|
|
|
|
|
|
2007-10-03 23:41:36 +00:00
|
|
|
/**
|
|
|
|
* netif_device_detach - mark device as removed
|
|
|
|
* @dev: network device
|
|
|
|
*
|
|
|
|
* Mark device as removed from system and therefore no longer available.
|
|
|
|
*/
|
2006-03-29 23:57:29 +00:00
|
|
|
void netif_device_detach(struct net_device *dev)
|
|
|
|
{
|
|
|
|
if (test_and_clear_bit(__LINK_STATE_PRESENT, &dev->state) &&
|
|
|
|
netif_running(dev)) {
|
2009-04-08 13:15:22 +00:00
|
|
|
netif_tx_stop_all_queues(dev);
|
2006-03-29 23:57:29 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netif_device_detach);
|
|
|
|
|
2007-10-03 23:41:36 +00:00
|
|
|
/**
|
|
|
|
* netif_device_attach - mark device as attached
|
|
|
|
* @dev: network device
|
|
|
|
*
|
|
|
|
* Mark device as attached to the system and restart if needed.
|
|
|
|
*/
|
2006-03-29 23:57:29 +00:00
|
|
|
void netif_device_attach(struct net_device *dev)
|
|
|
|
{
|
|
|
|
if (!test_and_set_bit(__LINK_STATE_PRESENT, &dev->state) &&
|
|
|
|
netif_running(dev)) {
|
2009-04-08 13:15:22 +00:00
|
|
|
netif_tx_wake_all_queues(dev);
|
2007-02-09 14:24:36 +00:00
|
|
|
__netdev_watchdog_up(dev);
|
2006-03-29 23:57:29 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netif_device_attach);
|
|
|
|
|
2015-05-12 12:56:12 +00:00
|
|
|
/*
|
|
|
|
* Returns a Tx hash based on the given packet descriptor and the number of Tx queues
|
|
|
|
* to be used as a distribution range.
|
|
|
|
*/
|
|
|
|
u16 __skb_tx_hash(const struct net_device *dev, struct sk_buff *skb,
|
|
|
|
unsigned int num_tx_queues)
|
|
|
|
{
|
|
|
|
u32 hash;
|
|
|
|
u16 qoffset = 0;
|
|
|
|
u16 qcount = num_tx_queues;
|
|
|
|
|
|
|
|
if (skb_rx_queue_recorded(skb)) {
|
|
|
|
hash = skb_get_rx_queue(skb);
|
|
|
|
while (unlikely(hash >= num_tx_queues))
|
|
|
|
hash -= num_tx_queues;
|
|
|
|
return hash;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (dev->num_tc) {
|
|
|
|
u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
|
2017-02-09 06:56:07 +00:00
|
|
|
|
2015-05-12 12:56:12 +00:00
|
|
|
qoffset = dev->tc_to_txq[tc].offset;
|
|
|
|
qcount = dev->tc_to_txq[tc].count;
|
|
|
|
}
|
|
|
|
|
|
|
|
return (u16) reciprocal_scale(skb_get_hash(skb), qcount) + qoffset;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(__skb_tx_hash);
|
|
|
|
|
2012-01-17 07:57:56 +00:00
|
|
|
static void skb_warn_bad_offload(const struct sk_buff *skb)
|
|
|
|
{
|
2016-06-16 13:17:49 +00:00
|
|
|
static const netdev_features_t null_features;
|
2012-01-17 07:57:56 +00:00
|
|
|
struct net_device *dev = skb->dev;
|
2015-11-16 18:16:40 +00:00
|
|
|
const char *name = "";
|
2012-01-17 07:57:56 +00:00
|
|
|
|
2013-04-19 10:45:52 +00:00
|
|
|
if (!net_ratelimit())
|
|
|
|
return;
|
|
|
|
|
2015-11-16 18:16:40 +00:00
|
|
|
if (dev) {
|
|
|
|
if (dev->dev.parent)
|
|
|
|
name = dev_driver_string(dev->dev.parent);
|
|
|
|
else
|
|
|
|
name = netdev_name(dev);
|
|
|
|
}
|
2012-01-17 07:57:56 +00:00
|
|
|
WARN(1, "%s: caps=(%pNF, %pNF) len=%d data_len=%d gso_size=%d "
|
|
|
|
"gso_type=%d ip_summed=%d\n",
|
2015-11-16 18:16:40 +00:00
|
|
|
name, dev ? &dev->features : &null_features,
|
2012-01-17 10:00:40 +00:00
|
|
|
skb->sk ? &skb->sk->sk_route_caps : &null_features,
|
2012-01-17 07:57:56 +00:00
|
|
|
skb->len, skb->data_len, skb_shinfo(skb)->gso_size,
|
|
|
|
skb_shinfo(skb)->gso_type, skb->ip_summed);
|
|
|
|
}
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* Invalidate hardware checksum when packet is to be mangled, and
|
|
|
|
* complete checksum manually on outgoing path.
|
|
|
|
*/
|
2006-08-29 23:44:56 +00:00
|
|
|
int skb_checksum_help(struct sk_buff *skb)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2006-11-15 05:24:49 +00:00
|
|
|
__wsum csum;
|
2007-04-09 18:59:07 +00:00
|
|
|
int ret = 0, offset;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2006-08-29 23:44:56 +00:00
|
|
|
if (skb->ip_summed == CHECKSUM_COMPLETE)
|
2006-07-08 20:34:56 +00:00
|
|
|
goto out_set_summed;
|
|
|
|
|
|
|
|
if (unlikely(skb_shinfo(skb)->gso_size)) {
|
2012-01-17 07:57:56 +00:00
|
|
|
skb_warn_bad_offload(skb);
|
|
|
|
return -EINVAL;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2013-01-25 20:34:37 +00:00
|
|
|
/* Before computing a checksum, we should make sure no frag could
|
|
|
|
* be modified by an external entity : checksum could be wrong.
|
|
|
|
*/
|
|
|
|
if (skb_has_shared_frag(skb)) {
|
|
|
|
ret = __skb_linearize(skb);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2010-12-14 15:24:08 +00:00
|
|
|
offset = skb_checksum_start_offset(skb);
|
2007-10-15 08:47:15 +00:00
|
|
|
BUG_ON(offset >= skb_headlen(skb));
|
|
|
|
csum = skb_checksum(skb, offset, skb->len - offset, 0);
|
|
|
|
|
|
|
|
offset += skb->csum_offset;
|
|
|
|
BUG_ON(offset + sizeof(__sum16) > skb_headlen(skb));
|
|
|
|
|
|
|
|
if (skb_cloned(skb) &&
|
|
|
|
!skb_clone_writable(skb, offset + sizeof(__sum16))) {
|
2005-04-16 22:20:36 +00:00
|
|
|
ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2016-10-29 18:02:36 +00:00
|
|
|
*(__sum16 *)(skb->data + offset) = csum_fold(csum) ?: CSUM_MANGLED_0;
|
2006-07-08 20:34:56 +00:00
|
|
|
out_set_summed:
|
2005-04-16 22:20:36 +00:00
|
|
|
skb->ip_summed = CHECKSUM_NONE;
|
2007-02-09 14:24:36 +00:00
|
|
|
out:
|
2005-04-16 22:20:36 +00:00
|
|
|
return ret;
|
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(skb_checksum_help);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2017-05-18 13:44:38 +00:00
|
|
|
int skb_crc32c_csum_help(struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
__le32 crc32c_csum;
|
|
|
|
int ret = 0, offset, start;
|
|
|
|
|
|
|
|
if (skb->ip_summed != CHECKSUM_PARTIAL)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (unlikely(skb_is_gso(skb)))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
/* Before computing a checksum, we should make sure no frag could
|
|
|
|
* be modified by an external entity : checksum could be wrong.
|
|
|
|
*/
|
|
|
|
if (unlikely(skb_has_shared_frag(skb))) {
|
|
|
|
ret = __skb_linearize(skb);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
start = skb_checksum_start_offset(skb);
|
|
|
|
offset = start + offsetof(struct sctphdr, checksum);
|
|
|
|
if (WARN_ON_ONCE(offset >= skb_headlen(skb))) {
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
if (skb_cloned(skb) &&
|
|
|
|
!skb_clone_writable(skb, offset + sizeof(__le32))) {
|
|
|
|
ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
|
|
|
|
if (ret)
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
crc32c_csum = cpu_to_le32(~__skb_checksum(skb, start,
|
|
|
|
skb->len - start, ~(__u32)0,
|
|
|
|
crc32c_csum_stub));
|
|
|
|
*(__le32 *)(skb->data + offset) = crc32c_csum;
|
|
|
|
skb->ip_summed = CHECKSUM_NONE;
|
2017-05-18 13:44:40 +00:00
|
|
|
skb->csum_not_inet = 0;
|
2017-05-18 13:44:38 +00:00
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2014-03-27 21:26:18 +00:00
|
|
|
__be16 skb_network_protocol(struct sk_buff *skb, int *depth)
|
2006-06-22 09:57:17 +00:00
|
|
|
{
|
2006-11-15 04:48:11 +00:00
|
|
|
__be16 type = skb->protocol;
|
2006-06-22 09:57:17 +00:00
|
|
|
|
2013-05-07 20:41:07 +00:00
|
|
|
/* Tunnel gso handlers can set protocol to ethernet. */
|
|
|
|
if (type == htons(ETH_P_TEB)) {
|
|
|
|
struct ethhdr *eth;
|
|
|
|
|
|
|
|
if (unlikely(!pskb_may_pull(skb, sizeof(struct ethhdr))))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
eth = (struct ethhdr *)skb_mac_header(skb);
|
|
|
|
type = eth->h_proto;
|
|
|
|
}
|
|
|
|
|
2015-01-29 11:37:07 +00:00
|
|
|
return __vlan_get_protocol(skb, type, depth);
|
2013-03-07 09:28:01 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* skb_mac_gso_segment - mac layer segmentation handler.
|
|
|
|
* @skb: buffer to segment
|
|
|
|
* @features: features for the output path (see dev->features)
|
|
|
|
*/
|
|
|
|
struct sk_buff *skb_mac_gso_segment(struct sk_buff *skb,
|
|
|
|
netdev_features_t features)
|
|
|
|
{
|
|
|
|
struct sk_buff *segs = ERR_PTR(-EPROTONOSUPPORT);
|
|
|
|
struct packet_offload *ptype;
|
2014-03-27 21:26:18 +00:00
|
|
|
int vlan_depth = skb->mac_len;
|
|
|
|
__be16 type = skb_network_protocol(skb, &vlan_depth);
|
2013-03-07 09:28:01 +00:00
|
|
|
|
|
|
|
if (unlikely(!type))
|
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
|
2014-03-27 21:26:18 +00:00
|
|
|
__skb_pull(skb, vlan_depth);
|
2006-06-22 09:57:17 +00:00
|
|
|
|
|
|
|
rcu_read_lock();
|
2012-11-15 08:49:11 +00:00
|
|
|
list_for_each_entry_rcu(ptype, &offload_base, list) {
|
2012-11-15 08:49:23 +00:00
|
|
|
if (ptype->type == type && ptype->callbacks.gso_segment) {
|
|
|
|
segs = ptype->callbacks.gso_segment(skb, features);
|
2006-06-22 09:57:17 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
2007-03-19 22:33:04 +00:00
|
|
|
__skb_push(skb, skb->data - skb_mac_header(skb));
|
2006-06-27 20:22:38 +00:00
|
|
|
|
2006-06-22 09:57:17 +00:00
|
|
|
return segs;
|
|
|
|
}
|
2013-02-14 09:44:55 +00:00
|
|
|
EXPORT_SYMBOL(skb_mac_gso_segment);
|
|
|
|
|
|
|
|
|
|
|
|
/* openvswitch calls this on rx path, so we need a different check.
|
|
|
|
*/
|
|
|
|
static inline bool skb_needs_check(struct sk_buff *skb, bool tx_path)
|
|
|
|
{
|
|
|
|
if (tx_path)
|
2017-08-11 03:16:29 +00:00
|
|
|
return skb->ip_summed != CHECKSUM_PARTIAL;
|
2017-02-03 22:29:42 +00:00
|
|
|
|
|
|
|
return skb->ip_summed == CHECKSUM_NONE;
|
2013-02-14 09:44:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* __skb_gso_segment - Perform segmentation on skb.
|
|
|
|
* @skb: buffer to segment
|
|
|
|
* @features: features for the output path (see dev->features)
|
|
|
|
* @tx_path: whether it is called in TX path
|
|
|
|
*
|
|
|
|
* This function segments the given skb and returns a list of segments.
|
|
|
|
*
|
|
|
|
* It may return NULL if the skb requires no segmentation. This is
|
|
|
|
* only possible when GSO is used for verifying header integrity.
|
2016-01-08 12:21:46 +00:00
|
|
|
*
|
|
|
|
* Segmentation preserves SKB_SGO_CB_OFFSET bytes of previous skb cb.
|
2013-02-14 09:44:55 +00:00
|
|
|
*/
|
|
|
|
struct sk_buff *__skb_gso_segment(struct sk_buff *skb,
|
|
|
|
netdev_features_t features, bool tx_path)
|
|
|
|
{
|
2017-01-31 18:20:32 +00:00
|
|
|
struct sk_buff *segs;
|
|
|
|
|
2013-02-14 09:44:55 +00:00
|
|
|
if (unlikely(skb_needs_check(skb, tx_path))) {
|
|
|
|
int err;
|
|
|
|
|
2017-01-31 18:20:32 +00:00
|
|
|
/* We're going to init ->check field in TCP or UDP header */
|
2014-07-15 21:55:35 +00:00
|
|
|
err = skb_cow_head(skb, 0);
|
|
|
|
if (err < 0)
|
2013-02-14 09:44:55 +00:00
|
|
|
return ERR_PTR(err);
|
|
|
|
}
|
|
|
|
|
2016-04-11 01:45:03 +00:00
|
|
|
/* Only report GSO partial support if it will enable us to
|
|
|
|
* support segmentation on this frame without needing additional
|
|
|
|
* work.
|
|
|
|
*/
|
|
|
|
if (features & NETIF_F_GSO_PARTIAL) {
|
|
|
|
netdev_features_t partial_features = NETIF_F_GSO_ROBUST;
|
|
|
|
struct net_device *dev = skb->dev;
|
|
|
|
|
|
|
|
partial_features |= dev->features & dev->gso_partial_features;
|
|
|
|
if (!skb_gso_ok(skb, features | partial_features))
|
|
|
|
features &= ~NETIF_F_GSO_PARTIAL;
|
|
|
|
}
|
|
|
|
|
2016-01-08 12:21:46 +00:00
|
|
|
BUILD_BUG_ON(SKB_SGO_CB_OFFSET +
|
|
|
|
sizeof(*SKB_GSO_CB(skb)) > sizeof(skb->cb));
|
|
|
|
|
2013-02-14 14:02:41 +00:00
|
|
|
SKB_GSO_CB(skb)->mac_offset = skb_headroom(skb);
|
2013-10-19 18:42:56 +00:00
|
|
|
SKB_GSO_CB(skb)->encap_level = 0;
|
|
|
|
|
2013-02-14 09:44:55 +00:00
|
|
|
skb_reset_mac_header(skb);
|
|
|
|
skb_reset_mac_len(skb);
|
|
|
|
|
2017-01-31 18:20:32 +00:00
|
|
|
segs = skb_mac_gso_segment(skb, features);
|
|
|
|
|
|
|
|
if (unlikely(skb_needs_check(skb, tx_path)))
|
|
|
|
skb_warn_bad_offload(skb);
|
|
|
|
|
|
|
|
return segs;
|
2013-02-14 09:44:55 +00:00
|
|
|
}
|
2013-02-05 16:36:38 +00:00
|
|
|
EXPORT_SYMBOL(__skb_gso_segment);
|
2006-06-22 09:57:17 +00:00
|
|
|
|
2005-11-10 21:01:24 +00:00
|
|
|
/* Take action when hardware reception checksum errors are detected. */
|
|
|
|
#ifdef CONFIG_BUG
|
|
|
|
void netdev_rx_csum_fault(struct net_device *dev)
|
|
|
|
{
|
|
|
|
if (net_ratelimit()) {
|
2012-02-01 10:54:43 +00:00
|
|
|
pr_err("%s: hw csum failure\n", dev ? dev->name : "<unknown>");
|
2005-11-10 21:01:24 +00:00
|
|
|
dump_stack();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_rx_csum_fault);
|
|
|
|
#endif
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/* Actually, we should eliminate this check as soon as we know, that:
|
|
|
|
* 1. IOMMU is present and allows to map all the memory.
|
|
|
|
* 2. No high memory really exists on this machine.
|
|
|
|
*/
|
|
|
|
|
2014-05-05 13:00:44 +00:00
|
|
|
static int illegal_highdma(struct net_device *dev, struct sk_buff *skb)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2006-06-27 20:33:10 +00:00
|
|
|
#ifdef CONFIG_HIGHMEM
|
2005-04-16 22:20:36 +00:00
|
|
|
int i;
|
2017-02-09 06:56:07 +00:00
|
|
|
|
2010-03-30 22:35:50 +00:00
|
|
|
if (!(dev->features & NETIF_F_HIGHDMA)) {
|
2011-08-22 23:44:58 +00:00
|
|
|
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
|
|
|
|
skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
|
2017-02-09 06:56:07 +00:00
|
|
|
|
2011-08-22 23:44:58 +00:00
|
|
|
if (PageHighMem(skb_frag_page(frag)))
|
2010-03-30 22:35:50 +00:00
|
|
|
return 1;
|
2011-08-22 23:44:58 +00:00
|
|
|
}
|
2010-03-30 22:35:50 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2010-03-30 22:35:50 +00:00
|
|
|
if (PCI_DMA_BUS_IS_PHYS) {
|
|
|
|
struct device *pdev = dev->dev.parent;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2010-04-02 20:34:49 +00:00
|
|
|
if (!pdev)
|
|
|
|
return 0;
|
2010-03-30 22:35:50 +00:00
|
|
|
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
|
2011-08-22 23:44:58 +00:00
|
|
|
skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
|
|
|
|
dma_addr_t addr = page_to_phys(skb_frag_page(frag));
|
2017-02-09 06:56:07 +00:00
|
|
|
|
2010-03-30 22:35:50 +00:00
|
|
|
if (!pdev->dma_mask || addr + PAGE_SIZE - 1 > *pdev->dma_mask)
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
}
|
2006-06-27 20:33:10 +00:00
|
|
|
#endif
|
2005-04-16 22:20:36 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2014-06-03 23:53:17 +00:00
|
|
|
/* If MPLS offload request, verify we are testing hardware MPLS features
|
|
|
|
* instead of standard features for the netdev.
|
|
|
|
*/
|
2014-12-24 00:20:11 +00:00
|
|
|
#if IS_ENABLED(CONFIG_NET_MPLS_GSO)
|
2014-06-03 23:53:17 +00:00
|
|
|
static netdev_features_t net_mpls_features(struct sk_buff *skb,
|
|
|
|
netdev_features_t features,
|
|
|
|
__be16 type)
|
|
|
|
{
|
2014-10-06 12:05:13 +00:00
|
|
|
if (eth_p_mpls(type))
|
2014-06-03 23:53:17 +00:00
|
|
|
features &= skb->dev->mpls_features;
|
|
|
|
|
|
|
|
return features;
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
static netdev_features_t net_mpls_features(struct sk_buff *skb,
|
|
|
|
netdev_features_t features,
|
|
|
|
__be16 type)
|
|
|
|
{
|
|
|
|
return features;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2011-11-15 15:29:55 +00:00
|
|
|
static netdev_features_t harmonize_features(struct sk_buff *skb,
|
2014-05-05 13:00:44 +00:00
|
|
|
netdev_features_t features)
|
2011-01-09 06:23:31 +00:00
|
|
|
{
|
2014-03-27 21:26:18 +00:00
|
|
|
int tmp;
|
2014-06-03 23:53:17 +00:00
|
|
|
__be16 type;
|
|
|
|
|
|
|
|
type = skb_network_protocol(skb, &tmp);
|
|
|
|
features = net_mpls_features(skb, features, type);
|
2014-03-27 21:26:18 +00:00
|
|
|
|
2012-09-19 15:49:00 +00:00
|
|
|
if (skb->ip_summed != CHECKSUM_NONE &&
|
2014-06-03 23:53:17 +00:00
|
|
|
!can_checksum_protocol(features, type)) {
|
2016-05-02 16:25:10 +00:00
|
|
|
features &= ~(NETIF_F_CSUM_MASK | NETIF_F_GSO_MASK);
|
2011-01-09 06:23:31 +00:00
|
|
|
}
|
2017-01-18 20:12:17 +00:00
|
|
|
if (illegal_highdma(skb->dev, skb))
|
|
|
|
features &= ~NETIF_F_SG;
|
2011-01-09 06:23:31 +00:00
|
|
|
|
|
|
|
return features;
|
|
|
|
}
|
|
|
|
|
2015-03-27 05:31:13 +00:00
|
|
|
netdev_features_t passthru_features_check(struct sk_buff *skb,
|
|
|
|
struct net_device *dev,
|
|
|
|
netdev_features_t features)
|
|
|
|
{
|
|
|
|
return features;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(passthru_features_check);
|
|
|
|
|
2015-03-27 05:31:12 +00:00
|
|
|
static netdev_features_t dflt_features_check(const struct sk_buff *skb,
|
|
|
|
struct net_device *dev,
|
|
|
|
netdev_features_t features)
|
|
|
|
{
|
|
|
|
return vlan_features_check(skb, features);
|
|
|
|
}
|
|
|
|
|
2016-04-11 01:44:51 +00:00
|
|
|
static netdev_features_t gso_features_check(const struct sk_buff *skb,
|
|
|
|
struct net_device *dev,
|
|
|
|
netdev_features_t features)
|
|
|
|
{
|
|
|
|
u16 gso_segs = skb_shinfo(skb)->gso_segs;
|
|
|
|
|
|
|
|
if (gso_segs > dev->gso_max_segs)
|
|
|
|
return features & ~NETIF_F_GSO_MASK;
|
|
|
|
|
2016-04-11 01:45:03 +00:00
|
|
|
/* Support for GSO partial features requires software
|
|
|
|
* intervention before we can actually process the packets
|
|
|
|
* so we need to strip support for any partial features now
|
|
|
|
* and we can pull them back in after we have partially
|
|
|
|
* segmented the frame.
|
|
|
|
*/
|
|
|
|
if (!(skb_shinfo(skb)->gso_type & SKB_GSO_PARTIAL))
|
|
|
|
features &= ~dev->gso_partial_features;
|
|
|
|
|
|
|
|
/* Make sure to clear the IPv4 ID mangling feature if the
|
|
|
|
* IPv4 header has the potential to be fragmented.
|
2016-04-11 01:44:51 +00:00
|
|
|
*/
|
|
|
|
if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4) {
|
|
|
|
struct iphdr *iph = skb->encapsulation ?
|
|
|
|
inner_ip_hdr(skb) : ip_hdr(skb);
|
|
|
|
|
|
|
|
if (!(iph->frag_off & htons(IP_DF)))
|
|
|
|
features &= ~NETIF_F_TSO_MANGLEID;
|
|
|
|
}
|
|
|
|
|
|
|
|
return features;
|
|
|
|
}
|
|
|
|
|
2014-05-05 13:00:44 +00:00
|
|
|
netdev_features_t netif_skb_features(struct sk_buff *skb)
|
2010-10-29 12:14:55 +00:00
|
|
|
{
|
2014-12-24 06:37:26 +00:00
|
|
|
struct net_device *dev = skb->dev;
|
2014-10-05 17:11:27 +00:00
|
|
|
netdev_features_t features = dev->features;
|
2010-10-29 12:14:55 +00:00
|
|
|
|
2016-04-11 01:44:51 +00:00
|
|
|
if (skb_is_gso(skb))
|
|
|
|
features = gso_features_check(skb, dev, features);
|
2012-07-30 15:57:00 +00:00
|
|
|
|
2014-12-24 06:37:26 +00:00
|
|
|
/* If encapsulation offload request, verify we are testing
|
|
|
|
* hardware encapsulation features instead of standard
|
|
|
|
* features for the netdev
|
|
|
|
*/
|
|
|
|
if (skb->encapsulation)
|
|
|
|
features &= dev->hw_enc_features;
|
|
|
|
|
2015-03-27 05:31:11 +00:00
|
|
|
if (skb_vlan_tagged(skb))
|
|
|
|
features = netdev_intersect_features(features,
|
|
|
|
dev->vlan_features |
|
|
|
|
NETIF_F_HW_VLAN_CTAG_TX |
|
|
|
|
NETIF_F_HW_VLAN_STAG_TX);
|
2011-01-09 06:23:31 +00:00
|
|
|
|
2014-12-24 06:37:26 +00:00
|
|
|
if (dev->netdev_ops->ndo_features_check)
|
|
|
|
features &= dev->netdev_ops->ndo_features_check(skb, dev,
|
|
|
|
features);
|
2015-03-27 05:31:12 +00:00
|
|
|
else
|
|
|
|
features &= dflt_features_check(skb, dev, features);
|
2014-12-24 06:37:26 +00:00
|
|
|
|
2014-05-05 13:00:44 +00:00
|
|
|
return harmonize_features(skb, features);
|
2010-10-29 12:14:55 +00:00
|
|
|
}
|
2014-05-05 13:00:44 +00:00
|
|
|
EXPORT_SYMBOL(netif_skb_features);
|
2010-10-29 12:14:55 +00:00
|
|
|
|
2014-08-30 04:10:01 +00:00
|
|
|
static int xmit_one(struct sk_buff *skb, struct net_device *dev,
|
2014-08-30 04:57:30 +00:00
|
|
|
struct netdev_queue *txq, bool more)
|
2006-06-22 09:57:17 +00:00
|
|
|
{
|
2014-08-30 04:10:01 +00:00
|
|
|
unsigned int len;
|
|
|
|
int rc;
|
2008-11-21 04:14:53 +00:00
|
|
|
|
2015-01-27 19:35:48 +00:00
|
|
|
if (!list_empty(&ptype_all) || !list_empty(&dev->ptype_all))
|
2014-08-30 04:10:01 +00:00
|
|
|
dev_queue_xmit_nit(skb, dev);
|
2011-01-09 06:23:32 +00:00
|
|
|
|
2014-08-30 04:10:01 +00:00
|
|
|
len = skb->len;
|
|
|
|
trace_net_dev_start_xmit(skb, dev);
|
2014-08-30 04:57:30 +00:00
|
|
|
rc = netdev_start_xmit(skb, dev, txq, more);
|
2014-08-30 04:10:01 +00:00
|
|
|
trace_net_dev_xmit(skb, rc, dev, len);
|
2009-06-02 05:19:30 +00:00
|
|
|
|
2014-08-30 04:10:01 +00:00
|
|
|
return rc;
|
|
|
|
}
|
2010-10-20 13:56:04 +00:00
|
|
|
|
2014-09-01 22:06:40 +00:00
|
|
|
struct sk_buff *dev_hard_start_xmit(struct sk_buff *first, struct net_device *dev,
|
|
|
|
struct netdev_queue *txq, int *ret)
|
2014-08-30 04:19:14 +00:00
|
|
|
{
|
|
|
|
struct sk_buff *skb = first;
|
|
|
|
int rc = NETDEV_TX_OK;
|
2010-10-20 13:56:04 +00:00
|
|
|
|
2014-08-30 04:19:14 +00:00
|
|
|
while (skb) {
|
|
|
|
struct sk_buff *next = skb->next;
|
2012-12-07 14:14:15 +00:00
|
|
|
|
2014-08-30 04:19:14 +00:00
|
|
|
skb->next = NULL;
|
2014-08-30 04:57:30 +00:00
|
|
|
rc = xmit_one(skb, dev, txq, next != NULL);
|
2014-08-30 04:19:14 +00:00
|
|
|
if (unlikely(!dev_xmit_complete(rc))) {
|
|
|
|
skb->next = next;
|
|
|
|
goto out;
|
|
|
|
}
|
2010-06-16 14:18:12 +00:00
|
|
|
|
2014-08-30 04:19:14 +00:00
|
|
|
skb = next;
|
|
|
|
if (netif_xmit_stopped(txq) && skb) {
|
|
|
|
rc = NETDEV_TX_BUSY;
|
|
|
|
break;
|
2010-04-22 08:02:07 +00:00
|
|
|
}
|
2014-08-30 04:19:14 +00:00
|
|
|
}
|
2010-04-22 08:02:07 +00:00
|
|
|
|
2014-08-30 04:19:14 +00:00
|
|
|
out:
|
|
|
|
*ret = rc;
|
|
|
|
return skb;
|
|
|
|
}
|
2012-09-18 20:44:49 +00:00
|
|
|
|
2014-10-06 18:26:27 +00:00
|
|
|
static struct sk_buff *validate_xmit_vlan(struct sk_buff *skb,
|
|
|
|
netdev_features_t features)
|
2006-06-22 09:57:17 +00:00
|
|
|
{
|
2015-01-13 16:13:44 +00:00
|
|
|
if (skb_vlan_tag_present(skb) &&
|
2014-11-19 13:04:59 +00:00
|
|
|
!vlan_hw_offload_capable(features, skb->vlan_proto))
|
|
|
|
skb = __vlan_hwaccel_push_inside(skb);
|
2014-08-30 22:17:13 +00:00
|
|
|
return skb;
|
|
|
|
}
|
2006-06-22 09:57:17 +00:00
|
|
|
|
2017-05-18 13:44:41 +00:00
|
|
|
int skb_csum_hwoffload_help(struct sk_buff *skb,
|
|
|
|
const netdev_features_t features)
|
|
|
|
{
|
|
|
|
if (unlikely(skb->csum_not_inet))
|
|
|
|
return !!(features & NETIF_F_SCTP_CRC) ? 0 :
|
|
|
|
skb_crc32c_csum_help(skb);
|
|
|
|
|
|
|
|
return !!(features & NETIF_F_CSUM_MASK) ? 0 : skb_checksum_help(skb);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(skb_csum_hwoffload_help);
|
|
|
|
|
2014-10-03 22:31:07 +00:00
|
|
|
static struct sk_buff *validate_xmit_skb(struct sk_buff *skb, struct net_device *dev)
|
2014-08-30 22:17:13 +00:00
|
|
|
{
|
|
|
|
netdev_features_t features;
|
2006-06-22 09:57:17 +00:00
|
|
|
|
2014-08-30 22:17:13 +00:00
|
|
|
features = netif_skb_features(skb);
|
|
|
|
skb = validate_xmit_vlan(skb, features);
|
|
|
|
if (unlikely(!skb))
|
|
|
|
goto out_null;
|
2010-10-20 13:56:04 +00:00
|
|
|
|
2015-04-17 13:45:04 +00:00
|
|
|
if (netif_needs_gso(skb, features)) {
|
2014-08-31 02:22:20 +00:00
|
|
|
struct sk_buff *segs;
|
|
|
|
|
|
|
|
segs = skb_gso_segment(skb, features);
|
2014-09-19 08:04:38 +00:00
|
|
|
if (IS_ERR(segs)) {
|
2014-12-19 03:09:13 +00:00
|
|
|
goto out_kfree_skb;
|
2014-09-19 08:04:38 +00:00
|
|
|
} else if (segs) {
|
|
|
|
consume_skb(skb);
|
|
|
|
skb = segs;
|
2006-06-22 09:57:17 +00:00
|
|
|
}
|
2014-08-30 22:17:13 +00:00
|
|
|
} else {
|
|
|
|
if (skb_needs_linearize(skb, features) &&
|
|
|
|
__skb_linearize(skb))
|
|
|
|
goto out_kfree_skb;
|
2007-02-09 14:24:36 +00:00
|
|
|
|
2017-04-14 08:07:28 +00:00
|
|
|
if (validate_xmit_xfrm(skb, features))
|
|
|
|
goto out_kfree_skb;
|
|
|
|
|
2014-08-30 22:17:13 +00:00
|
|
|
/* If packet is not checksummed and device does not
|
|
|
|
* support checksumming for this protocol, complete
|
|
|
|
* checksumming here.
|
|
|
|
*/
|
|
|
|
if (skb->ip_summed == CHECKSUM_PARTIAL) {
|
|
|
|
if (skb->encapsulation)
|
|
|
|
skb_set_inner_transport_header(skb,
|
|
|
|
skb_checksum_start_offset(skb));
|
|
|
|
else
|
|
|
|
skb_set_transport_header(skb,
|
|
|
|
skb_checksum_start_offset(skb));
|
2017-05-18 13:44:41 +00:00
|
|
|
if (skb_csum_hwoffload_help(skb, features))
|
2014-08-30 22:17:13 +00:00
|
|
|
goto out_kfree_skb;
|
2010-10-20 13:56:04 +00:00
|
|
|
}
|
2013-04-29 13:02:42 +00:00
|
|
|
}
|
2010-10-20 13:56:04 +00:00
|
|
|
|
2014-08-30 22:17:13 +00:00
|
|
|
return skb;
|
2012-12-07 14:14:15 +00:00
|
|
|
|
2006-06-22 09:57:17 +00:00
|
|
|
out_kfree_skb:
|
|
|
|
kfree_skb(skb);
|
2014-08-30 22:17:13 +00:00
|
|
|
out_null:
|
2016-04-13 04:50:07 +00:00
|
|
|
atomic_long_inc(&dev->tx_dropped);
|
2014-08-30 22:17:13 +00:00
|
|
|
return NULL;
|
|
|
|
}
|
2010-06-16 14:18:12 +00:00
|
|
|
|
2014-10-03 22:31:07 +00:00
|
|
|
struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *dev)
|
|
|
|
{
|
|
|
|
struct sk_buff *next, *head = NULL, *tail;
|
|
|
|
|
2014-10-04 03:59:19 +00:00
|
|
|
for (; skb != NULL; skb = next) {
|
2014-10-03 22:31:07 +00:00
|
|
|
next = skb->next;
|
|
|
|
skb->next = NULL;
|
2014-10-04 03:59:19 +00:00
|
|
|
|
|
|
|
/* in case skb won't be segmented, point to itself */
|
|
|
|
skb->prev = skb;
|
|
|
|
|
2014-10-03 22:31:07 +00:00
|
|
|
skb = validate_xmit_skb(skb, dev);
|
2014-10-04 03:59:19 +00:00
|
|
|
if (!skb)
|
|
|
|
continue;
|
2014-10-03 22:31:07 +00:00
|
|
|
|
2014-10-04 03:59:19 +00:00
|
|
|
if (!head)
|
|
|
|
head = skb;
|
|
|
|
else
|
|
|
|
tail->next = skb;
|
|
|
|
/* If skb was segmented, skb->prev points to
|
|
|
|
* the last segment. If not, it still contains skb.
|
|
|
|
*/
|
|
|
|
tail = skb->prev;
|
2014-10-03 22:31:07 +00:00
|
|
|
}
|
|
|
|
return head;
|
2006-06-22 09:57:17 +00:00
|
|
|
}
|
2016-10-26 15:23:07 +00:00
|
|
|
EXPORT_SYMBOL_GPL(validate_xmit_skb_list);
|
2006-06-22 09:57:17 +00:00
|
|
|
|
2013-01-10 12:36:42 +00:00
|
|
|
static void qdisc_pkt_len_init(struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
const struct skb_shared_info *shinfo = skb_shinfo(skb);
|
|
|
|
|
|
|
|
qdisc_skb_cb(skb)->pkt_len = skb->len;
|
|
|
|
|
|
|
|
/* To get more precise estimation of bytes sent on wire,
|
|
|
|
* we add to pkt_len the headers size of all segments
|
|
|
|
*/
|
|
|
|
if (shinfo->gso_size) {
|
2013-01-16 05:14:21 +00:00
|
|
|
unsigned int hdr_len;
|
2013-03-25 20:19:59 +00:00
|
|
|
u16 gso_segs = shinfo->gso_segs;
|
2013-01-10 12:36:42 +00:00
|
|
|
|
2013-01-16 05:14:21 +00:00
|
|
|
/* mac layer + network layer */
|
|
|
|
hdr_len = skb_transport_header(skb) - skb_mac_header(skb);
|
|
|
|
|
|
|
|
/* + transport layer */
|
2013-01-10 12:36:42 +00:00
|
|
|
if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)))
|
|
|
|
hdr_len += tcp_hdrlen(skb);
|
|
|
|
else
|
|
|
|
hdr_len += sizeof(struct udphdr);
|
2013-03-25 20:19:59 +00:00
|
|
|
|
|
|
|
if (shinfo->gso_type & SKB_GSO_DODGY)
|
|
|
|
gso_segs = DIV_ROUND_UP(skb->len - hdr_len,
|
|
|
|
shinfo->gso_size);
|
|
|
|
|
|
|
|
qdisc_skb_cb(skb)->pkt_len += (gso_segs - 1) * hdr_len;
|
2013-01-10 12:36:42 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2009-08-06 01:44:21 +00:00
|
|
|
static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
|
|
|
|
struct net_device *dev,
|
|
|
|
struct netdev_queue *txq)
|
|
|
|
{
|
|
|
|
spinlock_t *root_lock = qdisc_lock(q);
|
2016-06-22 06:16:49 +00:00
|
|
|
struct sk_buff *to_free = NULL;
|
2011-01-20 03:48:19 +00:00
|
|
|
bool contended;
|
2009-08-06 01:44:21 +00:00
|
|
|
int rc;
|
|
|
|
|
2011-01-20 03:48:19 +00:00
|
|
|
qdisc_calculate_pkt_len(skb, q);
|
2010-06-02 12:09:29 +00:00
|
|
|
/*
|
|
|
|
* Heuristic to force contended enqueues to serialize on a
|
|
|
|
* separate lock before trying to get qdisc main lock.
|
2016-06-06 16:37:15 +00:00
|
|
|
* This permits qdisc->running owner to get the lock more
|
2014-06-26 07:56:31 +00:00
|
|
|
* often and dequeue packets faster.
|
2010-06-02 12:09:29 +00:00
|
|
|
*/
|
2011-01-20 03:48:19 +00:00
|
|
|
contended = qdisc_is_running(q);
|
2010-06-02 12:09:29 +00:00
|
|
|
if (unlikely(contended))
|
|
|
|
spin_lock(&q->busylock);
|
|
|
|
|
2009-08-06 01:44:21 +00:00
|
|
|
spin_lock(root_lock);
|
|
|
|
if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
|
2016-06-22 06:16:49 +00:00
|
|
|
__qdisc_drop(skb, &to_free);
|
2009-08-06 01:44:21 +00:00
|
|
|
rc = NET_XMIT_DROP;
|
|
|
|
} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
|
2010-06-02 10:23:51 +00:00
|
|
|
qdisc_run_begin(q)) {
|
2009-08-06 01:44:21 +00:00
|
|
|
/*
|
|
|
|
* This is a work-conserving queue; there are no old skbs
|
|
|
|
* waiting to be sent out; and the qdisc is not running -
|
|
|
|
* xmit the skb directly.
|
|
|
|
*/
|
2011-01-09 08:30:54 +00:00
|
|
|
|
|
|
|
qdisc_bstats_update(q, skb);
|
|
|
|
|
2014-10-03 22:31:07 +00:00
|
|
|
if (sch_direct_xmit(skb, q, dev, txq, root_lock, true)) {
|
2010-06-02 12:09:29 +00:00
|
|
|
if (unlikely(contended)) {
|
|
|
|
spin_unlock(&q->busylock);
|
|
|
|
contended = false;
|
|
|
|
}
|
2009-08-06 01:44:21 +00:00
|
|
|
__qdisc_run(q);
|
2010-06-02 12:09:29 +00:00
|
|
|
} else
|
2010-06-02 10:23:51 +00:00
|
|
|
qdisc_run_end(q);
|
2009-08-06 01:44:21 +00:00
|
|
|
|
|
|
|
rc = NET_XMIT_SUCCESS;
|
|
|
|
} else {
|
2016-06-22 06:16:49 +00:00
|
|
|
rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
|
2010-06-02 12:09:29 +00:00
|
|
|
if (qdisc_run_begin(q)) {
|
|
|
|
if (unlikely(contended)) {
|
|
|
|
spin_unlock(&q->busylock);
|
|
|
|
contended = false;
|
|
|
|
}
|
|
|
|
__qdisc_run(q);
|
|
|
|
}
|
2009-08-06 01:44:21 +00:00
|
|
|
}
|
|
|
|
spin_unlock(root_lock);
|
2016-06-22 06:16:49 +00:00
|
|
|
if (unlikely(to_free))
|
|
|
|
kfree_skb_list(to_free);
|
2010-06-02 12:09:29 +00:00
|
|
|
if (unlikely(contended))
|
|
|
|
spin_unlock(&q->busylock);
|
2009-08-06 01:44:21 +00:00
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
2013-12-29 16:27:11 +00:00
|
|
|
#if IS_ENABLED(CONFIG_CGROUP_NET_PRIO)
|
2011-11-22 05:10:51 +00:00
|
|
|
static void skb_update_prio(struct sk_buff *skb)
|
|
|
|
{
|
2011-11-25 07:44:54 +00:00
|
|
|
struct netprio_map *map = rcu_dereference_bh(skb->dev->priomap);
|
2011-11-22 05:10:51 +00:00
|
|
|
|
2012-07-08 21:45:10 +00:00
|
|
|
if (!skb->priority && skb->sk && map) {
|
2015-12-07 22:38:52 +00:00
|
|
|
unsigned int prioidx =
|
|
|
|
sock_cgroup_prioidx(&skb->sk->sk_cgrp_data);
|
2012-07-08 21:45:10 +00:00
|
|
|
|
|
|
|
if (prioidx < map->priomap_len)
|
|
|
|
skb->priority = map->priomap[prioidx];
|
|
|
|
}
|
2011-11-22 05:10:51 +00:00
|
|
|
}
|
|
|
|
#else
|
|
|
|
#define skb_update_prio(skb)
|
|
|
|
#endif
|
|
|
|
|
2015-04-01 15:07:44 +00:00
|
|
|
DEFINE_PER_CPU(int, xmit_recursion);
|
|
|
|
EXPORT_SYMBOL(xmit_recursion);
|
|
|
|
|
2012-06-12 10:16:35 +00:00
|
|
|
/**
|
|
|
|
* dev_loopback_xmit - loop back @skb
|
2015-09-16 01:04:18 +00:00
|
|
|
* @net: network namespace this loopback is happening in
|
|
|
|
* @sk: sk needed to be a netfilter okfn
|
2012-06-12 10:16:35 +00:00
|
|
|
* @skb: buffer to transmit
|
|
|
|
*/
|
2015-09-16 01:04:18 +00:00
|
|
|
int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
|
2012-06-12 10:16:35 +00:00
|
|
|
{
|
|
|
|
skb_reset_mac_header(skb);
|
|
|
|
__skb_pull(skb, skb_network_offset(skb));
|
|
|
|
skb->pkt_type = PACKET_LOOPBACK;
|
|
|
|
skb->ip_summed = CHECKSUM_UNNECESSARY;
|
|
|
|
WARN_ON(!skb_dst(skb));
|
|
|
|
skb_dst_force(skb);
|
|
|
|
netif_rx_ni(skb);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(dev_loopback_xmit);
|
|
|
|
|
net, sched: add clsact qdisc
This work adds a generalization of the ingress qdisc as a qdisc holding
only classifiers. The clsact qdisc works on ingress, but also on egress.
In both cases, its execution happens without taking the qdisc lock, and
the main difference for the egress part compared to prior version of [1]
is that this can be applied with _any_ underlying real egress qdisc (also
classless ones).
Besides solving the use-case of [1], that is, allowing for more programmability
on assigning skb->priority for the mqprio case that is supported by most
popular 10G+ NICs, it also opens up a lot more flexibility for other tc
applications. The main work on classification can already be done at clsact
egress time if the use-case allows and state stored for later retrieval
f.e. again in skb->priority with major/minors (which is checked by most
classful qdiscs before consulting tc_classify()) and/or in other skb fields
like skb->tc_index for some light-weight post-processing to get to the
eventual classid in case of a classful qdisc. Another use case is that
the clsact egress part allows to have a central egress counterpart to
the ingress classifiers, so that classifiers can easily share state (e.g.
in cls_bpf via eBPF maps) for ingress and egress.
Currently, default setups like mq + pfifo_fast would require for this to
use, for example, prio qdisc instead (to get a tc_classify() run) and to
duplicate the egress classifier for each queue. With clsact, it allows
for leaving the setup as is, it can additionally assign skb->priority to
put the skb in one of pfifo_fast's bands and it can share state with maps.
Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
w/o the need to perform a skb_dst_force() to hold on to it any longer. In
lwt case, we can also use this facility to setup dst metadata via cls_bpf
(bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
that (case of IFF_NO_QUEUE devices, for example).
The realization can be done without any changes to the scheduler core
framework. All it takes is that we have two a-priori defined minors/child
classes, where we can mux between ingress and egress classifier list
(dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
dev->_tx to avoid extra cacheline miss for moderate loads). The egress
part is modelled similarly to handle_ing() and patched to a noop in
case the functionality is not used. Both handlers are now called
sch_handle_ingress() and sch_handle_egress(), code sharing among the two
doesn't seem practical as there are various minor differences in both
paths, so that making them conditional in a single handler would rather
slow things down.
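
A minimal user-space sketch of the muxing described above, assuming
hypothetical names (demo_dev, demo_find_list) that are not part of the
patch. The handle macros and the two minor values mirror the uapi packet
scheduler definitions; the list heads are plain strings here rather than
struct tcf_proto pointers.

```c
#include <stddef.h>
#include <stdint.h>

/* Handle layout: upper 16 bits are the major, lower 16 bits the minor. */
#define TC_H_MAJ_MASK		0xFFFF0000U
#define TC_H_MIN_MASK		0x0000FFFFU
#define TC_H_MIN(h)		((h) & TC_H_MIN_MASK)
#define TC_H_MAKE(maj, min)	(((maj) & TC_H_MAJ_MASK) | ((min) & TC_H_MIN_MASK))

/* clsact piggybacks on the ingress handle and adds two fixed minors. */
#define TC_H_INGRESS		0xFFFFFFF1U
#define TC_H_CLSACT		TC_H_INGRESS
#define TC_H_MIN_INGRESS	0xFFF2U
#define TC_H_MIN_EGRESS		0xFFF3U

/* Hypothetical stand-in for struct net_device: only the two lists matter. */
struct demo_dev {
	const char *ingress_cl_list;
	const char *egress_cl_list;
};

/* Hypothetical helper: select the classifier list by the parent's minor. */
static const char **demo_find_list(struct demo_dev *dev, uint32_t parent)
{
	switch (TC_H_MIN(parent)) {
	case TC_H_MIN_INGRESS:
		return &dev->ingress_cl_list;
	case TC_H_MIN_EGRESS:
		return &dev->egress_cl_list;
	default:
		/* Any other minor has no list attached in this sketch. */
		return NULL;
	}
}
```

With these definitions, TC_H_MAKE(TC_H_CLSACT, TC_H_MIN_INGRESS) is the
handle that tc prints as ffff:fff2, and the egress counterpart is ffff:fff3.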
Full compatibility to ingress qdisc is provided as well. Since both
piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
per netdevice, and thus ingress qdisc specific behaviour can be retained
for user space. This means, either a user does 'tc qdisc add dev foo ingress'
and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
alternative, where both, ingress and egress classifier can be configured
as in the below example. ingress qdisc supports attaching classifier to any
minor number whereas clsact has two fixed minors for muxing between the
lists, therefore to not break user space setups, they are better done as
two separate qdiscs.
I decided to extend the sch_ingress module with clsact functionality so
that commonly used code can be reused, the module is being aliased with
sch_clsact so that it can be auto-loaded properly. Alternative would have been
to add a flag when initializing ingress to alter its behaviour plus aliasing
to a different name (as it's more than just ingress). However, the first would
end up, based on the flag, choosing the new/old behaviour by calling different
function implementations to handle each anyway, the latter would require to
register ingress qdisc once again under different alias. So, this really begs
to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
by its own that share callbacks used by both.
Example, adding qdisc:
# tc qdisc add dev foo clsact
# tc qdisc show dev foo
qdisc mq 0: root
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc clsact ffff: parent ffff:fff1
Adding filters (deleting, etc works analogous by specifying ingress/egress):
# tc filter add dev foo ingress bpf da obj bar.o sec ingress
# tc filter add dev foo egress bpf da obj bar.o sec egress
# tc filter show dev foo ingress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
# tc filter show dev foo egress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action
A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
show an empty list for clsact. Either using the parent names (ingress/egress)
or specifying the full major/minor will then show the related filter lists.
Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.
[1] http://patchwork.ozlabs.org/patch/512949/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-07 21:29:47 +00:00
#ifdef CONFIG_NET_EGRESS
static struct sk_buff *
sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
{
	struct tcf_proto *cl = rcu_dereference_bh(dev->egress_cl_list);
	struct tcf_result cl_res;

	if (!cl)
		return skb;

2017-01-07 22:06:37 +00:00
	/* qdisc_skb_cb(skb)->pkt_len was already set by the caller. */
2016-01-07 21:29:47 +00:00
	qdisc_bstats_cpu_update(cl->q, skb);

2017-05-17 09:07:54 +00:00
	switch (tcf_classify(skb, cl, &cl_res, false)) {
2016-01-07 21:29:47 +00:00
	case TC_ACT_OK:
	case TC_ACT_RECLASSIFY:
		skb->tc_index = TC_H_MIN(cl_res.classid);
		break;
	case TC_ACT_SHOT:
		qdisc_qstats_cpu_drop(cl->q);
		*ret = NET_XMIT_DROP;
2016-05-15 21:28:29 +00:00
		kfree_skb(skb);
		return NULL;
2016-01-07 21:29:47 +00:00
	case TC_ACT_STOLEN:
	case TC_ACT_QUEUED:
2017-06-06 12:12:02 +00:00
	case TC_ACT_TRAP:
2016-01-07 21:29:47 +00:00
		*ret = NET_XMIT_SUCCESS;
2016-05-15 21:28:29 +00:00
		consume_skb(skb);
2016-01-07 21:29:47 +00:00
		return NULL;
	case TC_ACT_REDIRECT:
		/* No need to push/pop skb's mac_header here on egress! */
		skb_do_redirect(skb);
		*ret = NET_XMIT_SUCCESS;
		return NULL;
	default:
		break;
	}

	return skb;
}
#endif /* CONFIG_NET_EGRESS */

static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
{
#ifdef CONFIG_XPS
	struct xps_dev_maps *dev_maps;
	struct xps_map *map;
	int queue_index = -1;

	rcu_read_lock();
	dev_maps = rcu_dereference(dev->xps_maps);
	if (dev_maps) {
		unsigned int tci = skb->sender_cpu - 1;

		if (dev->num_tc) {
			tci *= dev->num_tc;
			tci += netdev_get_prio_tc_map(dev, skb->priority);
		}

		map = rcu_dereference(dev_maps->cpu_map[tci]);
		if (map) {
			if (map->len == 1)
				queue_index = map->queues[0];
			else
				queue_index = map->queues[reciprocal_scale(skb_get_hash(skb),
									   map->len)];
			if (unlikely(queue_index >= dev->real_num_tx_queues))
				queue_index = -1;
		}
	}
	rcu_read_unlock();

	return queue_index;
#else
	return -1;
#endif
}

static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
{
	struct sock *sk = skb->sk;
	int queue_index = sk_tx_queue_get(sk);

	if (queue_index < 0 || skb->ooo_okay ||
	    queue_index >= dev->real_num_tx_queues) {
		int new_index = get_xps_queue(dev, skb);

		if (new_index < 0)
			new_index = skb_tx_hash(dev, skb);

		if (queue_index != new_index && sk &&
		    sk_fullsock(sk) &&
		    rcu_access_pointer(sk->sk_dst_cache))
			sk_tx_queue_set(sk, new_index);

		queue_index = new_index;
	}

	return queue_index;
}

struct netdev_queue *netdev_pick_tx(struct net_device *dev,
				    struct sk_buff *skb,
				    void *accel_priv)
{
	int queue_index = 0;

#ifdef CONFIG_XPS
	u32 sender_cpu = skb->sender_cpu - 1;

	if (sender_cpu >= (u32)NR_CPUS)
		skb->sender_cpu = raw_smp_processor_id() + 1;
#endif

	if (dev->real_num_tx_queues != 1) {
		const struct net_device_ops *ops = dev->netdev_ops;

		if (ops->ndo_select_queue)
			queue_index = ops->ndo_select_queue(dev, skb, accel_priv,
							    __netdev_pick_tx);
		else
			queue_index = __netdev_pick_tx(dev, skb);

		if (!accel_priv)
			queue_index = netdev_cap_txqueue(dev, queue_index);
	}

	skb_set_queue_mapping(skb, queue_index);
	return netdev_get_tx_queue(dev, queue_index);
}

/**
 * __dev_queue_xmit - transmit a buffer
 * @skb: buffer to transmit
 * @accel_priv: private data used for L2 forwarding offload
 *
 * Queue a buffer for transmission to a network device. The caller must
 * have set the device and priority and built the buffer before calling
 * this function. The function can be called from an interrupt.
 *
 * A negative errno code is returned on a failure. A success does not
 * guarantee the frame will be transmitted as it may be dropped due
 * to congestion or traffic shaping.
 *
 * -----------------------------------------------------------------------------------
 * I notice this method can also return errors from the queue disciplines,
 * including NET_XMIT_DROP, which is a positive value. So, errors can also
 * be positive.
 *
 * Regardless of the return value, the skb is consumed, so it is currently
 * difficult to retry a send to this method. (You can bump the ref count
 * before sending to hold a reference for retry if you are careful.)
 *
 * When calling this method, interrupts MUST be enabled. This is because
 * the BH enable code must have IRQs enabled so that it will not deadlock.
 * --BLG
 */
static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
{
	struct net_device *dev = skb->dev;
	struct netdev_queue *txq;
	struct Qdisc *q;
	int rc = -ENOMEM;

	skb_reset_mac_header(skb);

	if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_SCHED_TSTAMP))
		__skb_tstamp_tx(skb, NULL, skb->sk, SCM_TSTAMP_SCHED);

	/* Disable soft irqs for various locks below. Also
	 * stops preemption for RCU.
	 */
	rcu_read_lock_bh();

	skb_update_prio(skb);

	qdisc_pkt_len_init(skb);
#ifdef CONFIG_NET_CLS_ACT
	skb->tc_at_ingress = 0;
# ifdef CONFIG_NET_EGRESS
	if (static_key_false(&egress_needed)) {
		skb = sch_handle_egress(skb, &rc, dev);
		if (!skb)
			goto out;
	}
# endif
#endif
	/* If device/qdisc don't need skb->dst, release it right now while
	 * its hot in this cpu cache.
	 */
	if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
		skb_dst_drop(skb);
	else
		skb_dst_force(skb);

	txq = netdev_pick_tx(dev, skb, accel_priv);
	q = rcu_dereference_bh(txq->qdisc);

	trace_net_dev_queue(skb);
	if (q->enqueue) {
		rc = __dev_xmit_skb(skb, q, dev, txq);
		goto out;
	}

	/* The device has no queue. Common case for software devices:
	 * loopback, all the sorts of tunnels...

	 * Really, it is unlikely that netif_tx_lock protection is necessary
	 * here. (f.e. loopback and IP tunnels are clean ignoring statistics
	 * counters.)
	 * However, it is possible, that they rely on protection
	 * made by us here.

	 * Check this and shot the lock. It is not prone from deadlocks.
	 *Either shot noqueue qdisc, it is even simpler 8)
	 */
	if (dev->flags & IFF_UP) {
		int cpu = smp_processor_id(); /* ok because BHs are off */

		if (txq->xmit_lock_owner != cpu) {
			if (unlikely(__this_cpu_read(xmit_recursion) >
				     XMIT_RECURSION_LIMIT))
				goto recursion_alert;

			skb = validate_xmit_skb(skb, dev);
			if (!skb)
				goto out;

			HARD_TX_LOCK(dev, txq, cpu);

			if (!netif_xmit_stopped(txq)) {
				__this_cpu_inc(xmit_recursion);
				skb = dev_hard_start_xmit(skb, dev, txq, &rc);
				__this_cpu_dec(xmit_recursion);
				if (dev_xmit_complete(rc)) {
					HARD_TX_UNLOCK(dev, txq);
					goto out;
				}
			}
			HARD_TX_UNLOCK(dev, txq);
			net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
					     dev->name);
		} else {
			/* Recursion is detected! It is possible,
			 * unfortunately
			 */
recursion_alert:
			net_crit_ratelimited("Dead loop on virtual device %s, fix it urgently!\n",
					     dev->name);
		}
	}

	rc = -ENETDOWN;
	rcu_read_unlock_bh();

	atomic_long_inc(&dev->tx_dropped);
	kfree_skb_list(skb);
	return rc;
out:
	rcu_read_unlock_bh();
	return rc;
}

int dev_queue_xmit(struct sk_buff *skb)
{
	return __dev_queue_xmit(skb, NULL);
}
EXPORT_SYMBOL(dev_queue_xmit);

int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv)
{
	return __dev_queue_xmit(skb, accel_priv);
}
EXPORT_SYMBOL(dev_queue_xmit_accel);


/*************************************************************************
 *			Receiver routines
 *************************************************************************/

int netdev_max_backlog __read_mostly = 1000;
EXPORT_SYMBOL(netdev_max_backlog);

net: Consistent skb timestamping
With RPS inclusion, skb timestamping is not consistent in the RX path.
If netif_receive_skb() is used, it is deferred until after RPS dispatch.
If netif_rx() is used, it is done before RPS dispatch.
This can give strange tcpdump timestamp results.
I think timestamping should be done as soon as possible in the receive
path, to get meaningful values (i.e. timestamps taken at the time the packet
was delivered by the NIC driver to our stack), even if NAPI already can
defer timestamping a bit (RPS can help to reduce the gap).
Tom Herbert prefers to sample timestamps after RPS dispatch. In case
sampling is expensive (HPET/acpi_pm on x86), this makes sense.
Let admins switch from one mode to another, using a new
sysctl, /proc/sys/net/core/netdev_tstamp_prequeue.
Its default value (1) means timestamps are taken as soon as possible,
before backlog queueing, giving accurate timestamps.
Setting it to 0 samples timestamps when processing the backlog,
after RPS dispatch, to lower the load on the pre-RPS cpu.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-16 06:57:10 +00:00
int netdev_tstamp_prequeue __read_mostly = 1;
int netdev_budget __read_mostly = 300;
unsigned int __read_mostly netdev_budget_usecs = 2000;
int weight_p __read_mostly = 64;           /* old backlog weight */
int dev_weight_rx_bias __read_mostly = 1;  /* bias for backlog weight */
int dev_weight_tx_bias __read_mostly = 1;  /* bias for output_queue quota */
int dev_rx_weight __read_mostly = 64;
int dev_tx_weight __read_mostly = 64;

/* Called with irq disabled */
static inline void ____napi_schedule(struct softnet_data *sd,
				     struct napi_struct *napi)
{
	list_add_tail(&napi->poll_list, &sd->poll_list);
	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
}

#ifdef CONFIG_RPS

/* One global table that all flow-based protocols share. */
struct rps_sock_flow_table __rcu *rps_sock_flow_table __read_mostly;
EXPORT_SYMBOL(rps_sock_flow_table);

net: rfs: add hash collision detection
Receive Flow Steering is a nice solution but suffers from
hash collisions when a mix of connected and unconnected traffic
is received on the host, when the flow hash table is populated.
Also, clearing the flow in inet_release() makes RFS not very good
for short-lived flows, as many packets can follow close()
(FIN, ACK packets, ...).
This patch extends the information stored in the global hash table
to include not only the cpu number, but also the upper part of the hash value.
I use a 32-bit value, and dynamically split it in two parts.
For a host with fewer than 64 possible cpus, this gives 6 bits for the
cpu number, and 26 (32-6) bits for the upper part of the hash.
Since hash bucket selection uses the low-order bits of the hash, we have
a full hash match if /proc/sys/net/core/rps_sock_flow_entries is big
enough.
If the hash found in the flow table does not match, we fall back to RPS (if
it is enabled for the rxqueue).
This means that a packet for a non-connected flow can avoid the
IPI through an unrelated/victim CPU.
This also means we no longer have to clear the table at socket
close time, and this helps short-lived flow performance.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-06 20:59:01 +00:00
u32 rps_cpu_mask __read_mostly;
EXPORT_SYMBOL(rps_cpu_mask);

struct static_key rps_needed __read_mostly;
EXPORT_SYMBOL(rps_needed);
struct static_key rfs_needed __read_mostly;
EXPORT_SYMBOL(rfs_needed);

static struct rps_dev_flow *
set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
	    struct rps_dev_flow *rflow, u16 next_cpu)
{
	if (next_cpu < nr_cpu_ids) {
#ifdef CONFIG_RFS_ACCEL
		struct netdev_rx_queue *rxqueue;
		struct rps_dev_flow_table *flow_table;
		struct rps_dev_flow *old_rflow;
		u32 flow_id;
		u16 rxq_index;
		int rc;

		/* Should we steer this flow to a different hardware queue? */
		if (!skb_rx_queue_recorded(skb) || !dev->rx_cpu_rmap ||
		    !(dev->features & NETIF_F_NTUPLE))
			goto out;
		rxq_index = cpu_rmap_lookup_index(dev->rx_cpu_rmap, next_cpu);
		if (rxq_index == skb_get_rx_queue(skb))
			goto out;

		rxqueue = dev->_rx + rxq_index;
		flow_table = rcu_dereference(rxqueue->rps_flow_table);
		if (!flow_table)
			goto out;
		flow_id = skb_get_hash(skb) & flow_table->mask;
		rc = dev->netdev_ops->ndo_rx_flow_steer(dev, skb,
							rxq_index, flow_id);
		if (rc < 0)
			goto out;
		old_rflow = rflow;
		rflow = &flow_table->flows[flow_id];
		rflow->filter = rc;
		if (old_rflow->filter == rflow->filter)
			old_rflow->filter = RPS_NO_FILTER;
	out:
#endif
		rflow->last_qtail =
			per_cpu(softnet_data, next_cpu).input_queue_head;
	}

	rflow->cpu = next_cpu;
	return rflow;
}
|
|
|
|
|
2010-08-04 06:15:52 +00:00
|
|
|
/*
|
|
|
|
* get_rps_cpu is called from netif_receive_skb and returns the target
|
|
|
|
* CPU from the RPS map of the receiving queue for a given skb.
|
|
|
|
* rcu_read_lock must be held on entry.
|
|
|
|
*/
|
|
|
|
static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
|
|
|
|
struct rps_dev_flow **rflowp)
|
|
|
|
{
|
net: rfs: add hash collision detection
Receive Flow Steering is a nice solution but suffers from
hash collisions when a mix of connected and unconnected traffic
is received on the host, when flow hash table is populated.
Also, clearing flow in inet_release() makes RFS not very good
for short lived flows, as many packets can follow close().
(FIN , ACK packets, ...)
This patch extends the information stored into global hash table
to not only include cpu number, but upper part of the hash value.
I use a 32bit value, and dynamically split it in two parts.
For host with less than 64 possible cpus, this gives 6 bits for the
cpu number, and 26 (32-6) bits for the upper part of the hash.
Since hash bucket selection use low order bits of the hash, we have
a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
enough.
If the hash found in flow table does not match, we fallback to RPS (if
it is enabled for the rxqueue).
This means that a packet for an non connected flow can avoid the
IPI through a unrelated/victim CPU.
This also means we no longer have to clear the table at socket
close time, and this helps short lived flows performance.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-06 20:59:01 +00:00
|
|
|
const struct rps_sock_flow_table *sock_flow_table;
|
|
|
|
struct netdev_rx_queue *rxqueue = dev->_rx;
|
2010-08-04 06:15:52 +00:00
|
|
|
struct rps_dev_flow_table *flow_table;
|
net: rfs: add hash collision detection
Receive Flow Steering is a nice solution but suffers from
hash collisions when a mix of connected and unconnected traffic
is received on the host, when flow hash table is populated.
Also, clearing flow in inet_release() makes RFS not very good
for short lived flows, as many packets can follow close().
(FIN , ACK packets, ...)
This patch extends the information stored into global hash table
to not only include cpu number, but upper part of the hash value.
I use a 32bit value, and dynamically split it in two parts.
For host with less than 64 possible cpus, this gives 6 bits for the
cpu number, and 26 (32-6) bits for the upper part of the hash.
Since hash bucket selection use low order bits of the hash, we have
a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
enough.
If the hash found in flow table does not match, we fallback to RPS (if
it is enabled for the rxqueue).
This means that a packet for an non connected flow can avoid the
IPI through a unrelated/victim CPU.
This also means we no longer have to clear the table at socket
close time, and this helps short lived flows performance.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-06 20:59:01 +00:00
|
|
|
struct rps_map *map;
|
2010-08-04 06:15:52 +00:00
|
|
|
int cpu = -1;
|
net: rfs: add hash collision detection
Receive Flow Steering is a nice solution but suffers from
hash collisions when a mix of connected and unconnected traffic
is received on the host, when flow hash table is populated.
Also, clearing flow in inet_release() makes RFS not very good
for short lived flows, as many packets can follow close().
(FIN , ACK packets, ...)
This patch extends the information stored into global hash table
to not only include cpu number, but upper part of the hash value.
I use a 32bit value, and dynamically split it in two parts.
For host with less than 64 possible cpus, this gives 6 bits for the
cpu number, and 26 (32-6) bits for the upper part of the hash.
Since hash bucket selection use low order bits of the hash, we have
a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
enough.
If the hash found in flow table does not match, we fallback to RPS (if
it is enabled for the rxqueue).
This means that a packet for an non connected flow can avoid the
IPI through a unrelated/victim CPU.
This also means we no longer have to clear the table at socket
close time, and this helps short lived flows performance.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-06 20:59:01 +00:00
|
|
|
u32 tcpu;
|
2014-03-24 22:34:47 +00:00
|
|
|
u32 hash;
|
2010-08-04 06:15:52 +00:00
|
|
|
|
|
|
|
if (skb_rx_queue_recorded(skb)) {
|
|
|
|
u16 index = skb_get_rx_queue(skb);
|
2015-02-06 20:59:01 +00:00
|
|
|
|
2010-09-27 08:24:33 +00:00
|
|
|
if (unlikely(index >= dev->real_num_rx_queues)) {
|
|
|
|
WARN_ONCE(dev->real_num_rx_queues > 1,
|
|
|
|
"%s received packet on queue %u, but number "
|
|
|
|
"of RX queues is %u\n",
|
|
|
|
dev->name, index, dev->real_num_rx_queues);
|
2010-08-04 06:15:52 +00:00
|
|
|
goto done;
|
|
|
|
}
|
2015-02-06 20:59:01 +00:00
|
|
|
rxqueue += index;
|
|
|
|
}
|
2010-08-04 06:15:52 +00:00
|
|
|
|
2015-02-06 20:59:01 +00:00
|
|
|
/* Avoid computing hash if RFS/RPS is not active for this rxqueue */
|
|
|
|
|
|
|
|
flow_table = rcu_dereference(rxqueue->rps_flow_table);
|
2010-10-25 03:02:02 +00:00
|
|
|
map = rcu_dereference(rxqueue->rps_map);
|
2015-02-06 20:59:01 +00:00
|
|
|
if (!flow_table && !map)
|
2010-08-04 06:15:52 +00:00
|
|
|
goto done;
|
|
|
|
|
2010-08-17 19:00:56 +00:00
|
|
|
skb_reset_network_header(skb);
|
2014-03-24 22:34:47 +00:00
|
|
|
hash = skb_get_hash(skb);
|
|
|
|
if (!hash)
|
2010-08-04 06:15:52 +00:00
|
|
|
goto done;
|
|
|
|
|
rfs: Receive Flow Steering
This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).
The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure. The rxhash is passed in skb's received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.
The complication with this simple approach is that it would potentially
allow out-of-order (OOO) packets. If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets;
we consider this a non-starter.
To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.
rps_sock_flow_table is a global hash table. Each entry is just a CPU
number, and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.
rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.
Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:
- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table. This checks if the queue tail has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.
Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to the interrupting CPU is good for locality. 2)
this allows lockless access to the table: the CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb, and we assume that this is only called from
device napi_poll, which is non-reentrant.
This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.
There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table, the per rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue. Both are rounded to power of two.
The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
applications processing; this can result in increased performance
(higher pps, lower latency).
The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.
Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp. The RPC test is a request/response
test similar in structure to the netperf RR test, with 100 threads on
each host, but does more work in userspace than netperf.
e1000e on 8 core Intel
  No RFS or RPS              104K tps at 30% CPU
  No RFS (best RPS config):  290K tps at 63% CPU
  RFS                        303K tps at 61% CPU
RPC test      tps   CPU%  50/90/99% usec latency  Latency StdDev
  No RFS/RPS  103K  48%   757/900/3185            4472.35
  RPS only:   174K  73%   415/993/2468            491.66
  RFS         223K  73%   379/651/1382            315.61
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16 23:01:27 +00:00
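The three switch conditions listed in the commit message above can be sketched as a small predicate. This is a minimal sketch under stated assumptions, not the kernel's implementation: `queue_tail` and `may_switch_cpu` are hypothetical names, the CPU online state is passed in as a plain flag, and the signed-difference comparison is a choice of this sketch to tolerate counter wraparound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Tail counter of a backlog queue: head counter plus current length. */
static uint32_t queue_tail(uint32_t head, uint32_t len)
{
	return head + len;
}

/*
 * Hypothetical helper: may a flow move from cur_cpu to desired_cpu?
 * cur_head is the head counter of cur_cpu's backlog queue; last_qtail
 * is the tail counter saved in the per-queue flow entry at last enqueue.
 */
static bool may_switch_cpu(uint32_t cur_cpu, uint32_t desired_cpu,
			   uint32_t ncpus, bool cur_online,
			   uint32_t cur_head, uint32_t last_qtail)
{
	assert(desired_cpu < ncpus);
	if (cur_cpu == desired_cpu)
		return false;		/* nothing to change */
	if (cur_cpu >= ncpus)
		return true;		/* current CPU is unset */
	if (!cur_online)
		return true;		/* current CPU is offline */
	/*
	 * Assumption of this sketch: compare through a signed difference
	 * so the test survives 32-bit counter wraparound (relies on the
	 * usual two's complement conversion).
	 */
	return (int32_t)(cur_head - last_qtail) >= 0;
}
```

When the head counter has not yet reached the saved tail, packets enqueued through this entry are still pending on the old CPU, so the flow stays put and in-order delivery is preserved.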
|
|
|
sock_flow_table = rcu_dereference(rps_sock_flow_table);
|
|
|
|
if (flow_table && sock_flow_table) {
|
|
|
|
struct rps_dev_flow *rflow;
|
2015-02-06 20:59:01 +00:00
|
|
|
u32 next_cpu;
|
|
|
|
u32 ident;
|
|
|
|
|
|
|
|
/* First check into global flow table if there is a match */
|
|
|
|
ident = sock_flow_table->ents[hash & sock_flow_table->mask];
|
|
|
|
if ((ident ^ hash) & ~rps_cpu_mask)
|
|
|
|
goto try_rps;
|
2010-04-16 23:01:27 +00:00
|
|
|
|
2015-02-06 20:59:01 +00:00
|
|
|
next_cpu = ident & rps_cpu_mask;
|
|
|
|
|
|
|
|
/* OK, now we know there is a match,
|
|
|
|
* we can look at the local (per receive queue) flow table
|
|
|
|
*/
|
2014-03-24 22:34:47 +00:00
|
|
|
rflow = &flow_table->flows[hash & flow_table->mask];
|
2010-04-16 23:01:27 +00:00
|
|
|
tcpu = rflow->cpu;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the desired CPU (where last recvmsg was done) is
|
|
|
|
* different from current CPU (one in the rx-queue flow
|
|
|
|
* table entry), switch if one of the following holds:
|
2015-04-25 16:35:24 +00:00
|
|
|
* - Current CPU is unset (>= nr_cpu_ids).
|
2010-04-16 23:01:27 +00:00
|
|
|
* - Current CPU is offline.
|
|
|
|
* - The current CPU's queue tail has advanced beyond the
|
|
|
|
* last packet that was enqueued using this table entry.
|
|
|
|
* This guarantees that all previous packets for the flow
|
|
|
|
* have been dequeued, thus preserving in order delivery.
|
|
|
|
*/
|
|
|
|
if (unlikely(tcpu != next_cpu) &&
|
2015-04-25 16:35:24 +00:00
|
|
|
(tcpu >= nr_cpu_ids || !cpu_online(tcpu) ||
|
rfs: Receive Flow Steering
This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).
The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure. The rxhash is passed in skb's received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.
The convolution of the simple approach is that it would potentially
allow OOO packets. If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.
To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.
rps_sock_flow_table is a global hash table. Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.
rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.
Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:
- The current CPU is unset (equal to RPS_NO_CPU)
- The current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table. This checks if the queue head has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.
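The counter scheme and the switch rule described above can be sketched in user-space C. This is a simplified illustration, not the kernel code: the struct names, `NO_CPU`, and the helper functions are hypothetical stand-ins, and only the signed head-minus-tail comparison mirrors the check that appears in get_rps_cpu.

```c
#include <stdbool.h>
#include <stdint.h>

#define NO_CPU 0xffffu /* hypothetical stand-in for RPS_NO_CPU */

/* Hypothetical, simplified model of one CPU's backlog queue. */
struct backlog {
	uint32_t head;  /* incremented on every dequeue; allowed to wrap */
	uint32_t len;   /* packets currently queued */
	bool online;
};

/* Hypothetical model of one rps_dev_flow_table entry. */
struct dev_flow {
	uint16_t cpu;        /* "current" CPU for the flow */
	uint32_t last_qtail; /* tail counter saved at the last enqueue */
};

/* Enqueue one packet; return the tail counter (head + length) after it. */
static uint32_t backlog_enqueue(struct backlog *q)
{
	q->len++;
	return q->head + q->len;
}

/* Dequeue one packet, advancing the head counter. */
static void backlog_dequeue(struct backlog *q)
{
	if (q->len) {
		q->len--;
		q->head++;
	}
}

/* True if the flow may move from its current CPU to the desired CPU. */
static bool may_switch_cpu(const struct dev_flow *f,
			   const struct backlog *queues, unsigned int nr_cpus)
{
	if (f->cpu == NO_CPU || f->cpu >= nr_cpus)
		return true; /* current CPU is unset */
	if (!queues[f->cpu].online)
		return true; /* current CPU is offline */
	/* Signed difference keeps the comparison correct across wraparound. */
	return (int32_t)(queues[f->cpu].head - f->last_qtail) >= 0;
}
```

Casting the difference to a signed type, rather than comparing the two counters directly, is what lets the check survive the 32-bit counters wrapping around.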
Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to the interrupting CPU is good for locality. 2)
this allows lockless access to the table-- the CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb, and we assume that this is only called from
device napi_poll, which is non-reentrant.
This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.
There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table; the per-rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue. Both are rounded to a power of two.
The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
application's processing; this can result in increased performance
(higher pps, lower latency).
The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.
Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp. The RPC test is a request/response
test similar in structure to netperf RR test with 100 threads on
each host, but does more work in userspace than netperf.
e1000e on 8 core Intel
   No RFS or RPS               104K tps at 30% CPU
   No RFS (best RPS config):   290K tps at 63% CPU
   RFS                         303K tps at 61% CPU

RPC test       tps    CPU%   50/90/99% usec latency   Latency StdDev
   No RFS/RPS  103K   48%    757/900/3185             4472.35
   RPS only:   174K   73%    415/993/2468             491.66
   RFS         223K   73%    379/651/1382             315.61
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16 23:01:27 +00:00
|
|
|
((int)(per_cpu(softnet_data, tcpu).input_queue_head -
|
2012-11-16 09:04:15 +00:00
|
|
|
rflow->last_qtail)) >= 0)) {
|
|
|
|
tcpu = next_cpu;
|
2011-01-19 11:03:53 +00:00
|
|
|
rflow = set_rps_cpu(dev, skb, rflow, next_cpu);
|
2012-11-16 09:04:15 +00:00
|
|
|
}
|
2011-01-19 11:03:53 +00:00
|
|
|
|
2015-04-25 16:35:24 +00:00
|
|
|
if (tcpu < nr_cpu_ids && cpu_online(tcpu)) {
|
2010-04-16 23:01:27 +00:00
|
|
|
*rflowp = rflow;
|
|
|
|
cpu = tcpu;
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
net: rfs: add hash collision detection
Receive Flow Steering is a nice solution but suffers from
hash collisions when a mix of connected and unconnected traffic
is received on the host, when flow hash table is populated.
Also, clearing flow in inet_release() makes RFS not very good
for short lived flows, as many packets can follow close().
(FIN , ACK packets, ...)
This patch extends the information stored into global hash table
to not only include cpu number, but upper part of the hash value.
I use a 32bit value, and dynamically split it in two parts.
For hosts with fewer than 64 possible CPUs, this gives 6 bits for the
CPU number, and 26 (32-6) bits for the upper part of the hash.
Since hash bucket selection uses the low order bits of the hash, we have
a full hash match if /proc/sys/net/core/rps_sock_flow_entries is big
enough.
If the hash found in the flow table does not match, we fall back to RPS
(if it is enabled for the rxqueue).
This means that a packet for a non-connected flow can avoid the
IPI through an unrelated/victim CPU.
This also means we no longer have to clear the table at socket
close time, and this helps short lived flows performance.
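The split of the 32-bit table entry described above can be sketched as follows. This is a minimal user-space illustration under stated assumptions: `TABLE_SIZE`, `CPU_BITS`, `record_flow` and `lookup_flow` are hypothetical names, the 6-bit CPU field is the example from the commit message, and the sketch does not model empty entries.

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_SIZE 64u /* hypothetical size; must be a power of two */
#define CPU_BITS   6u  /* example from the text: fewer than 64 CPUs */
#define CPU_MASK   ((1u << CPU_BITS) - 1u)

/* Each entry holds the upper hash bits plus the CPU number in the low bits. */
static uint32_t sock_flow_table[TABLE_SIZE];

/* Record that the flow with this hash was last handled on `cpu`. */
static void record_flow(uint32_t hash, uint32_t cpu)
{
	sock_flow_table[hash & (TABLE_SIZE - 1u)] =
		(hash & ~CPU_MASK) | (cpu & CPU_MASK);
}

/*
 * Look up the desired CPU for a hash. Returns false when the upper hash
 * bits stored in the bucket belong to a different flow (a collision),
 * in which case the caller would fall back to plain RPS.
 */
static bool lookup_flow(uint32_t hash, uint32_t *cpu)
{
	uint32_t ident = sock_flow_table[hash & (TABLE_SIZE - 1u)];

	if ((ident ^ hash) & ~CPU_MASK)
		return false;
	*cpu = ident & CPU_MASK;
	return true;
}
```

With 64 buckets the bucket index covers the same 6 low bits that the CPU number overwrites, so bucket selection plus the 26 stored bits amounts to a full 32-bit hash match; a smaller table would leave some bits unchecked.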
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-06 20:59:01 +00:00
|
|
|
try_rps:
|
|
|
|
|
2010-03-16 08:03:29 +00:00
|
|
|
if (map) {
|
2014-08-23 18:58:54 +00:00
|
|
|
tcpu = map->cpus[reciprocal_scale(hash, map->len)];
|
2010-03-16 08:03:29 +00:00
|
|
|
if (cpu_online(tcpu)) {
|
|
|
|
cpu = tcpu;
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
done:
|
|
|
|
return cpu;
|
|
|
|
}
|
|
|
|
|
2011-01-19 11:03:53 +00:00
|
|
|
#ifdef CONFIG_RFS_ACCEL
|
|
|
|
|
|
|
|
/**
|
|
|
|
* rps_may_expire_flow - check whether an RFS hardware filter may be removed
|
|
|
|
* @dev: Device on which the filter was set
|
|
|
|
* @rxq_index: RX queue index
|
|
|
|
* @flow_id: Flow ID passed to ndo_rx_flow_steer()
|
|
|
|
* @filter_id: Filter ID returned by ndo_rx_flow_steer()
|
|
|
|
*
|
|
|
|
* Drivers that implement ndo_rx_flow_steer() should periodically call
|
|
|
|
* this function for each installed filter and remove the filters for
|
|
|
|
* which it returns %true.
|
|
|
|
*/
|
|
|
|
bool rps_may_expire_flow(struct net_device *dev, u16 rxq_index,
|
|
|
|
u32 flow_id, u16 filter_id)
|
|
|
|
{
|
|
|
|
struct netdev_rx_queue *rxqueue = dev->_rx + rxq_index;
|
|
|
|
struct rps_dev_flow_table *flow_table;
|
|
|
|
struct rps_dev_flow *rflow;
|
|
|
|
bool expire = true;
|
2015-04-25 16:35:24 +00:00
|
|
|
unsigned int cpu;
|
2011-01-19 11:03:53 +00:00
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
flow_table = rcu_dereference(rxqueue->rps_flow_table);
|
|
|
|
if (flow_table && flow_id <= flow_table->mask) {
|
|
|
|
rflow = &flow_table->flows[flow_id];
|
|
|
|
cpu = ACCESS_ONCE(rflow->cpu);
|
2015-04-25 16:35:24 +00:00
|
|
|
if (rflow->filter == filter_id && cpu < nr_cpu_ids &&
|
2011-01-19 11:03:53 +00:00
|
|
|
((int)(per_cpu(softnet_data, cpu).input_queue_head -
|
|
|
|
rflow->last_qtail) <
|
|
|
|
(int)(10 * flow_table->mask)))
|
|
|
|
expire = false;
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
return expire;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(rps_may_expire_flow);
|
|
|
|
|
|
|
|
#endif /* CONFIG_RFS_ACCEL */
|
|
|
|
|
2010-03-16 08:03:29 +00:00
|
|
|
/* Called from hardirq (IPI) context */
|
2010-04-19 21:17:14 +00:00
|
|
|
static void rps_trigger_softirq(void *data)
|
2010-03-16 08:03:29 +00:00
|
|
|
{
|
2010-04-19 21:17:14 +00:00
|
|
|
struct softnet_data *sd = data;
|
|
|
|
|
2010-05-07 05:07:48 +00:00
|
|
|
____napi_schedule(sd, &sd->backlog);
|
2010-05-02 05:42:16 +00:00
|
|
|
sd->received_rps++;
|
2010-03-16 08:03:29 +00:00
|
|
|
}
|
2010-04-19 21:17:14 +00:00
|
|
|
|
2010-04-16 23:01:27 +00:00
|
|
|
#endif /* CONFIG_RPS */
|
2010-03-16 08:03:29 +00:00
|
|
|
|
2010-04-19 21:17:14 +00:00
|
|
|
/*
|
|
|
|
* Check if this softnet_data structure is another cpu one
|
|
|
|
* If yes, queue it to our IPI list and return 1
|
|
|
|
* If no, return 0
|
|
|
|
*/
|
|
|
|
static int rps_ipi_queued(struct softnet_data *sd)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_RPS
|
2014-08-17 17:30:35 +00:00
|
|
|
struct softnet_data *mysd = this_cpu_ptr(&softnet_data);
|
2010-04-19 21:17:14 +00:00
|
|
|
|
|
|
|
if (sd != mysd) {
|
|
|
|
sd->rps_ipi_next = mysd->rps_ipi_list;
|
|
|
|
mysd->rps_ipi_list = sd;
|
|
|
|
|
|
|
|
__raise_softirq_irqoff(NET_RX_SOFTIRQ);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
#endif /* CONFIG_RPS */
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2013-05-20 04:02:32 +00:00
|
|
|
#ifdef CONFIG_NET_FLOW_LIMIT
|
|
|
|
int netdev_flow_limit_table_len __read_mostly = (1 << 12);
|
|
|
|
#endif
|
|
|
|
|
|
|
|
static bool skb_flow_limit(struct sk_buff *skb, unsigned int qlen)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_NET_FLOW_LIMIT
|
|
|
|
struct sd_flow_limit *fl;
|
|
|
|
struct softnet_data *sd;
|
|
|
|
unsigned int old_flow, new_flow;
|
|
|
|
|
|
|
|
if (qlen < (netdev_max_backlog >> 1))
|
|
|
|
return false;
|
|
|
|
|
2014-08-17 17:30:35 +00:00
|
|
|
sd = this_cpu_ptr(&softnet_data);
|
2013-05-20 04:02:32 +00:00
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
fl = rcu_dereference(sd->flow_limit);
|
|
|
|
if (fl) {
|
2013-12-16 06:12:06 +00:00
|
|
|
new_flow = skb_get_hash(skb) & (fl->num_buckets - 1);
|
2013-05-20 04:02:32 +00:00
|
|
|
old_flow = fl->history[fl->history_head];
|
|
|
|
fl->history[fl->history_head] = new_flow;
|
|
|
|
|
|
|
|
fl->history_head++;
|
|
|
|
fl->history_head &= FLOW_LIMIT_HISTORY - 1;
|
|
|
|
|
|
|
|
if (likely(fl->buckets[old_flow]))
|
|
|
|
fl->buckets[old_flow]--;
|
|
|
|
|
|
|
|
if (++fl->buckets[new_flow] > (FLOW_LIMIT_HISTORY >> 1)) {
|
|
|
|
fl->count++;
|
|
|
|
rcu_read_unlock();
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
#endif
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2010-03-16 08:03:29 +00:00
|
|
|
/*
|
|
|
|
* enqueue_to_backlog is called to queue an skb to a per CPU backlog
|
|
|
|
* queue (may be a remote CPU queue).
|
|
|
|
*/
|
2010-04-16 23:01:27 +00:00
|
|
|
static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
|
|
|
|
unsigned int *qtail)
|
2010-03-16 08:03:29 +00:00
|
|
|
{
|
2010-04-19 21:17:14 +00:00
|
|
|
struct softnet_data *sd;
|
2010-03-16 08:03:29 +00:00
|
|
|
unsigned long flags;
|
2013-05-20 04:02:32 +00:00
|
|
|
unsigned int qlen;
|
2010-03-16 08:03:29 +00:00
|
|
|
|
2010-04-19 21:17:14 +00:00
|
|
|
sd = &per_cpu(softnet_data, cpu);
|
2010-03-16 08:03:29 +00:00
|
|
|
|
|
|
|
local_irq_save(flags);
|
|
|
|
|
2010-04-19 21:17:14 +00:00
|
|
|
rps_lock(sd);
|
2015-07-09 06:59:09 +00:00
|
|
|
if (!netif_running(skb->dev))
|
|
|
|
goto drop;
|
2013-05-20 04:02:32 +00:00
|
|
|
qlen = skb_queue_len(&sd->input_pkt_queue);
|
|
|
|
if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) {
|
2014-12-08 01:42:55 +00:00
|
|
|
if (qlen) {
|
2010-03-16 08:03:29 +00:00
|
|
|
enqueue:
|
2010-04-19 21:17:14 +00:00
|
|
|
__skb_queue_tail(&sd->input_pkt_queue, skb);
|
2010-05-20 18:37:59 +00:00
|
|
|
input_queue_tail_incr_save(sd, qtail);
|
2010-04-19 21:17:14 +00:00
|
|
|
rps_unlock(sd);
|
2010-03-30 20:16:22 +00:00
|
|
|
local_irq_restore(flags);
|
2010-03-16 08:03:29 +00:00
|
|
|
return NET_RX_SUCCESS;
|
|
|
|
}
|
|
|
|
|
2010-05-06 23:51:21 +00:00
|
|
|
/* Schedule NAPI for backlog device
|
|
|
|
* We can use non atomic operation since we own the queue lock
|
|
|
|
*/
|
|
|
|
if (!__test_and_set_bit(NAPI_STATE_SCHED, &sd->backlog.state)) {
|
2010-04-19 21:17:14 +00:00
|
|
|
if (!rps_ipi_queued(sd))
|
2010-05-07 05:07:48 +00:00
|
|
|
____napi_schedule(sd, &sd->backlog);
|
2010-03-16 08:03:29 +00:00
|
|
|
}
|
|
|
|
goto enqueue;
|
|
|
|
}
|
|
|
|
|
2015-07-09 06:59:09 +00:00
|
|
|
drop:
|
2010-05-02 05:42:16 +00:00
|
|
|
sd->dropped++;
|
2010-04-19 21:17:14 +00:00
|
|
|
rps_unlock(sd);
|
2010-03-16 08:03:29 +00:00
|
|
|
|
|
|
|
local_irq_restore(flags);
|
|
|
|
|
2010-09-30 21:06:55 +00:00
|
|
|
atomic_long_inc(&skb->dev->rx_dropped);
|
2010-03-16 08:03:29 +00:00
|
|
|
kfree_skb(skb);
|
|
|
|
return NET_RX_DROP;
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2017-07-17 16:26:45 +00:00
|
|
|
static u32 netif_receive_generic_xdp(struct sk_buff *skb,
|
|
|
|
struct bpf_prog *xdp_prog)
|
|
|
|
{
|
bpf: add meta pointer for direct access
This work enables generic transfer of metadata from XDP into skb. The
basic idea is that we can make use of the fact that the resulting skb
must be linear and already comes with a larger headroom for supporting
bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
for adjusting a new pointer called xdp->data_meta. Thus, the packet has
a flexible and programmable room for meta data, followed by the actual
packet data. struct xdp_buff is therefore laid out such that we first point
to data_hard_start, then data_meta directly prepended to data, followed
by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
account whether we have meta data already prepended and if so, memmove()s
this along with the given offset provided there's enough room.
xdp->data_meta is optional and programs are not required to use it. The
rationale is that when we process the packet in XDP (e.g. as DoS filter),
we can push further meta data along with it for the XDP_PASS case, and
give the guarantee that a clsact ingress BPF program on the same device
can pick this up for further post-processing. Since we work with skb
there, we can also set skb->mark, skb->priority or other skb meta data
out of BPF, thus having this scratch space generic and programmable
allows for more flexibility than defining a direct 1:1 transfer of
potentially new XDP members into skb (it's also more efficient as we
don't need to initialize/handle each of such new members). The facility
also works together with GRO aggregation. The scratch space at the head
of the packet can be a multiple of 4 bytes, up to 32 bytes large. Drivers not
yet supporting xdp->data_meta can simply be set up with xdp->data_meta
as xdp->data + 1, as bpf_xdp_adjust_meta() will detect this and bail out,
such that the subsequent match against xdp->data for later access is
guaranteed to fail.
The verifier treats xdp->data_meta/xdp->data the same way as we treat
xdp->data/xdp->data_end pointer comparisons. The requirement for doing
the compare against xdp->data is that it hasn't been modified from its
original address we got from ctx access. It may have a range marking
already from prior successful xdp->data/xdp->data_end pointer comparisons
though.
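The buffer layout and the constraints on the meta data area described above can be sketched in user-space C. This is an illustration only, not the kernel helper: `struct xdp_buff_sketch`, `META_MAX` and `adjust_meta` are hypothetical names, and the -1 return value stands in for whatever error codes the real bpf_xdp_adjust_meta() uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the four pointers named in the text. */
struct xdp_buff_sketch {
	uint8_t *data_hard_start; /* start of usable headroom */
	uint8_t *data_meta;       /* start of meta data, at or before data */
	uint8_t *data;            /* start of packet data */
	uint8_t *data_end;        /* end of packet data */
};

#define META_MAX 32 /* scratch space limit stated in the text */

/*
 * Move data_meta by `offset` bytes (negative grows the meta area).
 * Returns 0 on success and -1 if the request is rejected.
 */
static int adjust_meta(struct xdp_buff_sketch *xdp, int offset)
{
	ptrdiff_t cur, lim, next, metalen;

	/* Drivers without support set data_meta to data + 1. */
	if (xdp->data_meta > xdp->data)
		return -1;

	cur = xdp->data_meta - xdp->data_hard_start;
	lim = xdp->data - xdp->data_hard_start;
	next = cur + offset;

	/* The meta area must stay between data_hard_start and data. */
	if (next < 0 || next > lim)
		return -1;

	/* Its size must be a multiple of 4 bytes and at most 32 bytes. */
	metalen = lim - next;
	if ((metalen & 3) || metalen > META_MAX)
		return -1;

	xdp->data_meta = xdp->data_hard_start + next;
	return 0;
}
```

The sketch works on offsets from data_hard_start rather than on raw pointers so that a rejected request never forms an out-of-bounds pointer.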
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-25 00:25:51 +00:00
|
|
|
u32 metalen, act = XDP_DROP;
|
2017-07-17 16:26:45 +00:00
|
|
|
struct xdp_buff xdp;
|
|
|
|
void *orig_data;
|
|
|
|
int hlen, off;
|
|
|
|
u32 mac_len;
|
|
|
|
|
|
|
|
/* Reinjected packets coming from act_mirred or similar should
|
|
|
|
* not get XDP generic processing.
|
|
|
|
*/
|
|
|
|
if (skb_cloned(skb))
|
|
|
|
return XDP_PASS;
|
|
|
|
|
2017-09-25 00:25:51 +00:00
	/* XDP packets must be linear and must have sufficient headroom
	 * of XDP_PACKET_HEADROOM bytes. This is the guarantee that also
	 * native XDP provides, thus we need to do it here as well.
	 */
	if (skb_is_nonlinear(skb) ||
	    skb_headroom(skb) < XDP_PACKET_HEADROOM) {
		int hroom = XDP_PACKET_HEADROOM - skb_headroom(skb);
		int troom = skb->tail + skb->data_len - skb->end;

		/* In case we have to go down the path and also linearize,
		 * then lets do the pskb_expand_head() work just once here.
		 */
		if (pskb_expand_head(skb,
				     hroom > 0 ? ALIGN(hroom, NET_SKB_PAD) : 0,
				     troom > 0 ? troom + 128 : 0, GFP_ATOMIC))
			goto do_drop;
		if (troom > 0 && __skb_linearize(skb))
			goto do_drop;
	}
2017-07-17 16:26:45 +00:00

	/* The XDP program wants to see the packet starting at the MAC
	 * header.
	 */
	mac_len = skb->data - skb_mac_header(skb);
	hlen = skb_headlen(skb) + mac_len;
	xdp.data = skb->data - mac_len;
bpf: add meta pointer for direct access
This work enables generic transfer of metadata from XDP into skb. The
basic idea is that we can make use of the fact that the resulting skb
must be linear and already comes with a larger headroom for supporting
bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
for adjusting a new pointer called xdp->data_meta. Thus, the packet has
a flexible and programmable room for meta data, followed by the actual
packet data. struct xdp_buff is therefore laid out such that we first point
to data_hard_start, then data_meta directly prepended to data followed
by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
account whether we have meta data already prepended and if so, memmove()s
this along with the given offset provided there's enough room.
xdp->data_meta is optional and programs are not required to use it. The
rationale is that when we process the packet in XDP (e.g. as DoS filter),
we can push further meta data along with it for the XDP_PASS case, and
give the guarantee that a clsact ingress BPF program on the same device
can pick this up for further post-processing. Since we work with skb
there, we can also set skb->mark, skb->priority or other skb meta data
out of BPF, thus having this scratch space generic and programmable
allows for more flexibility than defining a direct 1:1 transfer of
potentially new XDP members into skb (it's also more efficient as we
don't need to initialize/handle each of such new members). The facility
also works together with GRO aggregation. The scratch space at the head
of the packet can be a multiple of 4 bytes, up to 32 bytes in size. Drivers not
yet supporting xdp->data_meta can simply be set up with xdp->data_meta
as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
such that the subsequent match against xdp->data for later access is
guaranteed to fail.
The verifier treats xdp->data_meta/xdp->data the same way as we treat
xdp->data/xdp->data_end pointer comparisons. The requirement for doing
the compare against xdp->data is that it hasn't been modified from its
original address we got from ctx access. It may have a range marking
already from prior successful xdp->data/xdp->data_end pointer comparisons
though.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-25 00:25:51 +00:00
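The pointer layout and size rules described in the commit message above can be modeled in plain userspace C. This is a minimal sketch under stated assumptions, not the kernel helper: `struct model_buff`, `model_adjust_meta()` and `MODEL_META_MAX` are hypothetical names, byte offsets stand in for pointers, and a plain -1 replaces whatever error codes the real helper returns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct xdp_buff. All fields are
 * byte offsets into one buffer instead of real pointers. */
struct model_buff {
	long hard_start;  /* start of the usable headroom */
	long data_meta;   /* start of metadata, <= data when supported */
	long data;        /* start of packet data */
	long data_end;    /* end of packet data */
};

#define MODEL_META_MAX 32 /* upper bound stated in the commit message */

/* Move data_meta by offset (a negative offset grows the metadata area).
 * Returns 0 on success and -1 if the request is rejected. */
int model_adjust_meta(struct model_buff *b, long offset)
{
	long meta, metalen;

	assert(b != NULL);

	/* Drivers without support are set up with data_meta = data + 1. */
	if (b->data_meta > b->data)
		return -1;

	meta = b->data_meta + offset;
	if (meta < b->hard_start || meta > b->data)
		return -1;

	/* Metadata must be a multiple of 4 bytes and at most 32 bytes. */
	metalen = b->data - meta;
	if (metalen % 4 != 0 || metalen > MODEL_META_MAX)
		return -1;

	b->data_meta = meta;
	return 0;
}
```

In this model, reserving 8 bytes with an offset of -8 succeeds, while a later request that would leave a 10-byte or 40-byte area is refused and leaves `data_meta` unchanged.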
	xdp.data_meta = xdp.data;
2017-07-17 16:26:45 +00:00
	xdp.data_end = xdp.data + hlen;
	xdp.data_hard_start = skb->data - skb_headroom(skb);
	orig_data = xdp.data;

	act = bpf_prog_run_xdp(xdp_prog, &xdp);

	off = xdp.data - orig_data;
	if (off > 0)
		__skb_pull(skb, off);
	else if (off < 0)
		__skb_push(skb, -off);
2017-09-19 17:45:56 +00:00
	skb->mac_header += off;
2017-07-17 16:26:45 +00:00

	switch (act) {
2017-07-17 16:27:50 +00:00
	case XDP_REDIRECT:
2017-07-17 16:26:45 +00:00
	case XDP_TX:
		__skb_push(skb, mac_len);
2017-09-25 00:25:51 +00:00
		break;
2017-07-17 16:26:45 +00:00
	case XDP_PASS:
2017-09-25 00:25:51 +00:00
		metalen = xdp.data - xdp.data_meta;
		if (metalen)
			skb_metadata_set(skb, metalen);
2017-07-17 16:26:45 +00:00
		break;
	default:
		bpf_warn_invalid_xdp_action(act);
		/* fall through */
	case XDP_ABORTED:
		trace_xdp_exception(skb->dev, xdp_prog, act);
		/* fall through */
	case XDP_DROP:
	do_drop:
		kfree_skb(skb);
		break;
	}

	return act;
}
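The function above mirrors any change the program made to xdp.data back onto the skb, and derives the metadata length for the XDP_PASS case. The arithmetic can be sketched in userspace as follows. `struct model_skb`, `model_apply_head_shift()` and `model_metalen()` are hypothetical names, and byte offsets are used in place of the real pointers, so this only illustrates the bookkeeping and is not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the few sk_buff fields involved. All values
 * are byte offsets into one buffer rather than real pointers. */
struct model_skb {
	long data;        /* models skb->data */
	long len;         /* models skb->len */
	long mac_header;  /* models skb->mac_header */
};

/* Mirror a change of xdp.data (orig_data -> new_data) onto the skb. */
void model_apply_head_shift(struct model_skb *skb, long orig_data,
			    long new_data)
{
	long off = new_data - orig_data;

	assert(skb != NULL);

	if (off > 0) {          /* program stripped bytes: like __skb_pull() */
		skb->data += off;
		skb->len -= off;
	} else if (off < 0) {   /* program prepended bytes: like __skb_push() */
		skb->data -= -off;
		skb->len += -off;
	}
	skb->mac_header += off; /* the MAC header follows the packet start */
}

/* Metadata length as computed in the XDP_PASS case. */
long model_metalen(long data, long data_meta)
{
	return data - data_meta;
}
```

For example, if the program moves the packet start 4 bytes toward the headroom, the modeled skb grows by 4 bytes and its MAC header offset drops by 4.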

/* When doing generic XDP we have to bypass the qdisc layer and the
 * network taps in order to match in-driver-XDP behavior.
 */
2017-08-11 11:41:17 +00:00
void generic_xdp_tx(struct sk_buff *skb, struct bpf_prog *xdp_prog)
2017-07-17 16:26:45 +00:00
{
	struct net_device *dev = skb->dev;
	struct netdev_queue *txq;
	bool free_skb = true;
	int cpu, rc;

	txq = netdev_pick_tx(dev, skb, NULL);
	cpu = smp_processor_id();
	HARD_TX_LOCK(dev, txq, cpu);
	if (!netif_xmit_stopped(txq)) {
		rc = netdev_start_xmit(skb, dev, txq, 0);
		if (dev_xmit_complete(rc))
			free_skb = false;
	}
	HARD_TX_UNLOCK(dev, txq);
	if (free_skb) {
		trace_xdp_exception(dev, xdp_prog, XDP_TX);
		kfree_skb(skb);
	}
}
2017-08-11 11:41:17 +00:00
EXPORT_SYMBOL_GPL(generic_xdp_tx);
2017-07-17 16:26:45 +00:00

static struct static_key generic_xdp_needed __read_mostly;

2017-08-11 11:41:17 +00:00
int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
2017-07-17 16:26:45 +00:00
{
	if (xdp_prog) {
		u32 act = netif_receive_generic_xdp(skb, xdp_prog);
2017-07-17 16:27:50 +00:00
		int err;
2017-07-17 16:26:45 +00:00

		if (act != XDP_PASS) {
2017-07-17 16:27:50 +00:00
			switch (act) {
			case XDP_REDIRECT:
2017-08-24 10:33:08 +00:00
				err = xdp_do_generic_redirect(skb->dev, skb,
							      xdp_prog);
2017-07-17 16:27:50 +00:00
				if (err)
					goto out_redir;
				/* fallthru to submit skb */
			case XDP_TX:
2017-07-17 16:26:45 +00:00
				generic_xdp_tx(skb, xdp_prog);
2017-07-17 16:27:50 +00:00
				break;
			}
2017-07-17 16:26:45 +00:00
			return XDP_DROP;
		}
	}
	return XDP_PASS;
2017-07-17 16:27:50 +00:00
out_redir:
	kfree_skb(skb);
	return XDP_DROP;
2017-07-17 16:26:45 +00:00
}
2017-08-11 11:41:17 +00:00
EXPORT_SYMBOL_GPL(do_xdp_generic);
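One property of do_xdp_generic() that is easy to miss is that its caller only ever sees XDP_PASS or XDP_DROP, regardless of whether the packet was transmitted, redirected or freed. The sketch below models that mapping in userspace. The enum values, `struct model_outcome` and `model_do_xdp_generic()` are hypothetical and exist only for this illustration; they are not the kernel's constants or types.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical verdict values for this sketch only; they are not the
 * kernel's enum xdp_action constants. */
enum model_act {
	MODEL_ABORTED, MODEL_DROP, MODEL_PASS, MODEL_TX, MODEL_REDIRECT
};

struct model_outcome {
	enum model_act seen_by_caller; /* only MODEL_PASS or MODEL_DROP */
	bool transmitted;              /* packet was handed to the TX path */
};

struct model_outcome model_do_xdp_generic(bool have_prog, enum model_act act,
					  int redirect_err)
{
	struct model_outcome out = { MODEL_PASS, false };

	/* No program attached, or the program passed the packet on. */
	if (!have_prog || act == MODEL_PASS)
		return out;

	out.seen_by_caller = MODEL_DROP;
	switch (act) {
	case MODEL_REDIRECT:
		if (redirect_err)
			break;  /* models out_redir: the packet is freed */
		/* fall through */
	case MODEL_TX:
		out.transmitted = true;
		break;
	default:
		break;          /* dropped or aborted before this point */
	}
	return out;
}
```

The caller can therefore treat anything other than a pass verdict as "packet consumed", which is what netif_rx_internal() relies on.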
2017-07-17 16:26:45 +00:00

2014-01-10 22:17:24 +00:00
static int netif_rx_internal(struct sk_buff *skb)
2005-04-16 22:20:36 +00:00
{
2010-04-15 07:14:07 +00:00
	int ret;
2005-04-16 22:20:36 +00:00

2011-11-15 04:12:55 +00:00
	net_timestamp_check(netdev_tstamp_prequeue, skb);
2005-04-16 22:20:36 +00:00

2010-08-23 09:45:02 +00:00
	trace_netif_rx(skb);
2017-07-17 16:26:45 +00:00

	if (static_key_false(&generic_xdp_needed)) {
2017-09-08 21:00:30 +00:00
		int ret;

		preempt_disable();
		rcu_read_lock();
		ret = do_xdp_generic(rcu_dereference(skb->dev->xdp_prog), skb);
		rcu_read_unlock();
		preempt_enable();
2017-07-17 16:26:45 +00:00

2017-07-17 16:27:50 +00:00
		/* Consider XDP consuming the packet a success from
		 * the netdev point of view we do not want to count
		 * this as an error.
		 */
2017-07-17 16:26:45 +00:00
		if (ret != XDP_PASS)
2017-07-17 16:27:50 +00:00
			return NET_RX_SUCCESS;
2017-07-17 16:26:45 +00:00
	}

2010-03-24 19:13:54 +00:00
#ifdef CONFIG_RPS
2012-02-24 07:31:31 +00:00
	if (static_key_false(&rps_needed)) {
rfs: Receive Flow Steering
This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).
The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure. The rxhash is passed in skb's received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.
The complication with the simple approach is that it would potentially
allow OOO packets. If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.
To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.
rps_sock_flow_table is a global hash table. Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.
rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.
Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:
- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table. This checks if the queue head has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.
Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to interrupting CPU is good for locality. 2)
this allows lockless access to the table-- the CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb, we assume that this is only called from
device napi_poll which is non-reentrant.
This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.
There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table, the per rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue. Both are rounded to power of two.
The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
applications processing; this can result in increased performance
(higher pps, lower latency).
The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.
Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp. The RPC test is a request/response
test similar in structure to netperf RR test with 100 threads on
each host, but does more work in userspace than netperf.
e1000e on 8 core Intel
   No RFS or RPS              104K tps at 30% CPU
   No RFS (best RPS config):  290K tps at 63% CPU
   RFS                        303K tps at 61% CPU

RPC test       tps   CPU%  50/90/99% usec latency  Latency StdDev
   No RFS/RPS  103K  48%   757/900/3185            4472.35
   RPS only:   174K  73%   415/993/2468            491.66
   RFS         223K  73%   379/651/1382            315.61
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16 23:01:27 +00:00
		struct rps_dev_flow voidflow, *rflow = &voidflow;
2010-04-15 07:14:07 +00:00
		int cpu;

2010-08-08 03:35:43 +00:00
		preempt_disable();
2010-04-15 07:14:07 +00:00
		rcu_read_lock();
2010-04-16 23:01:27 +00:00

		cpu = get_rps_cpu(skb->dev, skb, &rflow);
2010-04-15 07:14:07 +00:00
		if (cpu < 0)
			cpu = smp_processor_id();
2010-04-16 23:01:27 +00:00

		ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);

2010-04-15 07:14:07 +00:00
		rcu_read_unlock();
2010-08-08 03:35:43 +00:00
		preempt_enable();
2011-11-17 03:13:26 +00:00
	} else
#endif
rfs: Receive Flow Steering
This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).
The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure. The rxhash is passed in skb's received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.
The convolution of the simple approach is that it would potentially
allow OOO packets. If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.
To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.
rps_sock_table is a global hash table. Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.
rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.
Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:
- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table. This checks if the queue tail has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.
Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to interrupting CPU s good for locality. 2)
this allows lockless access to the table-- the CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb, we assume that this is only called from
device napi_poll which is non-reentrant.
This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.
There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table, the per rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue. Both are rounded to power of two.
The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
applications processing; this can result in increased performance
(higher pps, lower latency).
The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.
Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp. The RPC test is a request/response
test similar in structure to the netperf RR test with 100 threads on
each host, but does more work in userspace than netperf.
e1000e on 8 core Intel
   No RFS or RPS              104K tps at 30% CPU
   No RFS (best RPS config):  290K tps at 63% CPU
   RFS                        303K tps at 61% CPU
RPC test       tps   CPU%  50/90/99% usec latency  Latency StdDev
   No RFS/RPS  103K  48%   757/900/3185            4472.35
   RPS only:   174K  73%   415/993/2468            491.66
   RFS         223K  73%   379/651/1382            315.61
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16 23:01:27 +00:00
|
|
|
{
|
|
|
|
unsigned int qtail;
|
2017-02-09 06:56:07 +00:00
|
|
|
|
2010-04-16 23:01:27 +00:00
|
|
|
ret = enqueue_to_backlog(skb, get_cpu(), &qtail);
|
|
|
|
put_cpu();
|
|
|
|
}
|
2010-04-15 07:14:07 +00:00
|
|
|
return ret;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2014-01-10 22:17:24 +00:00
|
|
|
|
|
|
|
/**
|
|
|
|
* netif_rx - post buffer to the network code
|
|
|
|
* @skb: buffer to post
|
|
|
|
*
|
|
|
|
* This function receives a packet from a device driver and queues it for
|
|
|
|
* the upper (protocol) levels to process. It always succeeds. The buffer
|
|
|
|
* may be dropped during processing for congestion control or by the
|
|
|
|
* protocol layers.
|
|
|
|
*
|
|
|
|
* return values:
|
|
|
|
* NET_RX_SUCCESS (no congestion)
|
|
|
|
* NET_RX_DROP (packet was dropped)
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
|
|
|
|
int netif_rx(struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
trace_netif_rx_entry(skb);
|
|
|
|
|
|
|
|
return netif_rx_internal(skb);
|
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(netif_rx);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
int netif_rx_ni(struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
int err;
|
|
|
|
|
2014-01-10 22:17:24 +00:00
|
|
|
trace_netif_rx_ni_entry(skb);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
preempt_disable();
|
2014-01-10 22:17:24 +00:00
|
|
|
err = netif_rx_internal(skb);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (local_softirq_pending())
|
|
|
|
do_softirq();
|
|
|
|
preempt_enable();
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netif_rx_ni);
|
|
|
|
|
2016-06-20 18:42:34 +00:00
|
|
|
static __latent_entropy void net_tx_action(struct softirq_action *h)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2014-08-17 17:30:35 +00:00
|
|
|
struct softnet_data *sd = this_cpu_ptr(&softnet_data);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if (sd->completion_queue) {
|
|
|
|
struct sk_buff *clist;
|
|
|
|
|
|
|
|
local_irq_disable();
|
|
|
|
clist = sd->completion_queue;
|
|
|
|
sd->completion_queue = NULL;
|
|
|
|
local_irq_enable();
|
|
|
|
|
|
|
|
while (clist) {
|
|
|
|
struct sk_buff *skb = clist;
|
2017-02-09 06:56:07 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
clist = clist->next;
|
|
|
|
|
2017-06-30 10:07:58 +00:00
|
|
|
WARN_ON(refcount_read(&skb->users));
|
2013-12-05 12:45:08 +00:00
|
|
|
if (likely(get_kfree_skb_cb(skb)->reason == SKB_REASON_CONSUMED))
|
|
|
|
trace_consume_skb(skb);
|
|
|
|
else
|
|
|
|
trace_kfree_skb(skb, net_tx_action);
|
2016-02-08 12:15:04 +00:00
|
|
|
|
|
|
|
if (skb->fclone != SKB_FCLONE_UNAVAILABLE)
|
|
|
|
__kfree_skb(skb);
|
|
|
|
else
|
|
|
|
__kfree_skb_defer(skb);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2016-02-08 12:15:04 +00:00
|
|
|
|
|
|
|
__kfree_skb_flush();
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (sd->output_queue) {
|
2008-07-16 09:15:04 +00:00
|
|
|
struct Qdisc *head;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
local_irq_disable();
|
|
|
|
head = sd->output_queue;
|
|
|
|
sd->output_queue = NULL;
|
2010-04-26 23:06:24 +00:00
|
|
|
sd->output_queue_tailp = &sd->output_queue;
|
2005-04-16 22:20:36 +00:00
|
|
|
local_irq_enable();
|
|
|
|
|
|
|
|
while (head) {
|
2008-07-16 09:15:04 +00:00
|
|
|
struct Qdisc *q = head;
|
|
|
|
spinlock_t *root_lock;
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
head = head->next_sched;
|
|
|
|
|
2008-08-03 03:02:43 +00:00
|
|
|
root_lock = qdisc_lock(q);
|
net: get rid of spin_trylock() in net_tx_action()
Note: Tom Herbert posted almost the same patch 3 months back, but for
different reasons.
The reasons we want to get rid of this spin_trylock() are :
1) Under high qdisc pressure, the spin_trylock() has almost no
chance to succeed.
2) We loop multiple times in softirq handler, eventually reaching
the max retry count (10), and we schedule ksoftirqd.
Since we want to adhere more strictly to ksoftirqd being woken up in
the future (https://lwn.net/Articles/687617/), it is better to avoid
spurious wakeups.
3) calls to __netif_reschedule() dirty the cache line containing
q->next_sched, slowing down the owner of qdisc.
4) RT kernels cannot use the spin_trylock() here.
With the help of busylock, we get the qdisc spinlock fast enough, and
the trylock trick brings only a performance penalty.
Depending on qdisc setup, I observed a gain of up to 19 % in qdisc
performance (1016600 pps instead of 853400 pps, using prio+tbf+fq_codel)
("mpstat -I SCPU 1" is much happier now)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
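The difference between the two locking patterns can be modelled in userspace with a toy spinlock built on C11 atomics. Everything here (the toy_* types and functions, and the counters) is hypothetical and only mirrors the shape of the argument above: under contention the trylock variant does no work and records a reschedule, while the plain lock variant waits and then runs.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spinlock; a stand-in for the qdisc root lock, not a kernel API. */
struct toy_spinlock {
	atomic_flag held;
};

struct toy_qdisc {
	struct toy_spinlock lock;
	int runs;        /* times the queue was actually serviced */
	int reschedules; /* times servicing was deferred */
};

static void toy_qdisc_init(struct toy_qdisc *q)
{
	atomic_flag_clear(&q->lock.held);
	q->runs = 0;
	q->reschedules = 0;
}

static bool toy_spin_trylock(struct toy_spinlock *l)
{
	return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void toy_spin_lock(struct toy_spinlock *l)
{
	while (!toy_spin_trylock(l))
		; /* busy-wait until the holder releases the lock */
}

static void toy_spin_unlock(struct toy_spinlock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Old pattern: give up and reschedule when the lock is contended. */
static void toy_run_with_trylock(struct toy_qdisc *q)
{
	if (toy_spin_trylock(&q->lock)) {
		q->runs++;
		toy_spin_unlock(&q->lock);
	} else {
		q->reschedules++; /* another softirq pass would be needed */
	}
}

/* New pattern: wait for the lock, then service the queue once. */
static void toy_run_with_lock(struct toy_qdisc *q)
{
	toy_spin_lock(&q->lock);
	q->runs++;
	toy_spin_unlock(&q->lock);
}
```

With the lock already held elsewhere, toy_run_with_trylock() only increments reschedules; each such miss corresponds to the extra softirq loops described in point 2.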
2016-06-05 03:02:28 +00:00
|
|
|
spin_lock(root_lock);
|
|
|
|
/* We need to make sure head->next_sched is read
|
|
|
|
* before clearing __QDISC_STATE_SCHED
|
|
|
|
*/
|
|
|
|
smp_mb__before_atomic();
|
|
|
|
clear_bit(__QDISC_STATE_SCHED, &q->state);
|
|
|
|
qdisc_run(q);
|
|
|
|
spin_unlock(root_lock);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-09-09 12:43:15 +00:00
|
|
|
#if IS_ENABLED(CONFIG_BRIDGE) && IS_ENABLED(CONFIG_ATM_LANE)
|
2009-06-05 05:35:28 +00:00
|
|
|
/* This hook is defined here for ATM LANE */
|
|
|
|
int (*br_fdb_test_addr_hook)(struct net_device *dev,
|
|
|
|
unsigned char *addr) __read_mostly;
|
2009-09-11 18:50:08 +00:00
|
|
|
EXPORT_SYMBOL_GPL(br_fdb_test_addr_hook);
|
2009-06-05 05:35:28 +00:00
|
|
|
#endif
|
2005-04-16 22:20:36 +00:00
|
|
|
|
net, sched: add clsact qdisc
This work adds a generalization of the ingress qdisc as a qdisc holding
only classifiers. The clsact qdisc works on ingress, but also on egress.
In both cases, its execution happens without taking the qdisc lock, and
the main difference for the egress part compared to prior version of [1]
is that this can be applied with _any_ underlying real egress qdisc (also
classless ones).
Besides solving the use-case of [1], that is, allowing for more programmability
on assigning skb->priority for the mqprio case that is supported by most
popular 10G+ NICs, it also opens up a lot more flexibility for other tc
applications. The main work on classification can already be done at clsact
egress time if the use-case allows and state stored for later retrieval
f.e. again in skb->priority with major/minors (which is checked by most
classful qdiscs before consulting tc_classify()) and/or in other skb fields
like skb->tc_index for some light-weight post-processing to get to the
eventual classid in case of a classful qdisc. Another use case is that
the clsact egress part allows to have a central egress counterpart to
the ingress classifiers, so that classifiers can easily share state (e.g.
in cls_bpf via eBPF maps) for ingress and egress.
Currently, default setups like mq + pfifo_fast would require for this to
use, for example, prio qdisc instead (to get a tc_classify() run) and to
duplicate the egress classifier for each queue. With clsact, it allows
for leaving the setup as is, it can additionally assign skb->priority to
put the skb in one of pfifo_fast's bands and it can share state with maps.
Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
w/o the need to perform a skb_dst_force() to hold on to it any longer. In
lwt case, we can also use this facility to setup dst metadata via cls_bpf
(bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
that (case of IFF_NO_QUEUE devices, for example).
The realization can be done without any changes to the scheduler core
framework. All it takes is that we have two a-priori defined minors/child
classes, where we can mux between ingress and egress classifier list
(dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
dev->_tx to avoid extra cacheline miss for moderate loads). The egress
part is modelled similarly to handle_ing() and patched to a noop in
case the functionality is not used. Both handlers are now called
sch_handle_ingress() and sch_handle_egress(), code sharing among the two
doesn't seem practical as there are various minor differences in both
paths, so that making them conditional in a single handler would rather
slow things down.
Full compatibility to ingress qdisc is provided as well. Since both
piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
per netdevice, and thus ingress qdisc specific behaviour can be retained
for user space. This means, either a user does 'tc qdisc add dev foo ingress'
and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
alternative, where both, ingress and egress classifier can be configured
as in the below example. ingress qdisc supports attaching classifier to any
minor number whereas clsact has two fixed minors for muxing between the
lists, therefore to not break user space setups, they are better done as
two separate qdiscs.
I decided to extend the sch_ingress module with clsact functionality so
that commonly used code can be reused, the module is being aliased with
sch_clsact so that it can be auto-loaded properly. Alternative would have been
to add a flag when initializing ingress to alter its behaviour plus aliasing
to a different name (as it's more than just ingress). However, the first would
end up, based on the flag, choosing the new/old behaviour by calling different
function implementations to handle each anyway, the latter would require to
register ingress qdisc once again under different alias. So, this really begs
to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
by its own that share callbacks used by both.
Example, adding qdisc:
# tc qdisc add dev foo clsact
# tc qdisc show dev foo
qdisc mq 0: root
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc clsact ffff: parent ffff:fff1
Adding filters (deleting, etc. works analogously by specifying ingress/egress):
# tc filter add dev foo ingress bpf da obj bar.o sec ingress
# tc filter add dev foo egress bpf da obj bar.o sec egress
# tc filter show dev foo ingress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
# tc filter show dev foo egress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action
A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
show an empty list for clsact. Either using the parent names (ingress/egress)
or specifying the full major/minor will then show the related filter lists.
Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.
[1] http://patchwork.ozlabs.org/patch/512949/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
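The muxing between the two fixed minors described above can be sketched as a lookup keyed on the minor part of a class id. The TOY_H_* constants are local stand-ins whose values follow what linux/pkt_sched.h is believed to define for TC_H_CLSACT, TC_H_MIN_INGRESS and TC_H_MIN_EGRESS; they should be checked against the header, and the struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins; values assumed to match linux/pkt_sched.h. */
#define TOY_H_MAJ_MASK    0xFFFF0000U
#define TOY_H_MIN_MASK    0x0000FFFFU
#define TOY_H_MIN(h)      ((h) & TOY_H_MIN_MASK)
#define TOY_H_MAKE(maj, min) \
	(((maj) & TOY_H_MAJ_MASK) | ((min) & TOY_H_MIN_MASK))
#define TOY_H_CLSACT      0xFFFFFFF1U
#define TOY_H_MIN_INGRESS 0xFFF2U
#define TOY_H_MIN_EGRESS  0xFFF3U

struct toy_filter; /* opaque stand-in for a classifier chain */

/* Hypothetical device holding the two classifier lists. */
struct toy_dev {
	struct toy_filter *ingress_cl_list;
	struct toy_filter *egress_cl_list;
};

/* Pick the classifier list addressed by the minor of classid, or NULL. */
static struct toy_filter **toy_clsact_find_list(struct toy_dev *dev,
						uint32_t classid)
{
	switch (TOY_H_MIN(classid)) {
	case TOY_H_MIN_INGRESS:
		return &dev->ingress_cl_list;
	case TOY_H_MIN_EGRESS:
		return &dev->egress_cl_list;
	default:
		return NULL; /* clsact has only the two fixed minors */
	}
}
```

Any other minor yields NULL, which is consistent with 'tc filter show dev foo parent ffff:' showing an empty list for clsact.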
2016-01-07 21:29:47 +00:00
|
|
|
static inline struct sk_buff *
|
|
|
|
sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
|
|
|
|
struct net_device *orig_dev)
|
2007-10-14 07:38:47 +00:00
|
|
|
{
|
2015-05-19 20:33:25 +00:00
|
|
|
#ifdef CONFIG_NET_CLS_ACT
|
net: sched: further simplify handle_ing
Ingress qdisc has no other purpose than calling into tc_classify()
that executes attached classifier(s) and action(s).
It has a 1:1 relationship to dev->ingress_queue. After having commit
087c1a601ad7 ("net: sched: run ingress qdisc without locks") removed
the central ingress lock, one major contention point is gone.
The extra indirection layers however, are not necessary for calling
into ingress qdisc. pktgen calling locally into netif_receive_skb()
with a dummy u32, single CPU result on a Supermicro X10SLM-F, Xeon
E3-1240: before ~21,1 Mpps, after patch ~22,9 Mpps.
We can redirect the private classifier list to the netdev directly,
without changing any classifier API bits (!) and execute on that from
handle_ing() side. The __QDISC_STATE_DEACTIVATE test can be removed,
ingress qdisc doesn't have a queue and thus dev_deactivate_queue()
is also not applicable, ingress_cl_list provides similar behaviour.
In other words, ingress qdisc acts like TCQ_F_BUILTIN qdisc.
One next possible step is the removal of the dev's ingress (dummy)
netdev_queue, and to only have the list member in the netdevice
itself.
Note, the filter chain is RCU protected and individual filter elements
are being kfree'd by sched subsystem after RCU grace period. RCU read
lock is being held by __netif_receive_skb_core().
Joint work with Alexei Starovoitov.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09 20:51:32 +00:00
|
|
|
struct tcf_proto *cl = rcu_dereference_bh(skb->dev->ingress_cl_list);
|
|
|
|
struct tcf_result cl_res;
|
2010-10-02 06:11:55 +00:00
|
|
|
|
2015-05-09 20:51:31 +00:00
|
|
|
/* If there's at least one ingress present somewhere (so
|
|
|
|
* we get here via enabled static key), remaining devices
|
|
|
|
* that are not configured with an ingress qdisc will bail
|
2015-05-09 20:51:32 +00:00
|
|
|
* out here.
|
2015-05-09 20:51:31 +00:00
|
|
|
*/
|
2015-05-09 20:51:32 +00:00
|
|
|
if (!cl)
|
net: use jump label patching for ingress qdisc in __netif_receive_skb_core
Even if we make use of classifier and actions from the egress
path, we're going into handle_ing() executing additional code
on a per-packet cost for ingress qdisc, just to realize that
nothing is attached on ingress.
Instead, this can just be blinded out as a no-op entirely with
the use of a static key. On input fast-path, we already make
use of static keys in various places, e.g. skb time stamping,
in RPS, etc. It makes sense to not waste time when we're assured
that no ingress qdisc is attached anywhere.
Enabling/disabling of that code path is being done via two
helpers, namely net_{inc,dec}_ingress_queue(), that are being
invoked under RTNL mutex when an ingress qdisc is being either
initialized or destructed.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
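The enable/disable semantics of the two helpers can be modelled with a plain counter. This is only a behavioural sketch under loud assumptions: a real static key patches the branch in the instruction stream so the disabled path costs almost nothing, whereas the atomic counter below is still read on every packet. All toy_* names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Number of ingress qdiscs currently attached anywhere (model only). */
static atomic_int toy_ingress_users;

/* Called when an ingress qdisc is initialized. */
static void toy_inc_ingress_queue(void)
{
	atomic_fetch_add(&toy_ingress_users, 1);
}

/* Called when an ingress qdisc is destroyed. */
static void toy_dec_ingress_queue(void)
{
	atomic_fetch_sub(&toy_ingress_users, 1);
}

static bool toy_ingress_needed(void)
{
	return atomic_load(&toy_ingress_users) > 0;
}

/* Example hook: tags the packet so the caller can see that it ran. */
static int toy_mark_packet(int pkt)
{
	return pkt | 0x100;
}

/* Receive path: the hook runs only while at least one user exists. */
static int toy_receive(int pkt, int (*hook)(int))
{
	if (toy_ingress_needed())
		pkt = hook(pkt);
	return pkt;
}
```

Packets pass through untouched until the first toy_inc_ingress_queue() call, and again after the matching toy_dec_ingress_queue().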
2015-04-10 21:07:54 +00:00
|
|
|
return skb;
|
2007-10-14 07:38:47 +00:00
|
|
|
if (*pt_prev) {
|
|
|
|
*ret = deliver_skb(skb, *pt_prev, orig_dev);
|
|
|
|
*pt_prev = NULL;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2015-05-13 22:36:28 +00:00
|
|
|
qdisc_skb_cb(skb)->pkt_len = skb->len;
|
2017-01-07 22:06:37 +00:00
|
|
|
skb->tc_at_ingress = 1;
|
2015-07-06 12:18:03 +00:00
|
|
|
qdisc_bstats_cpu_update(cl->q, skb);
|
2015-05-09 20:51:31 +00:00
|
|
|
|
2017-05-17 09:07:54 +00:00
|
|
|
switch (tcf_classify(skb, cl, &cl_res, false)) {
|
2015-05-09 20:51:32 +00:00
|
|
|
case TC_ACT_OK:
|
|
|
|
case TC_ACT_RECLASSIFY:
|
|
|
|
skb->tc_index = TC_H_MIN(cl_res.classid);
|
|
|
|
break;
|
|
|
|
case TC_ACT_SHOT:
|
2015-07-06 12:18:03 +00:00
|
|
|
qdisc_qstats_cpu_drop(cl->q);
|
2016-05-06 22:55:50 +00:00
|
|
|
kfree_skb(skb);
|
|
|
|
return NULL;
|
2015-05-09 20:51:32 +00:00
|
|
|
case TC_ACT_STOLEN:
|
|
|
|
case TC_ACT_QUEUED:
|
2017-06-06 12:12:02 +00:00
|
|
|
case TC_ACT_TRAP:
|
2016-05-06 22:55:50 +00:00
|
|
|
consume_skb(skb);
|
2015-05-09 20:51:32 +00:00
|
|
|
return NULL;
|
2015-09-16 06:05:43 +00:00
|
|
|
case TC_ACT_REDIRECT:
|
|
|
|
/* skb_mac_header check was done by cls/act_bpf, so
|
|
|
|
* we can safely push the L2 header back before
|
|
|
|
* redirecting to another netdev
|
|
|
|
*/
|
|
|
|
__skb_push(skb, skb->mac_len);
|
|
|
|
skb_do_redirect(skb);
|
|
|
|
return NULL;
|
2015-05-09 20:51:32 +00:00
|
|
|
default:
|
|
|
|
break;
|
2007-10-14 07:38:47 +00:00
|
|
|
}
|
2015-05-19 20:33:25 +00:00
|
|
|
#endif /* CONFIG_NET_CLS_ACT */
|
netfilter: add netfilter ingress hook after handle_ing() under unique static key
This patch adds the Netfilter ingress hook just after the existing tc ingress
hook, that seems to be the consensus solution for this.
Note that the Netfilter hook resides under the global static key that enables
ingress filtering. Nonetheless, Netfilter still also has its own static key for
minimal impact on the existing handle_ing().
* Without this patch:
Result: OK: 6216490(c6216338+d152) usec, 100000000 (60byte,0frags)
16086246pps 7721Mb/sec (7721398080bps) errors: 100000000
42.46% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core
25.92% kpktgend_0 [kernel.kallsyms] [k] kfree_skb
7.81% kpktgend_0 [pktgen] [k] pktgen_thread_worker
5.62% kpktgend_0 [kernel.kallsyms] [k] ip_rcv
2.70% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal
2.34% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk
1.44% kpktgend_0 [kernel.kallsyms] [k] __build_skb
* With this patch:
Result: OK: 6214833(c6214731+d101) usec, 100000000 (60byte,0frags)
16090536pps 7723Mb/sec (7723457280bps) errors: 100000000
41.23% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core
26.57% kpktgend_0 [kernel.kallsyms] [k] kfree_skb
7.72% kpktgend_0 [pktgen] [k] pktgen_thread_worker
5.55% kpktgend_0 [kernel.kallsyms] [k] ip_rcv
2.78% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal
2.06% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk
1.43% kpktgend_0 [kernel.kallsyms] [k] __build_skb
* Without this patch + tc ingress:
tc filter add dev eth4 parent ffff: protocol ip prio 1 \
u32 match ip dst 4.3.2.1/32
Result: OK: 9269001(c9268821+d179) usec, 100000000 (60byte,0frags)
10788648pps 5178Mb/sec (5178551040bps) errors: 100000000
40.99% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core
17.50% kpktgend_0 [kernel.kallsyms] [k] kfree_skb
11.77% kpktgend_0 [cls_u32] [k] u32_classify
5.62% kpktgend_0 [kernel.kallsyms] [k] tc_classify_compat
5.18% kpktgend_0 [pktgen] [k] pktgen_thread_worker
3.23% kpktgend_0 [kernel.kallsyms] [k] tc_classify
2.97% kpktgend_0 [kernel.kallsyms] [k] ip_rcv
1.83% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal
1.50% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk
0.99% kpktgend_0 [kernel.kallsyms] [k] __build_skb
* With this patch + tc ingress:
tc filter add dev eth4 parent ffff: protocol ip prio 1 \
u32 match ip dst 4.3.2.1/32
Result: OK: 9308218(c9308091+d126) usec, 100000000 (60byte,0frags)
10743194pps 5156Mb/sec (5156733120bps) errors: 100000000
42.01% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core
17.78% kpktgend_0 [kernel.kallsyms] [k] kfree_skb
11.70% kpktgend_0 [cls_u32] [k] u32_classify
5.46% kpktgend_0 [kernel.kallsyms] [k] tc_classify_compat
5.16% kpktgend_0 [pktgen] [k] pktgen_thread_worker
2.98% kpktgend_0 [kernel.kallsyms] [k] ip_rcv
2.84% kpktgend_0 [kernel.kallsyms] [k] tc_classify
1.96% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal
1.57% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk
Note that the results are very similar before and after.
I can see gcc gets the code under the ingress static key out of the hot path.
Then, on that cold branch, it generates the code to accommodate the netfilter
ingress static key. My explanation for this is that this reduces the pressure
on the instruction cache for non-users as the new code is out of the hot path,
and it comes with minimal impact for tc ingress users.
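The "very similar" claim can be cross-checked with simple arithmetic on the pktgen figures quoted above. The following is only a sketch: the helper names are hypothetical, and the packets-per-second value it derives from the elapsed time may differ from pktgen's own printout by a few packets, since pktgen's exact rounding is not reproduced here.

```c
/* Derive packets per second from a packet count and elapsed microseconds. */
static double pps_from_run(double packets, double elapsed_usec)
{
	return packets / (elapsed_usec / 1e6);
}

/* Relative change from "before" to "after", in percent. */
static double relative_change_percent(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

Fed with the reported rates, relative_change_percent(16086246, 16090536) is a small positive fraction of a percent for the plain case, and relative_change_percent(10788648, 10743194) is a drop of well under one percent for the tc ingress case.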
Using gcc version 4.8.4 on:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
[...]
L1d cache: 16K
L1i cache: 64K
L2 cache: 2048K
L3 cache: 8192K
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 16:19:38 +00:00
|
|
|
return skb;
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2016-09-02 05:18:34 +00:00
|
|
|
/**
|
|
|
|
* netdev_is_rx_handler_busy - check if receive handler is registered
|
|
|
|
* @dev: device to check
|
|
|
|
*
|
|
|
|
* Check if a receive handler is already registered for a given device.
|
|
|
|
* Return true if there is one.
|
|
|
|
*
|
|
|
|
* The caller must hold the rtnl_mutex.
|
|
|
|
*/
|
|
|
|
bool netdev_is_rx_handler_busy(struct net_device *dev)
|
|
|
|
{
|
|
|
|
ASSERT_RTNL();
|
|
|
|
return dev && rtnl_dereference(dev->rx_handler);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(netdev_is_rx_handler_busy);
|
|
|
|
|
2010-06-01 21:52:08 +00:00
|
|
|
/**
|
|
|
|
* netdev_rx_handler_register - register receive handler
|
|
|
|
* @dev: device to register a handler for
|
|
|
|
* @rx_handler: receive handler to register
|
2010-06-10 03:34:59 +00:00
|
|
|
* @rx_handler_data: data pointer that is used by rx handler
|
2010-06-01 21:52:08 +00:00
|
|
|
*
|
2014-02-18 13:54:36 +00:00
|
|
|
* Register a receive handler for a device. This handler will then be
|
2010-06-01 21:52:08 +00:00
|
|
|
* called from __netif_receive_skb. A negative errno code is returned
|
|
|
|
* on a failure.
|
|
|
|
*
|
|
|
|
* The caller must hold the rtnl_mutex.
|
2011-03-12 03:14:39 +00:00
|
|
|
*
|
|
|
|
* For a general description of rx_handler, see enum rx_handler_result.
|
2010-06-01 21:52:08 +00:00
|
|
|
*/
|
|
|
|
int netdev_rx_handler_register(struct net_device *dev,
|
2010-06-10 03:34:59 +00:00
|
|
|
rx_handler_func_t *rx_handler,
|
|
|
|
void *rx_handler_data)
|
2010-06-01 21:52:08 +00:00
|
|
|
{
|
2017-01-18 23:02:49 +00:00
|
|
|
if (netdev_is_rx_handler_busy(dev))
|
2010-06-01 21:52:08 +00:00
|
|
|
return -EBUSY;
|
|
|
|
|
net: add a synchronize_net() in netdev_rx_handler_unregister()
commit 35d48903e97819 (bonding: fix rx_handler locking) added a race
in the bonding driver, reported by Steven Rostedt, who did a very good
diagnosis:
<quoting Steven>
I'm currently debugging a crash in an old 3.0-rt kernel that one of our
customers is seeing. The bug happens with a stress test that loads and
unloads the bonding module in a loop (I don't know all the details as
I'm not the one that is directly interacting with the customer). But the
bug looks to be something that may still be present and possibly present
in mainline too. It will just be much harder to trigger it in mainline.
In -rt, interrupts are threads, and can schedule in and out just like
any other thread. Note, mainline now supports interrupt threads so this
may be easily reproducible in mainline as well. I don't have the ability
to tell the customer to try mainline or other kernels, so my hands are
somewhat tied to what I can do.
But according to a core dump, I tracked down that the eth irq thread
crashed in bond_handle_frame() here:
slave = bond_slave_get_rcu(skb->dev);
bond = slave->bond; <--- BUG
the slave returned was NULL and accessing slave->bond caused a NULL
pointer dereference.
Looking at the code that unregisters the handler:
void netdev_rx_handler_unregister(struct net_device *dev)
{
	ASSERT_RTNL();
	RCU_INIT_POINTER(dev->rx_handler, NULL);
	RCU_INIT_POINTER(dev->rx_handler_data, NULL);
}
Which is basically:
dev->rx_handler = NULL;
dev->rx_handler_data = NULL;
And looking at __netif_receive_skb() we have:
rx_handler = rcu_dereference(skb->dev->rx_handler);
if (rx_handler) {
	if (pt_prev) {
		ret = deliver_skb(skb, pt_prev, orig_dev);
		pt_prev = NULL;
	}
	switch (rx_handler(&skb)) {
My question to all of you is, what stops this interrupt from happening
while the bonding module is unloading? What happens if the interrupt
triggers and we have this:
CPU0                                    CPU1
----                                    ----
rx_handler = skb->dev->rx_handler
                                        netdev_rx_handler_unregister() {
                                          dev->rx_handler = NULL;
                                          dev->rx_handler_data = NULL;
rx_handler()
  bond_handle_frame() {
    slave = skb->dev->rx_handler;
    bond = slave->bond; <-- NULL pointer dereference!!!
What protection am I missing in the bond release handler that would
prevent the above from happening?
</quoting Steven>
We can fix this bug in two ways. The first is adding a test in
bond_handle_frame() and others to check if rx_handler_data is NULL.
The second is adding a synchronize_net() in
netdev_rx_handler_unregister() to make sure that an RCU-protected reader
that sees a non-NULL rx_handler is guaranteed to also see a non-NULL
rx_handler_data.
The second way is better as it avoids an extra test in the fast path.
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jiri Pirko <jpirko@redhat.com>
Cc: Paul E. McKenney <paulmck@us.ibm.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-29 03:01:22 +00:00
|
|
|
/* Note: rx_handler_data must be set before rx_handler */
|
2010-06-10 03:34:59 +00:00
|
|
|
rcu_assign_pointer(dev->rx_handler_data, rx_handler_data);
|
2010-06-01 21:52:08 +00:00
|
|
|
rcu_assign_pointer(dev->rx_handler, rx_handler);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(netdev_rx_handler_register);
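The register path above publishes rx_handler_data before rx_handler, so a reader that observes the handler also observes its data. A minimal userspace sketch of that ordering follows, under loud assumptions: every model_* name and count_frame() are hypothetical, C11 release/acquire atomics stand in for rcu_assign_pointer() and rcu_dereference(), and a plain flag stands in for the rtnl_mutex. It is an illustration of the ordering, not the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these names exist in the kernel. */
typedef int (*model_rx_handler_t)(void *data);

struct model_dev {
	_Atomic(model_rx_handler_t) rx_handler;
	_Atomic(void *) rx_handler_data;
	bool rtnl_held;		/* models "the caller must hold the rtnl_mutex" */
};

static void model_dev_init(struct model_dev *dev)
{
	atomic_init(&dev->rx_handler, NULL);
	atomic_init(&dev->rx_handler_data, NULL);
	dev->rtnl_held = false;
}

/* Writer side: refuse a second handler, then publish data before handler. */
static int model_rx_handler_register(struct model_dev *dev,
				     model_rx_handler_t handler, void *data)
{
	assert(dev->rtnl_held);		/* stand-in for ASSERT_RTNL() */

	if (atomic_load(&dev->rx_handler) != NULL)
		return -EBUSY;

	/* Data must become visible no later than the handler itself. */
	atomic_store_explicit(&dev->rx_handler_data, data, memory_order_release);
	atomic_store_explicit(&dev->rx_handler, handler, memory_order_release);
	return 0;
}

/* Reader side: sample the handler first, fetch its data only afterwards. */
static int model_receive(struct model_dev *dev)
{
	model_rx_handler_t handler =
		atomic_load_explicit(&dev->rx_handler, memory_order_acquire);

	if (handler == NULL)
		return -1;	/* no handler registered in this model */
	return handler(atomic_load_explicit(&dev->rx_handler_data,
					    memory_order_acquire));
}

/* Example handler: counts frames through the private data pointer. */
static int count_frame(void *data)
{
	int *frames = data;

	return ++*frames;
}
```

With rtnl_held set, a first model_rx_handler_register() call succeeds and a second one returns -EBUSY, which mirrors the netdev_is_rx_handler_busy() check in the code above.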
|
|
|
|
|
|
|
|
/**
|
|
|
|
* netdev_rx_handler_unregister - unregister receive handler
|
|
|
|
* @dev: device to unregister a handler from
|
|
|
|
*
|
2013-03-18 02:59:52 +00:00
|
|
|
* Unregister a receive handler from a device.
|
2010-06-01 21:52:08 +00:00
|
|
|
*
|
|
|
|
* The caller must hold the rtnl_mutex.
|
|
|
|
*/
|
|
|
|
void netdev_rx_handler_unregister(struct net_device *dev)
|
|
|
|
{
|
|
|
|
|
|
|
|
ASSERT_RTNL();
|
2011-08-01 16:19:00 +00:00
|
|
|
RCU_INIT_POINTER(dev->rx_handler, NULL);
|
net: add a synchronize_net() in netdev_rx_handler_unregister()
2013-03-29 03:01:22 +00:00
|
|
|
/* a reader seeing a non NULL rx_handler in a rcu_read_lock()
|
|
|
|
* section has a guarantee to see a non NULL rx_handler_data
|
|
|
|
* as well.
|
|
|
|
*/
|
|
|
|
synchronize_net();
|
2011-08-01 16:19:00 +00:00
|
|
|
RCU_INIT_POINTER(dev->rx_handler_data, NULL);
|
2010-06-01 21:52:08 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
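The unregister path clears rx_handler, waits in synchronize_net(), and only then clears rx_handler_data. The single-threaded sketch below replays the CPU0/CPU1 interleaving from the commit message step by step. It is a model under stated assumptions: all names are hypothetical, a reader counter stands in for the RCU grace period, and only readers that actually saw a handler are counted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-slave private data. */
struct model_slave {
	int bond_id;
};

struct model_dev {
	bool has_handler;		/* stands in for dev->rx_handler */
	struct model_slave *data;	/* stands in for dev->rx_handler_data */
	int readers;			/* readers that sampled a handler and have not finished */
};

/* Reader, step 1: sample the handler; count the reader only if one was seen. */
static bool reader_enter(struct model_dev *dev)
{
	if (!dev->has_handler)
		return false;
	dev->readers++;
	return true;
}

/* Reader, step 2: the handler body fetches its private data, then leaves. */
static struct model_slave *reader_run_handler(struct model_dev *dev)
{
	struct model_slave *slave = dev->data;

	assert(dev->readers > 0);
	dev->readers--;
	return slave;
}

/* Unregister, phase 1: hide the handler from new readers. */
static void unregister_hide_handler(struct model_dev *dev)
{
	dev->has_handler = false;
}

/* Stand-in for synchronize_net(): true once earlier readers have drained. */
static bool grace_period_elapsed(const struct model_dev *dev)
{
	return dev->readers == 0;
}

/* Unregister, phase 2: clear the data only after the grace period. */
static bool unregister_clear_data(struct model_dev *dev)
{
	if (!grace_period_elapsed(dev))
		return false;	/* the real call would block here instead */
	dev->data = NULL;
	return true;
}
```

In this model, a reader that entered before unregister_hide_handler() still gets a non-NULL slave from reader_run_handler(), because unregister_clear_data() refuses to proceed while that reader is in flight.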
|
|
|
|
|
2012-07-31 23:44:26 +00:00
|
|
|
/*
|
|
|
|
* Limit the use of PFMEMALLOC reserves to those protocols that implement
|
|
|
|
* the special handling of PFMEMALLOC skbs.
|
|
|
|
*/
|
|
|
|
static bool skb_pfmemalloc_protocol(struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
switch (skb->protocol) {
|
2014-03-12 17:04:17 +00:00
|
|
|
case htons(ETH_P_ARP):
|
|
|
|
case htons(ETH_P_IP):
|
|
|
|
case htons(ETH_P_IPV6):
|
|
|
|
case htons(ETH_P_8021Q):
|
|
|
|
case htons(ETH_P_8021AD):
|
2012-07-31 23:44:26 +00:00
|
|
|
return true;
|
|
|
|
default:
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
netfilter: add netfilter ingress hook after handle_ing() under unique static key
2015-05-13 16:19:38 +00:00
|
|
|
static inline int nf_ingress(struct sk_buff *skb, struct packet_type **pt_prev,
|
|
|
|
int *ret, struct net_device *orig_dev)
|
|
|
|
{
|
2015-05-19 20:33:25 +00:00
|
|
|
#ifdef CONFIG_NETFILTER_INGRESS
|
netfilter: add netfilter ingress hook after handle_ing() under unique static key
2015-05-13 16:19:38 +00:00
|
|
|
if (nf_hook_ingress_active(skb)) {
|
2016-09-21 15:35:03 +00:00
|
|
|
int ingress_retval;
|
|
|
|
|
netfilter: add netfilter ingress hook after handle_ing() under unique static key
2015-05-13 16:19:38 +00:00
|
|
|
if (*pt_prev) {
|
|
|
|
*ret = deliver_skb(skb, *pt_prev, orig_dev);
|
|
|
|
*pt_prev = NULL;
|
|
|
|
}
|
|
|
|
|
2016-09-21 15:35:03 +00:00
|
|
|
rcu_read_lock();
|
|
|
|
ingress_retval = nf_hook_ingress(skb);
|
|
|
|
rcu_read_unlock();
|
|
|
|
return ingress_retval;
|
netfilter: add netfilter ingress hook after handle_ing() under unique static key
2015-05-13 16:19:38 +00:00
|
|
|
}
|
2015-05-19 20:33:25 +00:00
|
|
|
#endif /* CONFIG_NETFILTER_INGRESS */
|
netfilter: add netfilter ingress hook after handle_ing() under unique static key
2015-05-13 16:19:38 +00:00
|
|
|
return 0;
|
|
|
|
}
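The control flow of nf_ingress() above can be reduced to a small userspace sketch: when the hook is active, any tap still pending in *pt_prev is delivered first, and only then is the hook's verdict returned. The names below are hypothetical, the hook verdict is passed in by the caller instead of coming from nf_hook_ingress(), and the RCU read-side section is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct packet_type: it only counts deliveries. */
struct sketch_tap {
	int delivered;
};

/* Stand-in for deliver_skb(): record the delivery and report success. */
static int sketch_deliver(struct sketch_tap *tap)
{
	tap->delivered++;
	return 0;
}

/*
 * Serve the pending tap before the hook runs, so the tap sees the packet
 * regardless of the verdict. Returns the verdict when the hook is active,
 * and 0 when it is not.
 */
static int sketch_ingress(struct sketch_tap **pt_prev, int *ret,
			  int hook_active, int hook_verdict)
{
	assert(pt_prev != NULL && ret != NULL);

	if (hook_active) {
		if (*pt_prev != NULL) {
			*ret = sketch_deliver(*pt_prev);
			*pt_prev = NULL;
		}
		return hook_verdict;
	}
	return 0;
}
```

With the hook inactive, the pending tap is left untouched for the caller to deliver later; with the hook active, the tap is flushed exactly once and the slot is reset to NULL.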
|
|
|
|
|
2013-02-14 20:57:38 +00:00
|
|
|
static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
struct packet_type *ptype, *pt_prev;
|
2010-06-01 21:52:08 +00:00
|
|
|
rx_handler_func_t *rx_handler;
|
2005-08-10 02:34:12 +00:00
|
|
|
struct net_device *orig_dev;
|
2011-03-12 03:14:39 +00:00
|
|
|
bool deliver_exact = false;
|
2005-04-16 22:20:36 +00:00
|
|
|
int ret = NET_RX_DROP;
|
2006-11-15 04:48:11 +00:00
|
|
|
__be16 type;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2011-11-15 04:12:55 +00:00
|
|
|
net_timestamp_check(!netdev_tstamp_prequeue, skb);
|
2009-09-30 23:42:42 +00:00
|
|
|
|
2010-08-23 09:45:02 +00:00
|
|
|
trace_netif_receive_skb(skb);
|
2008-11-04 22:49:57 +00:00
|
|
|
|
2008-07-03 01:22:00 +00:00
|
|
|
orig_dev = skb->dev;
|
2006-02-22 00:36:44 +00:00
|
|
|
|
2007-04-11 03:45:18 +00:00
|
|
|
skb_reset_network_header(skb);
|
2013-01-07 09:28:21 +00:00
|
|
|
if (!skb_transport_header_was_set(skb))
|
|
|
|
skb_reset_transport_header(skb);
|
2011-06-10 06:56:58 +00:00
|
|
|
skb_reset_mac_len(skb);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
pt_prev = NULL;
|
|
|
|
|
2011-02-28 18:48:59 +00:00
|
|
|
another_round:
|
2012-07-23 23:27:54 +00:00
|
|
|
skb->skb_iif = skb->dev->ifindex;
|
2011-02-28 18:48:59 +00:00
|
|
|
|
|
|
|
__this_cpu_inc(softnet_data.processed);
|
|
|
|
|
net: vlan: add 802.1ad support
Add support for 802.1ad VLAN devices. This mainly consists of checking for
ETH_P_8021AD in addition to ETH_P_8021Q in a couple of places and checking
offloading capabilities based on the used protocol.
Configuration is done using "ip link":
# ip link add link eth0 eth0.1000 \
type vlan proto 802.1ad id 1000
# ip link add link eth0.1000 eth0.1000.1000 \
type vlan proto 802.1q id 1000
52:54:00:12:34:56 > 92:b1:54:28:e4:8c, ethertype 802.1Q (0x8100), length 106: vlan 1000, p 0, ethertype 802.1Q, vlan 1000, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto ICMP (1), length 84)
20.1.0.2 > 20.1.0.1: ICMP echo request, id 3003, seq 8, length 64
92:b1:54:28:e4:8c > 52:54:00:12:34:56, ethertype 802.1Q-QinQ (0x88a8), length 106: vlan 1000, p 0, ethertype 802.1Q, vlan 1000, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 47944, offset 0, flags [none], proto ICMP (1), length 84)
20.1.0.1 > 20.1.0.2: ICMP echo reply, id 3003, seq 8, length 64
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
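The ethertype test this commit describes can be sketched in userspace C as follows. This is only an illustration under stated assumptions: the `SKETCH_` constants and `sketch_is_vlan_ethertype()` are hypothetical names, not kernel symbols, and `htons()` stands in for the kernel's `cpu_to_be16()`.

```c
#include <arpa/inet.h>	/* htons(), userspace stand-in for cpu_to_be16() */
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical local copies of the two VLAN ethertypes. */
#define SKETCH_ETH_P_8021Q	0x8100	/* 802.1Q tag */
#define SKETCH_ETH_P_8021AD	0x88A8	/* 802.1ad (QinQ) service tag */

/*
 * proto_be is the frame's protocol field in network byte order,
 * mirroring how skb->protocol is stored.
 */
static bool sketch_is_vlan_ethertype(uint16_t proto_be)
{
	return proto_be == htons(SKETCH_ETH_P_8021Q) ||
	       proto_be == htons(SKETCH_ETH_P_8021AD);
}
```

Comparing against a byte-swapped constant, rather than swapping the packet field, keeps the conversion on the compile-time side of the comparison.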
2013-04-19 02:04:31 +00:00
|
|
|
if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
|
|
|
|
skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
|
net: Always untag vlan-tagged traffic on input.
Currently the functionality to untag traffic on input resides
as part of the vlan module and is built only when VLAN support
is enabled in the kernel. When VLAN is disabled, the function
vlan_untag() turns into a stub and doesn't really untag the
packets. This seems to create an interesting interaction
between VMs supporting checksum offloading and some network drivers.
There are some drivers that do not allow the user to change
tx-vlan-offload feature of the driver. These drivers also seem
to assume that any VLAN-tagged traffic they transmit will
have the vlan information in the vlan_tci and not in the vlan
header already in the skb. When transmitting skbs that already
have tagged data with partial checksum set, the checksum doesn't
appear to be updated correctly by the card thus resulting in a
failure to establish TCP connections.
The following is a packet trace taken on the receiver, where the
sender is a VM with a VLAN configured. The host the VM is running on
does not have VLAN support, and the outgoing interface on the
host is tg3:
10:12:43.503055 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
(0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27243,
offset 0, flags [DF], proto TCP (6), length 60)
10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
-> 0x48d9), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
4294837885 ecr 0,nop,wscale 7], length 0
10:12:44.505556 52:54:00:ae:42:3f > 28:d2:44:7d:c2:de, ethertype 802.1Q
(0x8100), length 78: vlan 100, p 0, ethertype IPv4, (tos 0x0, ttl 64, id 27244,
offset 0, flags [DF], proto TCP (6), length 60)
10.0.100.1.58545 > 10.0.100.10.ircu-2: Flags [S], cksum 0xdc39 (incorrect
-> 0x44ee), seq 1069378582, win 29200, options [mss 1460,sackOK,TS val
4294838888 ecr 0,nop,wscale 7], length 0
This connection finally times out.
I only have access to the TG3 hardware in this configuration and thus have
only tested this with the TG3 driver. There are a lot of other drivers
that do not permit user changes to vlan acceleration features, and
I don't know if they all suffer from a similar issue.
The patch attempts to fix this another way. It moves the vlan header
stripping code out of the vlan module and always builds it into the
kernel network core. This way, even if vlan is not supported on
a virtualization host, the virtual machines running on top of such
a host will still work with VLANs enabled.
CC: Patrick McHardy <kaber@trash.net>
CC: Nithin Nayak Sujir <nsujir@broadcom.com>
CC: Michael Chan <mchan@broadcom.com>
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
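What "untagging" means for the frame bytes can be sketched on a flat buffer. This is a simplified illustration, not the kernel's skb_vlan_untag(): `sketch_vlan_untag()` and the `SKETCH_` constants are hypothetical, and the sketch assumes a plain Ethernet frame laid out as destination MAC, source MAC, TPID, TCI, inner ethertype.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ETH_ALEN		6	/* bytes per MAC address */
#define SKETCH_VLAN_HLEN	4	/* TPID (2) + TCI (2) */

static uint16_t sketch_read_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Strip one VLAN tag from an Ethernet frame held in a flat buffer.
 * On success the TCI is stored in *tci, *len shrinks by 4 and the
 * returned pointer is the start of the untagged frame. Returns NULL
 * if the frame is too short or carries no VLAN tag.
 */
static uint8_t *sketch_vlan_untag(uint8_t *frame, size_t *len, uint16_t *tci)
{
	uint16_t tpid;

	if (*len < 2 * SKETCH_ETH_ALEN + SKETCH_VLAN_HLEN + 2)
		return NULL;

	tpid = sketch_read_be16(frame + 2 * SKETCH_ETH_ALEN);
	if (tpid != 0x8100 && tpid != 0x88A8)
		return NULL;

	/* Remember the tag out of band, as skb->vlan_tci does. */
	*tci = sketch_read_be16(frame + 2 * SKETCH_ETH_ALEN + 2);

	/* Slide both MAC addresses forward over the 4-byte tag. */
	memmove(frame + SKETCH_VLAN_HLEN, frame, 2 * SKETCH_ETH_ALEN);
	*len -= SKETCH_VLAN_HLEN;
	return frame + SKETCH_VLAN_HLEN;
}
```

After the call the inner ethertype sits directly behind the MAC addresses, so later code sees an ordinary untagged frame plus an out-of-band TCI.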
2014-08-08 18:42:13 +00:00
|
|
|
skb = skb_vlan_untag(skb);
|
2011-04-07 19:48:33 +00:00
|
|
|
if (unlikely(!skb))
|
2015-07-09 06:59:10 +00:00
|
|
|
goto out;
|
2011-04-07 19:48:33 +00:00
|
|
|
}
|
|
|
|
|
2017-01-07 22:06:35 +00:00
|
|
|
if (skb_skip_tc_classify(skb))
|
|
|
|
goto skip_classify;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2013-02-14 20:57:38 +00:00
|
|
|
if (pfmemalloc)
|
2012-07-31 23:44:26 +00:00
|
|
|
goto skip_taps;
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
list_for_each_entry_rcu(ptype, &ptype_all, list) {
|
2015-01-27 19:35:48 +00:00
|
|
|
if (pt_prev)
|
|
|
|
ret = deliver_skb(skb, pt_prev, orig_dev);
|
|
|
|
pt_prev = ptype;
|
|
|
|
}
|
|
|
|
|
|
|
|
list_for_each_entry_rcu(ptype, &skb->dev->ptype_all, list) {
|
|
|
|
if (pt_prev)
|
|
|
|
ret = deliver_skb(skb, pt_prev, orig_dev);
|
|
|
|
pt_prev = ptype;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2012-07-31 23:44:26 +00:00
|
|
|
skip_taps:
|
2015-05-13 16:19:37 +00:00
|
|
|
#ifdef CONFIG_NET_INGRESS
|
net: use jump label patching for ingress qdisc in __netif_receive_skb_core
Even if we make use of classifier and actions from the egress
path, we're going into handle_ing() executing additional code
on a per-packet cost for ingress qdisc, just to realize that
nothing is attached on ingress.
Instead, this can just be blinded out as a no-op entirely with
the use of a static key. On input fast-path, we already make
use of static keys in various places, e.g. skb time stamping,
in RPS, etc. It makes sense to not waste time when we're assured
that no ingress qdisc is attached anywhere.
Enabling/disabling of that code path is being done via two
helpers, namely net_{inc,dec}_ingress_queue(), that are being
invoked under RTNL mutex when an ingress qdisc is being either
initialized or destroyed.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
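The enable/disable pattern described above can be approximated in userspace C with a plain counter. This is only an analogue under loud assumptions: a real static key patches the branch in the instruction stream, while this sketch still tests a variable on every call, and every `sketch_` name is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the static key's enable count. */
static int sketch_ingress_users;

/* Called when an ingress qdisc is created (under RTNL in the kernel). */
static void sketch_inc_ingress_queue(void)
{
	sketch_ingress_users++;
}

/* Called when an ingress qdisc is destroyed. */
static void sketch_dec_ingress_queue(void)
{
	if (sketch_ingress_users > 0)
		sketch_ingress_users--;
}

static bool sketch_ingress_needed(void)
{
	return sketch_ingress_users > 0;
}

/*
 * Receive path: the classifier step runs only while at least one
 * ingress user exists. *classified counts classifier invocations.
 */
static void sketch_receive(int *classified)
{
	if (sketch_ingress_needed())
		(*classified)++;
}
```

The counter makes the hook reference counted: the classifier step stays disabled until the first qdisc appears and is disabled again once the last one is gone.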
2015-04-10 21:07:54 +00:00
|
|
|
if (static_key_false(&ingress_needed)) {
|
net, sched: add clsact qdisc
This work adds a generalization of the ingress qdisc as a qdisc holding
only classifiers. The clsact qdisc works on ingress, but also on egress.
In both cases, its execution happens without taking the qdisc lock, and
the main difference for the egress part compared to prior version of [1]
is that this can be applied with _any_ underlying real egress qdisc (also
classless ones).
Besides solving the use-case of [1], that is, allowing for more programmability
on assigning skb->priority for the mqprio case that is supported by most
popular 10G+ NICs, it also opens up a lot more flexibility for other tc
applications. The main work on classification can already be done at clsact
egress time if the use-case allows and state stored for later retrieval
f.e. again in skb->priority with major/minors (which is checked by most
classful qdiscs before consulting tc_classify()) and/or in other skb fields
like skb->tc_index for some light-weight post-processing to get to the
eventual classid in case of a classful qdisc. Another use case is that
the clsact egress part allows to have a central egress counterpart to
the ingress classifiers, so that classifiers can easily share state (e.g.
in cls_bpf via eBPF maps) for ingress and egress.
Currently, default setups like mq + pfifo_fast would require for this to
use, for example, prio qdisc instead (to get a tc_classify() run) and to
duplicate the egress classifier for each queue. With clsact, it allows
for leaving the setup as is, it can additionally assign skb->priority to
put the skb in one of pfifo_fast's bands and it can share state with maps.
Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
w/o the need to perform a skb_dst_force() to hold on to it any longer. In
lwt case, we can also use this facility to setup dst metadata via cls_bpf
(bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
that (case of IFF_NO_QUEUE devices, for example).
The realization can be done without any changes to the scheduler core
framework. All it takes is that we have two a-priori defined minors/child
classes, where we can mux between ingress and egress classifier list
(dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
dev->_tx to avoid extra cacheline miss for moderate loads). The egress
part is modelled similarly to handle_ing() and patched to a noop in
case the functionality is not used. Both handlers are now called
sch_handle_ingress() and sch_handle_egress(), code sharing among the two
doesn't seem practical as there are various minor differences in both
paths, so that making them conditional in a single handler would rather
slow things down.
Full compatibility to ingress qdisc is provided as well. Since both
piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
per netdevice, and thus ingress qdisc specific behaviour can be retained
for user space. This means, either a user does 'tc qdisc add dev foo ingress'
and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
alternative, where both, ingress and egress classifier can be configured
as in the below example. ingress qdisc supports attaching classifier to any
minor number whereas clsact has two fixed minors for muxing between the
lists, therefore to not break user space setups, they are better done as
two separate qdiscs.
I decided to extend the sch_ingress module with clsact functionality so
that commonly used code can be reused, the module is being aliased with
sch_clsact so that it can be auto-loaded properly. Alternative would have been
to add a flag when initializing ingress to alter its behaviour plus aliasing
to a different name (as it's more than just ingress). However, the first would
end up, based on the flag, choosing the new/old behaviour by calling different
function implementations to handle each anyway, the latter would require to
register ingress qdisc once again under different alias. So, this really begs
to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
by its own that share callbacks used by both.
Example, adding qdisc:
# tc qdisc add dev foo clsact
# tc qdisc show dev foo
qdisc mq 0: root
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc clsact ffff: parent ffff:fff1
Adding filters (deleting, etc. works analogously by specifying ingress/egress):
# tc filter add dev foo ingress bpf da obj bar.o sec ingress
# tc filter add dev foo egress bpf da obj bar.o sec egress
# tc filter show dev foo ingress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
# tc filter show dev foo egress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action
A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
show an empty list for clsact. Either using the parent names (ingress/egress)
or specifying the full major/minor will then show the related filter lists.
Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.
[1] http://patchwork.ozlabs.org/patch/512949/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
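The muxing between the two classifier lists by fixed minor numbers can be sketched as below. Treat it as an assumption-laden illustration: the minor values 0xFFF2 and 0xFFF3 are my recollection of TC_H_MIN_INGRESS and TC_H_MIN_EGRESS in the uapi pkt_sched.h header and should be checked there, and `struct sketch_dev` with its string "lists" is a hypothetical stand-in for dev->ingress_cl_list and dev->egress_cl_list.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical local copies of the handle helpers and minors. */
#define SKETCH_TC_H_MIN(h)		((h) & 0x0000FFFFU)
#define SKETCH_TC_H_MIN_INGRESS		0xFFF2U
#define SKETCH_TC_H_MIN_EGRESS		0xFFF3U

/* Stand-in for the two per-device classifier list heads. */
struct sketch_dev {
	const char *ingress_cl_list;
	const char *egress_cl_list;
};

/*
 * Pick the classifier list addressed by a class id. Only the two
 * predefined minors are valid; anything else yields NULL.
 */
static const char **sketch_clsact_list(struct sketch_dev *dev,
				       uint32_t classid)
{
	switch (SKETCH_TC_H_MIN(classid)) {
	case SKETCH_TC_H_MIN_INGRESS:
		return &dev->ingress_cl_list;
	case SKETCH_TC_H_MIN_EGRESS:
		return &dev->egress_cl_list;
	default:
		return NULL;
	}
}
```

Rejecting every other minor is what distinguishes this from the plain ingress qdisc, which the commit message says accepts classifiers on any minor.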
2016-01-07 21:29:47 +00:00
|
|
|
skb = sch_handle_ingress(skb, &pt_prev, &ret, orig_dev);
|
2015-04-10 21:07:54 +00:00
|
|
|
if (!skb)
|
2015-07-09 06:59:10 +00:00
|
|
|
goto out;
|
netfilter: add netfilter ingress hook after handle_ing() under unique static key
This patch adds the Netfilter ingress hook just after the existing tc ingress
hook, that seems to be the consensus solution for this.
Note that the Netfilter hook resides under the global static key that enables
ingress filtering. Nonetheless, Netfilter still also has its own static key for
minimal impact on the existing handle_ing().
* Without this patch:
Result: OK: 6216490(c6216338+d152) usec, 100000000 (60byte,0frags)
16086246pps 7721Mb/sec (7721398080bps) errors: 100000000
42.46% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core
25.92% kpktgend_0 [kernel.kallsyms] [k] kfree_skb
7.81% kpktgend_0 [pktgen] [k] pktgen_thread_worker
5.62% kpktgend_0 [kernel.kallsyms] [k] ip_rcv
2.70% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal
2.34% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk
1.44% kpktgend_0 [kernel.kallsyms] [k] __build_skb
* With this patch:
Result: OK: 6214833(c6214731+d101) usec, 100000000 (60byte,0frags)
16090536pps 7723Mb/sec (7723457280bps) errors: 100000000
41.23% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core
26.57% kpktgend_0 [kernel.kallsyms] [k] kfree_skb
7.72% kpktgend_0 [pktgen] [k] pktgen_thread_worker
5.55% kpktgend_0 [kernel.kallsyms] [k] ip_rcv
2.78% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal
2.06% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk
1.43% kpktgend_0 [kernel.kallsyms] [k] __build_skb
* Without this patch + tc ingress:
tc filter add dev eth4 parent ffff: protocol ip prio 1 \
u32 match ip dst 4.3.2.1/32
Result: OK: 9269001(c9268821+d179) usec, 100000000 (60byte,0frags)
10788648pps 5178Mb/sec (5178551040bps) errors: 100000000
40.99% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core
17.50% kpktgend_0 [kernel.kallsyms] [k] kfree_skb
11.77% kpktgend_0 [cls_u32] [k] u32_classify
5.62% kpktgend_0 [kernel.kallsyms] [k] tc_classify_compat
5.18% kpktgend_0 [pktgen] [k] pktgen_thread_worker
3.23% kpktgend_0 [kernel.kallsyms] [k] tc_classify
2.97% kpktgend_0 [kernel.kallsyms] [k] ip_rcv
1.83% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal
1.50% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk
0.99% kpktgend_0 [kernel.kallsyms] [k] __build_skb
* With this patch + tc ingress:
tc filter add dev eth4 parent ffff: protocol ip prio 1 \
u32 match ip dst 4.3.2.1/32
Result: OK: 9308218(c9308091+d126) usec, 100000000 (60byte,0frags)
10743194pps 5156Mb/sec (5156733120bps) errors: 100000000
42.01% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core
17.78% kpktgend_0 [kernel.kallsyms] [k] kfree_skb
11.70% kpktgend_0 [cls_u32] [k] u32_classify
5.46% kpktgend_0 [kernel.kallsyms] [k] tc_classify_compat
5.16% kpktgend_0 [pktgen] [k] pktgen_thread_worker
2.98% kpktgend_0 [kernel.kallsyms] [k] ip_rcv
2.84% kpktgend_0 [kernel.kallsyms] [k] tc_classify
1.96% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal
1.57% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk
Note that the results are very similar before and after.
I can see gcc gets the code under the ingress static key out of the hot path.
Then, on that cold branch, it generates the code to accommodate the netfilter
ingress static key. My explanation for this is that this reduces the pressure
on the instruction cache for non-users as the new code is out of the hot path,
and it comes with minimal impact for tc ingress users.
Using gcc version 4.8.4 on:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
[...]
L1d cache: 16K
L1i cache: 64K
L2 cache: 2048K
L3 cache: 8192K
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 16:19:38 +00:00
|
|
|
|
|
|
|
if (nf_ingress(skb, &pt_prev, &ret, orig_dev) < 0)
|
2015-07-09 06:59:10 +00:00
|
|
|
goto out;
|
2015-04-10 21:07:54 +00:00
|
|
|
}
|
2015-05-13 16:19:37 +00:00
|
|
|
#endif
|
2017-01-07 22:06:36 +00:00
|
|
|
skb_reset_tc(skb);
|
2017-01-07 22:06:35 +00:00
|
|
|
skip_classify:
|
2013-02-14 20:57:38 +00:00
|
|
|
if (pfmemalloc && !skb_pfmemalloc_protocol(skb))
|
2012-07-31 23:44:26 +00:00
|
|
|
goto drop;
|
|
|
|
|
2015-01-13 16:13:44 +00:00
|
|
|
if (skb_vlan_tag_present(skb)) {
|
2011-10-10 09:16:41 +00:00
|
|
|
if (pt_prev) {
|
|
|
|
ret = deliver_skb(skb, pt_prev, orig_dev);
|
|
|
|
pt_prev = NULL;
|
|
|
|
}
|
vlan: don't deliver frames for unknown vlans to protocols
6a32e4f9dd9219261f8856f817e6655114cfec2f made the vlan code skip marking
vlan-tagged frames for not locally configured vlans as PACKET_OTHERHOST if
there was an rx_handler, as the rx_handler could cause the frame to be received
on a different (virtual) vlan-capable interface where that vlan might be
configured.
As rx_handlers do not necessarily return RX_HANDLER_ANOTHER, this could cause
frames for unknown vlans to be delivered to the protocol stack as if they had
been received untagged.
For example, if an ipv6 router advertisement that's tagged for a locally not
configured vlan is received on an interface with macvlan interfaces attached,
macvlan's rx_handler returns RX_HANDLER_PASS after delivering the frame to the
macvlan interfaces, which caused it to be passed to the protocol stack, leading
to ipv6 addresses for the announced prefix being configured even though those
are completely unusable on the underlying interface.
The fix moves marking as PACKET_OTHERHOST after the rx_handler so the
rx_handler, if there is one, sees the frame unchanged, but afterwards,
before the frame is delivered to the protocol stack, it gets marked whether
there is an rx_handler or not.
Signed-off-by: Florian Zumbiehl <florz@florz.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
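The marking step that the fix moves behind the rx_handler can be sketched as follows. The struct and names are hypothetical simplifications: the sketch keeps a separate `vlan_present` flag and assumes the VLAN ID occupies the low 12 bits of the TCI.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_VLAN_VID_MASK	0x0fff	/* low 12 bits of the TCI */

enum sketch_pkt_type { SKETCH_PACKET_HOST, SKETCH_PACKET_OTHERHOST };

/* Hypothetical, heavily reduced stand-in for struct sk_buff. */
struct sketch_skb {
	bool vlan_present;
	uint16_t vlan_tci;
	enum sketch_pkt_type pkt_type;
};

/*
 * Runs after the rx_handler has had its chance: a frame that still
 * carries a tag with a nonzero VLAN ID belongs to a VLAN that is not
 * configured locally, so it is marked as destined for another host.
 * Priority-tagged frames (VLAN ID 0) keep their packet type.
 */
static void sketch_mark_unknown_vlan(struct sketch_skb *skb)
{
	if (!skb->vlan_present)
		return;
	if (skb->vlan_tci & SKETCH_VLAN_VID_MASK)
		skb->pkt_type = SKETCH_PACKET_OTHERHOST;
	skb->vlan_tci = 0;
	skb->vlan_present = false;
}
```

Because the rx_handler runs first, it still sees the tag unchanged; only the delivery to the protocol stack observes the marking.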
2012-10-07 15:51:58 +00:00
|
|
|
if (vlan_do_receive(&skb))
|
2011-10-10 09:16:41 +00:00
|
|
|
goto another_round;
|
|
|
|
else if (unlikely(!skb))
|
2015-07-09 06:59:10 +00:00
|
|
|
goto out;
|
2011-10-10 09:16:41 +00:00
|
|
|
}
|
|
|
|
|
2012-10-07 15:51:58 +00:00
|
|
|
rx_handler = rcu_dereference(skb->dev->rx_handler);
|
2010-06-01 21:52:08 +00:00
|
|
|
if (rx_handler) {
|
|
|
|
if (pt_prev) {
|
|
|
|
ret = deliver_skb(skb, pt_prev, orig_dev);
|
|
|
|
pt_prev = NULL;
|
|
|
|
}
|
2011-03-12 03:14:39 +00:00
|
|
|
switch (rx_handler(&skb)) {
|
|
|
|
case RX_HANDLER_CONSUMED:
|
2013-03-08 07:03:38 +00:00
|
|
|
ret = NET_RX_SUCCESS;
|
2015-07-09 06:59:10 +00:00
|
|
|
goto out;
|
2011-03-12 03:14:39 +00:00
|
|
|
case RX_HANDLER_ANOTHER:
|
2011-02-28 18:48:59 +00:00
|
|
|
goto another_round;
|
2011-03-12 03:14:39 +00:00
|
|
|
case RX_HANDLER_EXACT:
|
|
|
|
deliver_exact = true;
|
|
|
|
case RX_HANDLER_PASS:
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
BUG();
|
|
|
|
}
|
2010-06-01 21:52:08 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2015-01-13 16:13:44 +00:00
|
|
|
if (unlikely(skb_vlan_tag_present(skb))) {
|
|
|
|
if (skb_vlan_tag_get_id(skb))
|
2013-07-18 14:19:26 +00:00
|
|
|
skb->pkt_type = PACKET_OTHERHOST;
|
|
|
|
/* Note: we might in the future use prio bits
|
|
|
|
* and set skb->priority like in vlan_do_receive()
|
|
|
|
* For the time being, just ignore Priority Code Point
|
|
|
|
*/
|
|
|
|
skb->vlan_tci = 0;
|
|
|
|
}
|
2012-10-07 15:51:58 +00:00
|
|
|
|
2015-01-27 19:35:48 +00:00
|
|
|
type = skb->protocol;
|
|
|
|
|
2011-02-28 18:48:59 +00:00
|
|
|
/* deliver only exact match when indicated */
|
2015-01-27 19:35:48 +00:00
|
|
|
if (likely(!deliver_exact)) {
|
|
|
|
deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
|
|
|
|
&ptype_base[ntohs(type) &
|
|
|
|
PTYPE_HASH_MASK]);
|
|
|
|
}
|
bonding: allow arp_ip_targets on separate vlans to use arp validation
This allows a bond device to specify an arp_ip_target as a host that is
not on the same vlan as the base bond device and still use arp
validation. A configuration like this, now works:
BONDING_OPTS="mode=active-backup arp_interval=1000 arp_ip_target=10.0.100.1 arp_validate=3"
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 qlen 1000
link/ether 00:13:21:be:33:e9 brd ff:ff:ff:ff:ff:ff
3: eth0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 qlen 1000
link/ether 00:13:21:be:33:e9 brd ff:ff:ff:ff:ff:ff
8: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue
link/ether 00:13:21:be:33:e9 brd ff:ff:ff:ff:ff:ff
inet6 fe80::213:21ff:febe:33e9/64 scope link
valid_lft forever preferred_lft forever
9: bond0.100@bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue
link/ether 00:13:21:be:33:e9 brd ff:ff:ff:ff:ff:ff
inet 10.0.100.2/24 brd 10.0.100.255 scope global bond0.100
inet6 fe80::213:21ff:febe:33e9/64 scope link
valid_lft forever preferred_lft forever
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)
Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 1000
ARP IP target/s (n.n.n.n form): 10.0.100.1
Slave Interface: eth1
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:40:05:30:ff:30
Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:13:21:be:33:e9
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-14 10:48:58 +00:00
|
|
|
|
2015-01-27 19:35:48 +00:00
|
|
|
deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
|
|
|
|
&orig_dev->ptype_specific);
|
|
|
|
|
|
|
|
if (unlikely(skb->dev != orig_dev)) {
|
|
|
|
deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
|
|
|
|
&skb->dev->ptype_specific);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (pt_prev) {
|
sock: enable MSG_ZEROCOPY
Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
skb_zerocopy_clone() wherever needed due to skb split, merge, resize
or clone.
Split skb_orphan_frags into two variants. The split, merge, .. paths
support reference counted zerocopy buffers, so do not do a deep copy.
Add skb_orphan_frags_rx for paths that may loop packets to receive
sockets. That is not allowed, as it may cause unbounded latency.
Deep copy all zerocopy buffers, ref-counted or not, in this path.
The exact locations to modify were chosen by exhaustively searching
through all code that might modify skb_frag references and/or the
SKBTX_DEV_ZEROCOPY tx_flags bit.
The changes err on the safe side, in two ways.
(1) legacy ubuf_info paths virtio and tap are not modified. They keep
a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
still call skb_copy_ubufs and thus copy frags in this case.
(2) not all copies deep in the stack are addressed yet. skb_shift,
skb_split and skb_try_coalesce can be refined to avoid copying.
These are not in the hot path and this patch is hairy enough as
is, so that is left for future refinement.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 20:29:41 +00:00
|
|
|
if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
|
2012-09-15 22:44:16 +00:00
|
|
|
goto drop;
|
2012-07-20 09:23:17 +00:00
|
|
|
else
|
|
|
|
ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
|
2005-04-16 22:20:36 +00:00
|
|
|
} else {
|
2012-07-31 23:44:26 +00:00
|
|
|
drop:
|
2016-02-01 23:51:05 +00:00
|
|
|
if (!deliver_exact)
|
|
|
|
atomic_long_inc(&skb->dev->rx_dropped);
|
|
|
|
else
|
|
|
|
atomic_long_inc(&skb->dev->rx_nohandler);
|
2005-04-16 22:20:36 +00:00
|
|
|
kfree_skb(skb);
|
|
|
|
/* Jamal, now you will not be able to escape explaining
|
|
|
|
* me how you were going to use this. :-)
|
|
|
|
*/
|
|
|
|
ret = NET_RX_DROP;
|
|
|
|
}
|
|
|
|
|
2015-07-09 06:59:10 +00:00
|
|
|
out:
|
2013-02-14 20:57:38 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int __netif_receive_skb(struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (sk_memalloc_socks() && skb_pfmemalloc(skb)) {
|
2017-05-08 22:59:53 +00:00
|
|
|
unsigned int noreclaim_flag;
|
2013-02-14 20:57:38 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* PFMEMALLOC skbs are special, they should
|
|
|
|
* - be delivered to SOCK_MEMALLOC sockets only
|
|
|
|
* - stay away from userspace
|
|
|
|
* - have bounded memory usage
|
|
|
|
*
|
|
|
|
* Use PF_MEMALLOC as this saves us from propagating the allocation
|
|
|
|
* context down to all allocation sites.
|
|
|
|
*/
|
2017-05-08 22:59:53 +00:00
|
|
|
noreclaim_flag = memalloc_noreclaim_save();
|
2013-02-14 20:57:38 +00:00
|
|
|
ret = __netif_receive_skb_core(skb, true);
|
2017-05-08 22:59:53 +00:00
|
|
|
memalloc_noreclaim_restore(noreclaim_flag);
|
2013-02-14 20:57:38 +00:00
|
|
|
} else
|
|
|
|
ret = __netif_receive_skb_core(skb, false);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
return ret;
|
|
|
|
}
|
2010-03-16 08:03:29 +00:00
|
|
|
|
2017-04-18 19:36:58 +00:00
|
|
|
static int generic_xdp_install(struct net_device *dev, struct netdev_xdp *xdp)
|
|
|
|
{
|
2017-06-16 00:29:09 +00:00
|
|
|
struct bpf_prog *old = rtnl_dereference(dev->xdp_prog);
|
2017-04-18 19:36:58 +00:00
|
|
|
struct bpf_prog *new = xdp->prog;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
switch (xdp->command) {
|
2017-06-16 00:29:09 +00:00
|
|
|
case XDP_SETUP_PROG:
|
2017-04-18 19:36:58 +00:00
|
|
|
rcu_assign_pointer(dev->xdp_prog, new);
|
|
|
|
if (old)
|
|
|
|
bpf_prog_put(old);
|
|
|
|
|
|
|
|
if (old && !new) {
|
|
|
|
static_key_slow_dec(&generic_xdp_needed);
|
|
|
|
} else if (new && !old) {
|
|
|
|
static_key_slow_inc(&generic_xdp_needed);
|
|
|
|
dev_disable_lro(dev);
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
|
|
|
|
case XDP_QUERY_PROG:
|
2017-06-16 00:29:09 +00:00
|
|
|
xdp->prog_attached = !!old;
|
|
|
|
xdp->prog_id = old ? old->aux->id : 0;
|
2017-04-18 19:36:58 +00:00
|
|
|
break;
|
|
|
|
|
|
|
|
default:
|
|
|
|
ret = -EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2014-01-10 22:17:24 +00:00
|
|
|
static int netif_receive_skb_internal(struct sk_buff *skb)
|
2010-03-16 08:03:29 +00:00
|
|
|
{
|
2015-07-09 06:59:10 +00:00
|
|
|
int ret;
|
|
|
|
|
2011-11-15 04:12:55 +00:00
|
|
|
net_timestamp_check(netdev_tstamp_prequeue, skb);
|
net: Consistent skb timestamping
With RPS inclusion, skb timestamping is not consistent in the RX path.
If netif_receive_skb() is used, it's deferred until after RPS dispatch.
If netif_rx() is used, it's done before RPS dispatch.
This can give strange tcpdump timestamp results.
I think timestamping should be done as soon as possible in the receive
path, to get meaningful values (i.e. timestamps taken at the time the packet
was delivered by the NIC driver to our stack), even if NAPI can already
defer timestamping a bit (RPS can help to reduce the gap).
Tom Herbert prefers to sample timestamps after RPS dispatch. In case
sampling is expensive (HPET/acpi_pm on x86), this makes sense.
Let admins switch from one mode to another, using a new
sysctl, /proc/sys/net/core/netdev_tstamp_prequeue.
Its default value (1) means timestamps are taken as soon as possible,
before backlog queueing, giving accurate timestamps.
Setting it to 0 samples timestamps when processing the backlog,
after RPS dispatch, to lower the load on the pre-RPS CPU.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-05-16 06:57:10 +00:00
|
|
|
|
2010-07-17 08:49:36 +00:00
|
|
|
if (skb_defer_rx_timestamp(skb))
|
|
|
|
return NET_RX_SUCCESS;
|
|
|
|
|
2017-04-18 19:36:58 +00:00
|
|
|
if (static_key_false(&generic_xdp_needed)) {
|
2017-09-08 21:00:30 +00:00
|
|
|
int ret;
|
2017-04-18 19:36:58 +00:00
|
|
|
|
2017-09-08 21:00:30 +00:00
|
|
|
preempt_disable();
|
|
|
|
rcu_read_lock();
|
|
|
|
ret = do_xdp_generic(rcu_dereference(skb->dev->xdp_prog), skb);
|
|
|
|
rcu_read_unlock();
|
|
|
|
preempt_enable();
|
|
|
|
|
|
|
|
if (ret != XDP_PASS)
|
2017-07-17 16:26:45 +00:00
|
|
|
return NET_RX_DROP;
|
2017-04-18 19:36:58 +00:00
|
|
|
}
|
|
|
|
|
2017-09-08 21:00:30 +00:00
|
|
|
rcu_read_lock();
|
2010-03-24 19:13:54 +00:00
|
|
|
#ifdef CONFIG_RPS
|
2012-02-24 07:31:31 +00:00
|
|
|
if (static_key_false(&rps_needed)) {
|
2010-05-16 06:57:10 +00:00
|
|
|
struct rps_dev_flow voidflow, *rflow = &voidflow;
|
2015-07-09 06:59:10 +00:00
|
|
|
int cpu = get_rps_cpu(skb->dev, skb, &rflow);
|
2010-03-16 08:03:29 +00:00
|
|
|
|
2010-05-16 06:57:10 +00:00
|
|
|
if (cpu >= 0) {
|
|
|
|
ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
|
|
|
|
rcu_read_unlock();
|
2011-11-17 03:13:26 +00:00
|
|
|
return ret;
|
2010-05-16 06:57:10 +00:00
|
|
|
}
|
rfs: Receive Flow Steering
This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).
The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure. The rxhash is passed in skb's received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.
The drawback of the simple approach is that it would potentially
allow OOO packets. If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.
To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.
rps_sock_flow_table is a global hash table. Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.
rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.
Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:
- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table. This checks if the queue tail has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.
Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to the interrupting CPU is good for locality; 2)
this allows lockless access to the table. The CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb; we assume that this is only called from
device napi_poll, which is non-reentrant.
This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.
There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table, the per rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue. Both are rounded to power of two.
The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
applications processing; this can result in increased performance
(higher pps, lower latency).
The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.
Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of the netperf TCP_RR
test with 1 byte req. and resp. The RPC test is a request/response
test similar in structure to the netperf RR test with 100 threads on
each host, but does more work in userspace than netperf.
e1000e on 8 core Intel
  No RFS or RPS               104K tps at 30% CPU
  No RFS (best RPS config):   290K tps at 63% CPU
  RFS                         303K tps at 61% CPU
RPC test      tps    CPU%   50/90/99% usec latency   Latency StdDev
  No RFS/RPS  103K   48%    757/900/3185             4472.35
  RPS only:   174K   73%    415/993/2468             491.66
  RFS         223K   73%    379/651/1382             315.61
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16 23:01:27 +00:00
|
|
|
}
|
2010-03-19 00:45:44 +00:00
|
|
|
#endif
|
2015-07-09 06:59:10 +00:00
|
|
|
ret = __netif_receive_skb(skb);
|
|
|
|
rcu_read_unlock();
|
|
|
|
return ret;
|
2010-03-16 08:03:29 +00:00
|
|
|
}
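The RFS commit message quoted inside the function above describes when a flow may move from its "current" CPU to the "desired" one: the current CPU is unset, it is offline, or its queue head counter has reached the tail counter saved at the flow's last enqueue. A minimal user-space sketch of that rule follows. All `sketch_*` names and `SKETCH_NO_CPU` are hypothetical and model only the counters the message mentions; the signed-difference comparison is an assumption made here so that counter wrap-around is tolerated.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NO_CPU (-1)	/* hypothetical marker for "no CPU recorded" */

/* Hypothetical model of one CPU's backlog queue counters. */
struct sketch_backlog {
	unsigned int head;	/* incremented on every dequeue */
	unsigned int qlen;	/* packets currently queued */
	bool online;
};

/* Hypothetical model of one rps_dev_flow-style table entry. */
struct sketch_flow {
	int cpu;			/* "current" CPU for the flow */
	unsigned int last_qtail;	/* tail counter saved at last enqueue */
};

/* Enqueue one packet and record the tail counter (head + length). */
static unsigned int sketch_enqueue(struct sketch_backlog *q,
				   struct sketch_flow *f)
{
	q->qlen++;
	f->last_qtail = q->head + q->qlen;
	return f->last_qtail;
}

/* Dequeue one packet; the head counter advances. */
static void sketch_dequeue(struct sketch_backlog *q)
{
	assert(q->qlen > 0);
	q->qlen--;
	q->head++;
}

/* May the flow be re-steered from its current CPU to `desired`? */
static bool sketch_may_switch(const struct sketch_flow *f,
			      const struct sketch_backlog *cur, int desired)
{
	if (f->cpu == desired)
		return false;		/* nothing to change */
	if (f->cpu == SKETCH_NO_CPU)
		return true;		/* current CPU is unset */
	if (!cur->online)
		return true;		/* current CPU is offline */
	/* Everything enqueued through this entry has been dequeued. */
	return (int)(cur->head - f->last_qtail) >= 0;
}

/*
 * Enqueue two packets of one flow on CPU 0, dequeue `done` of them
 * (done <= 2), then ask whether the flow may move to CPU 1.
 */
static bool sketch_scenario(unsigned int done)
{
	struct sketch_backlog q = { .head = 0, .qlen = 0, .online = true };
	struct sketch_flow f = { .cpu = 0, .last_qtail = 0 };
	unsigned int i;

	sketch_enqueue(&q, &f);
	sketch_enqueue(&q, &f);
	for (i = 0; i < done; i++)
		sketch_dequeue(&q);
	return sketch_may_switch(&f, &q, 1);
}
```

In this model the flow stays pinned to CPU 0 while either of its two packets is still queued there, which is the in-order guarantee the commit message is after.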
|
2014-01-10 22:17:24 +00:00
|
|
|
|
|
|
|
/**
|
|
|
|
* netif_receive_skb - process receive buffer from network
|
|
|
|
* @skb: buffer to process
|
|
|
|
*
|
|
|
|
* netif_receive_skb() is the main receive data processing function.
|
|
|
|
* It always succeeds. The buffer may be dropped during processing
|
|
|
|
* for congestion control or by the protocol layers.
|
|
|
|
*
|
|
|
|
* This function may only be called from softirq context and interrupts
|
|
|
|
* should be enabled.
|
|
|
|
*
|
|
|
|
* Return values (usually ignored):
|
|
|
|
* NET_RX_SUCCESS: no congestion
|
|
|
|
* NET_RX_DROP: packet was dropped
|
|
|
|
*/
|
2015-09-16 01:04:15 +00:00
|
|
|
int netif_receive_skb(struct sk_buff *skb)
|
2014-01-10 22:17:24 +00:00
|
|
|
{
|
|
|
|
trace_netif_receive_skb_entry(skb);
|
|
|
|
|
|
|
|
return netif_receive_skb_internal(skb);
|
|
|
|
}
|
2015-09-16 01:04:15 +00:00
|
|
|
EXPORT_SYMBOL(netif_receive_skb);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2016-08-26 19:50:39 +00:00
|
|
|
DEFINE_PER_CPU(struct work_struct, flush_works);
|
2016-08-25 13:58:44 +00:00
|
|
|
|
|
|
|
/* Network device is going away, flush any packets still pending */
|
|
|
|
static void flush_backlog(struct work_struct *work)
|
2008-08-04 04:29:57 +00:00
|
|
|
{
|
|
|
|
struct sk_buff *skb, *tmp;
|
2016-08-25 13:58:44 +00:00
|
|
|
struct softnet_data *sd;
|
|
|
|
|
|
|
|
local_bh_disable();
|
|
|
|
sd = this_cpu_ptr(&softnet_data);
|
2008-08-04 04:29:57 +00:00
|
|
|
|
2016-08-25 13:58:44 +00:00
|
|
|
local_irq_disable();
|
2010-04-19 21:17:14 +00:00
|
|
|
rps_lock(sd);
|
2010-04-27 22:07:33 +00:00
|
|
|
skb_queue_walk_safe(&sd->input_pkt_queue, skb, tmp) {
|
2016-08-26 19:50:39 +00:00
|
|
|
if (skb->dev->reg_state == NETREG_UNREGISTERING) {
|
2010-04-19 21:17:14 +00:00
|
|
|
__skb_unlink(skb, &sd->input_pkt_queue);
|
2008-08-04 04:29:57 +00:00
|
|
|
kfree_skb(skb);
|
2010-05-20 18:37:59 +00:00
|
|
|
input_queue_head_incr(sd);
|
2008-08-04 04:29:57 +00:00
|
|
|
}
|
2010-04-27 22:07:33 +00:00
|
|
|
}
|
2010-04-19 21:17:14 +00:00
|
|
|
rps_unlock(sd);
|
2016-08-25 13:58:44 +00:00
|
|
|
local_irq_enable();
|
2010-04-27 22:07:33 +00:00
|
|
|
|
|
|
|
skb_queue_walk_safe(&sd->process_queue, skb, tmp) {
|
2016-08-26 19:50:39 +00:00
|
|
|
if (skb->dev->reg_state == NETREG_UNREGISTERING) {
|
2010-04-27 22:07:33 +00:00
|
|
|
__skb_unlink(skb, &sd->process_queue);
|
|
|
|
kfree_skb(skb);
|
2010-05-20 18:37:59 +00:00
|
|
|
input_queue_head_incr(sd);
|
2010-04-27 22:07:33 +00:00
|
|
|
}
|
|
|
|
}
|
2016-08-25 13:58:44 +00:00
|
|
|
local_bh_enable();
|
|
|
|
}
|
|
|
|
|
2016-08-26 19:50:39 +00:00
|
|
|
static void flush_all_backlogs(void)
|
2016-08-25 13:58:44 +00:00
|
|
|
{
|
|
|
|
unsigned int cpu;
|
|
|
|
|
|
|
|
get_online_cpus();
|
|
|
|
|
2016-08-26 19:50:39 +00:00
|
|
|
for_each_online_cpu(cpu)
|
|
|
|
queue_work_on(cpu, system_highpri_wq,
|
|
|
|
per_cpu_ptr(&flush_works, cpu));
|
2016-08-25 13:58:44 +00:00
|
|
|
|
|
|
|
for_each_online_cpu(cpu)
|
2016-08-26 19:50:39 +00:00
|
|
|
flush_work(per_cpu_ptr(&flush_works, cpu));
|
2016-08-25 13:58:44 +00:00
|
|
|
|
|
|
|
put_online_cpus();
|
2008-08-04 04:29:57 +00:00
|
|
|
}
|
|
|
|
|
net: Add Generic Receive Offload infrastructure
This patch adds the top-level GRO (Generic Receive Offload) infrastructure.
This is pretty similar to LRO except that this is protocol-independent.
Instead of holding packets in an lro_mgr structure, they're now held in
napi_struct.
For drivers that intend to use this, they can set the NETIF_F_GRO bit and
call napi_gro_receive instead of netif_receive_skb or just call netif_rx.
The latter will call napi_receive_skb automatically. When napi_gro_receive
is used, the driver must either call napi_complete/napi_rx_complete, or
call napi_gro_flush in softirq context if the driver uses the primitives
__napi_complete/__napi_rx_complete.
Protocols will set the gro_receive and gro_complete function pointers in
order to participate in this scheme.
In addition to the packet, gro_receive will get a list of currently held
packets. Each packet in the list has a same_flow field which is non-zero
if it is a potential match for the new packet. For each packet that may
match, they also have a flush field which is non-zero if the held packet
must not be merged with the new packet.
Once gro_receive has determined that the new skb matches a held packet,
the held packet may be processed immediately if the new skb cannot be
merged with it. In this case gro_receive should return the pointer to
the existing skb in gro_list. Otherwise the new skb should be merged into
the existing packet and NULL should be returned, unless the new skb makes
it impossible for any further merges to be made (e.g., FIN packet) where
the merged skb should be returned.
Whenever the skb is merged into an existing entry, the gro_receive
function should set NAPI_GRO_CB(skb)->same_flow. Note that if an skb
merely matches an existing entry but can't be merged with it, then
this shouldn't be set.
If gro_receive finds it pointless to hold the new skb for future merging,
it should set NAPI_GRO_CB(skb)->flush.
Held packets will be flushed by napi_gro_flush which is called by
napi_complete and napi_rx_complete.
Currently held packets are stored in a singly linked list just like LRO.
The list is limited to a maximum of 8 entries. In future, this may be
expanded to use a hash table to allow more flows to be held for merging.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-16 07:38:52 +00:00
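The GRO commit message above describes holding packets per flow, merging a new packet into a held one when the flows match, and capping the held list at 8 entries. The following user-space sketch illustrates only that bookkeeping, under stated simplifications: flows are identified by a hypothetical integer `flow_id`, every same-flow packet is assumed mergeable, and the flush and completion paths are left out. None of the `sketch_*` names exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_HELD 8	/* the held-list limit stated in the message */

/* Hypothetical held entry: one per flow currently being aggregated. */
struct sketch_held {
	unsigned int flow_id;	/* hypothetical flow key */
	unsigned int count;	/* segments merged into this entry */
};

struct sketch_gro_list {
	struct sketch_held held[SKETCH_MAX_HELD];
	size_t n;		/* entries in use */
};

enum sketch_result {
	SKETCH_MERGED,		/* merged into an existing held entry */
	SKETCH_HELD,		/* stored as a new held entry */
	SKETCH_NORMAL		/* list full: take the normal receive path */
};

/* Offer one packet of `flow_id` to the held list. */
static enum sketch_result sketch_gro_receive(struct sketch_gro_list *l,
					     unsigned int flow_id)
{
	size_t i;

	/* Look for a held packet of the same flow and merge into it. */
	for (i = 0; i < l->n; i++) {
		if (l->held[i].flow_id == flow_id) {
			l->held[i].count++;
			return SKETCH_MERGED;
		}
	}

	/* No match: hold the packet if there is room, else pass it on. */
	if (l->n >= SKETCH_MAX_HELD)
		return SKETCH_NORMAL;

	l->held[l->n].flow_id = flow_id;
	l->held[l->n].count = 1;
	l->n++;
	return SKETCH_HELD;
}

/* Offer one packet for each of `flows` distinct flows; return entries held. */
static size_t sketch_held_after(unsigned int flows)
{
	struct sketch_gro_list l = { .n = 0 };
	unsigned int id;

	for (id = 0; id < flows; id++)
		sketch_gro_receive(&l, id);
	return l.n;
}

/* Offer `packets` packets of a single flow; return the merged count. */
static unsigned int sketch_merge_count(unsigned int packets)
{
	struct sketch_gro_list l = { .n = 0 };
	unsigned int i;

	for (i = 0; i < packets; i++)
		sketch_gro_receive(&l, 42);
	return l.n ? l.held[0].count : 0;
}
```

A linear scan is adequate for a list this short; the message itself notes that a hash table might replace it if more flows need to be held.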
|
|
|
static int napi_gro_complete(struct sk_buff *skb)
|
|
|
|
{
|
2012-11-15 08:49:11 +00:00
|
|
|
struct packet_offload *ptype;
|
2008-12-16 07:38:52 +00:00
|
|
|
__be16 type = skb->protocol;
|
2012-11-15 08:49:11 +00:00
|
|
|
struct list_head *head = &offload_base;
|
2008-12-16 07:38:52 +00:00
|
|
|
int err = -ENOENT;
|
|
|
|
|
2012-12-06 13:54:59 +00:00
|
|
|
BUILD_BUG_ON(sizeof(struct napi_gro_cb) > sizeof(skb->cb));
|
|
|
|
|
2009-04-14 22:11:06 +00:00
|
|
|
if (NAPI_GRO_CB(skb)->count == 1) {
|
|
|
|
skb_shinfo(skb)->gso_size = 0;
|
2008-12-16 07:38:52 +00:00
|
|
|
goto out;
|
2009-04-14 22:11:06 +00:00
|
|
|
}
|
2008-12-16 07:38:52 +00:00
|
|
|
|
|
|
|
rcu_read_lock();
|
|
|
|
list_for_each_entry_rcu(ptype, head, list) {
|
2012-11-15 08:49:23 +00:00
|
|
|
if (ptype->type != type || !ptype->callbacks.gro_complete)
|
2008-12-16 07:38:52 +00:00
|
|
|
continue;
|
|
|
|
|
net-gro: Prepare GRO stack for the upcoming tunneling support
This patch modifies the GRO stack to avoid the use of "network_header"
and associated macros like ip_hdr() and ipv6_hdr() in order to allow
an arbitrary number of IP hdrs (v4 or v6) to be used in the
encapsulation chain. This lays the foundation for various IP
tunneling support (IP-in-IP, GRE, VXLAN, SIT,...) to be added later.
With this patch, the GRO stack traversing now is mostly based on
skb_gro_offset rather than special hdr offsets saved in skb (e.g.,
skb->network_header). As a result all but the top layer (i.e., the
transport layer) must have hdrs of the same length in order for
a pkt to be considered for aggregation. Therefore when adding a new
encap layer (e.g., for tunneling), one must check and skip flows
(e.g., by setting NAPI_GRO_CB(p)->same_flow to 0) that have a
different hdr length.
Note that unlike the network header, the transport header can and
will continue to be set by the GRO code since there will be at
most one "transport layer" in the encap chain.
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-12 04:53:45 +00:00
|
|
|
err = ptype->callbacks.gro_complete(skb, 0);
|
net: Add Generic Receive Offload infrastructure
This patch adds the top-level GRO (Generic Receive Offload) infrastructure.
This is pretty similar to LRO except that this is protocol-independent.
Instead of holding packets in an lro_mgr structure, they're now held in
napi_struct.
For drivers that intend to use this, they can set the NETIF_F_GRO bit and
call napi_gro_receive instead of netif_receive_skb or just call netif_rx.
The latter will call napi_receive_skb automatically. When napi_gro_receive
is used, the driver must either call napi_complete/napi_rx_complete, or
call napi_gro_flush in softirq context if the driver uses the primitives
__napi_complete/__napi_rx_complete.
Protocols will set the gro_receive and gro_complete function pointers in
order to participate in this scheme.
In addition to the packet, gro_receive will get a list of currently held
packets. Each packet in the list has a same_flow field which is non-zero
if it is a potential match for the new packet. For each packet that may
match, they also have a flush field which is non-zero if the held packet
must not be merged with the new packet.
Once gro_receive has determined that the new skb matches a held packet,
the held packet may be processed immediately if the new skb cannot be
merged with it. In this case gro_receive should return the pointer to
the existing skb in gro_list. Otherwise the new skb should be merged into
the existing packet and NULL should be returned, unless the new skb makes
it impossible for any further merges to be made (e.g., FIN packet) where
the merged skb should be returned.
Whenever the skb is merged into an existing entry, the gro_receive
function should set NAPI_GRO_CB(skb)->same_flow. Note that if an skb
merely matches an existing entry but can't be merged with it, then
this shouldn't be set.
If gro_receive finds it pointless to hold the new skb for future merging,
it should set NAPI_GRO_CB(skb)->flush.
Held packets will be flushed by napi_gro_flush which is called by
napi_complete and napi_rx_complete.
Currently held packets are stored in a singly linked list just like LRO.
The list is limited to a maximum of 8 entries. In future, this may be
expanded to use a hash table to allow more flows to be held for merging.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-16 07:38:52 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
if (err) {
|
|
|
|
WARN_ON(&ptype->list == head);
|
|
|
|
kfree_skb(skb);
|
|
|
|
return NET_RX_SUCCESS;
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
2014-01-10 22:17:24 +00:00
|
|
|
return netif_receive_skb_internal(skb);
|
2008-12-16 07:38:52 +00:00
|
|
|
}
|
|
|
|
|
2012-10-06 08:08:49 +00:00
|
|
|
/* napi->gro_list contains packets ordered by age.
|
|
|
|
* youngest packets at the head of it.
|
|
|
|
* Complete skbs in reverse order to reduce latencies.
|
|
|
|
*/
|
|
|
|
void napi_gro_flush(struct napi_struct *napi, bool flush_old)
|
2008-12-16 07:38:52 +00:00
|
|
|
{
|
2012-10-06 08:08:49 +00:00
|
|
|
struct sk_buff *skb, *prev = NULL;
|
2008-12-16 07:38:52 +00:00
|
|
|
|
2012-10-06 08:08:49 +00:00
|
|
|
/* scan list and build reverse chain */
|
|
|
|
for (skb = napi->gro_list; skb != NULL; skb = skb->next) {
|
|
|
|
skb->prev = prev;
|
|
|
|
prev = skb;
|
|
|
|
}
|
|
|
|
|
|
|
|
for (skb = prev; skb; skb = prev) {
|
2008-12-16 07:38:52 +00:00
|
|
|
skb->next = NULL;
|
2012-10-06 08:08:49 +00:00
|
|
|
|
|
|
|
if (flush_old && NAPI_GRO_CB(skb)->age == jiffies)
|
|
|
|
return;
|
|
|
|
|
|
|
|
prev = skb->prev;
|
2008-12-16 07:38:52 +00:00
|
|
|
napi_gro_complete(skb);
|
2012-10-06 08:08:49 +00:00
|
|
|
napi->gro_count--;
|
2008-12-16 07:38:52 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
napi->gro_list = NULL;
|
|
|
|
}
|
2010-08-31 18:25:32 +00:00
|
|
|
EXPORT_SYMBOL(napi_gro_flush);
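The two-pass walk in napi_gro_flush can be tried outside the kernel. What follows is a minimal userspace sketch, not the kernel code: `struct pkt`, `struct pkt_list` and `flush_oldest_first` are hypothetical stand-ins for `struct sk_buff`, the `gro_list`/`gro_count` pair and the flush routine, the `now` argument stands in for `jiffies`, and "completing" a packet is reduced to recording its id in an output array.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sk_buff: only what the flush loop touches. */
struct pkt {
	struct pkt *next;
	struct pkt *prev;
	unsigned long age;	/* tick at which the packet was queued */
	int id;
};

/* Hypothetical stand-in for the gro_list / gro_count pair. */
struct pkt_list {
	struct pkt *head;	/* youngest packet first */
	unsigned int count;
};

/*
 * Complete packets oldest-first. When flush_old is set, stop at the first
 * packet that was queued during the current tick "now" and leave it, and
 * everything younger, on the list. Returns the number of ids stored in
 * out[]; the caller is assumed to pass out_cap >= l->count.
 */
static size_t flush_oldest_first(struct pkt_list *l, bool flush_old,
				 unsigned long now, int *out, size_t out_cap)
{
	struct pkt *p, *prev = NULL;
	size_t n = 0;

	/* Pass 1: record back-pointers so the list can be walked from the tail. */
	for (p = l->head; p != NULL; p = p->next) {
		p->prev = prev;
		prev = p;
	}

	/* Pass 2: walk from the tail (oldest) towards the head (youngest). */
	for (p = prev; p != NULL; p = prev) {
		p->next = NULL;		/* p is now the tail of what remains */

		if (flush_old && p->age == now)
			return n;

		prev = p->prev;
		if (n < out_cap)
			out[n++] = p->id;
		l->count--;
	}

	l->head = NULL;
	return n;
}
```

With three packets queued at ticks 3, 4 and 5 (youngest at the head), a call with `flush_old` set and `now == 5` completes the two older packets and leaves the youngest on the list; a second call with `flush_old` clear drains it.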
|
2008-12-16 07:38:52 +00:00
|
|
|
|
2012-12-10 13:28:16 +00:00
|
|
|
static void gro_list_prepare(struct napi_struct *napi, struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
struct sk_buff *p;
|
|
|
|
unsigned int maclen = skb->dev->hard_header_len;
|
2014-01-15 16:58:06 +00:00
|
|
|
u32 hash = skb_get_hash_raw(skb);
|
2012-12-10 13:28:16 +00:00
|
|
|
|
|
|
|
for (p = napi->gro_list; p; p = p->next) {
|
|
|
|
unsigned long diffs;
|
|
|
|
|
2014-01-15 16:58:06 +00:00
|
|
|
NAPI_GRO_CB(p)->flush = 0;
|
|
|
|
|
|
|
|
if (hash != skb_get_hash_raw(p)) {
|
|
|
|
NAPI_GRO_CB(p)->same_flow = 0;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2012-12-10 13:28:16 +00:00
|
|
|
diffs = (unsigned long)p->dev ^ (unsigned long)skb->dev;
|
|
|
|
diffs |= p->vlan_tci ^ skb->vlan_tci;
|
2016-01-21 01:59:49 +00:00
|
|
|
diffs |= skb_metadata_dst_cmp(p, skb);
|
bpf: add meta pointer for direct access
This work enables generic transfer of metadata from XDP into skb. The
basic idea is that we can make use of the fact that the resulting skb
must be linear and already comes with a larger headroom for supporting
bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
for adjusting a new pointer called xdp->data_meta. Thus, the packet has
a flexible and programmable room for meta data, followed by the actual
packet data. struct xdp_buff is therefore laid out such that we first point
to data_hard_start, then data_meta directly prepended to data followed
by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
account whether we have meta data already prepended and if so, memmove()s
this along with the given offset provided there's enough room.
xdp->data_meta is optional and programs are not required to use it. The
rationale is that when we process the packet in XDP (e.g. as DoS filter),
we can push further meta data along with it for the XDP_PASS case, and
give the guarantee that a clsact ingress BPF program on the same device
can pick this up for further post-processing. Since we work with skb
there, we can also set skb->mark, skb->priority or other skb meta data
out of BPF, thus having this scratch space generic and programmable
allows for more flexibility than defining a direct 1:1 transfer of
potentially new XDP members into skb (it's also more efficient as we
don't need to initialize/handle each of such new members). The facility
also works together with GRO aggregation. The scratch space at the head
of the packet can be a multiple of 4 bytes, up to 32 bytes in size. Drivers not
yet supporting xdp->data_meta can simply be set up with xdp->data_meta
as xdp->data + 1 as bpf_xdp_adjust_meta() will detect this and bail out,
such that the subsequent match against xdp->data for later access is
guaranteed to fail.
The verifier treats xdp->data_meta/xdp->data the same way as we treat
xdp->data/xdp->data_end pointer comparisons. The requirement for doing
the compare against xdp->data is that it hasn't been modified from its
original address we got from ctx access. It may have a range marking
already from prior successful xdp->data/xdp->data_end pointer comparisons
though.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-09-25 00:25:51 +00:00
|
|
|
diffs |= skb_metadata_differs(p, skb);
|
2012-12-10 13:28:16 +00:00
|
|
|
if (maclen == ETH_HLEN)
|
|
|
|
diffs |= compare_ether_header(skb_mac_header(p),
|
2014-03-30 04:28:21 +00:00
|
|
|
skb_mac_header(skb));
|
2012-12-10 13:28:16 +00:00
|
|
|
else if (!diffs)
|
|
|
|
diffs = memcmp(skb_mac_header(p),
|
2014-03-30 04:28:21 +00:00
|
|
|
skb_mac_header(skb),
|
2012-12-10 13:28:16 +00:00
|
|
|
maclen);
|
|
|
|
NAPI_GRO_CB(p)->same_flow = !diffs;
|
|
|
|
}
|
|
|
|
}
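The matching logic in gro_list_prepare relies on a branch-light idiom: XOR each pair of fields (zero only when they are equal) and OR the results into one accumulator, so a single test at the end decides whether anything differed. A minimal userspace sketch of that idiom follows. `struct flow_key` and `same_flow` are hypothetical names invented for illustration, and the sketch always uses `memcmp` for the MAC header, whereas the kernel code uses `compare_ether_header` when the header length is `ETH_HLEN`.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical flow key; field names are illustrative, not kernel API. */
struct flow_key {
	uint32_t hash;
	uintptr_t dev;		/* identity of the receiving device */
	uint16_t vlan_tci;
	uint8_t mac[14];	/* link-layer header bytes */
};

/* Returns 1 when a and b may belong to the same flow, 0 otherwise. */
static int same_flow(const struct flow_key *a, const struct flow_key *b)
{
	uintptr_t diffs;

	/* Cheap rejection first: different hashes cannot be the same flow. */
	if (a->hash != b->hash)
		return 0;

	/* XOR is zero only for equal values; OR accumulates any mismatch. */
	diffs = a->dev ^ b->dev;
	diffs |= (uintptr_t)(a->vlan_tci ^ b->vlan_tci);
	diffs |= (uintptr_t)(memcmp(a->mac, b->mac, sizeof(a->mac)) != 0);

	return !diffs;
}
```

Accumulating into `diffs` rather than returning at the first mismatch keeps the common path free of conditional branches; only the hash comparison exits early.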
|
|
|
|
|
net-gro: Prepare GRO stack for the upcoming tunneling support
This patch modifies the GRO stack to avoid the use of "network_header"
and associated macros like ip_hdr() and ipv6_hdr() in order to allow
an arbitary number of IP hdrs (v4 or v6) to be used in the
encapsulation chain. This lays the foundation for various IP
tunneling support (IP-in-IP, GRE, VXLAN, SIT,...) to be added later.
With this patch, the GRO stack traversing now is mostly based on
skb_gro_offset rather than special hdr offsets saved in skb (e.g.,
skb->network_header). As a result all but the top layer (i.e., the
transport layer) must have hdrs of the same length in order for
a pkt to be considered for aggregation. Therefore when adding a new
encap layer (e.g., for tunneling), one must check and skip flows
(e.g., by setting NAPI_GRO_CB(p)->same_flow to 0) that have a
different hdr length.
Note that unlike the network header, the transport header can and
will continue to be set by the GRO code since there will be at
most one "transport layer" in the encap chain.
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-12 04:53:45 +00:00
|
|
|
static void skb_gro_reset_offset(struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
const struct skb_shared_info *pinfo = skb_shinfo(skb);
|
|
|
|
const skb_frag_t *frag0 = &pinfo->frags[0];
|
|
|
|
|
|
|
|
NAPI_GRO_CB(skb)->data_offset = 0;
|
|
|
|
NAPI_GRO_CB(skb)->frag0 = NULL;
|
|
|
|
NAPI_GRO_CB(skb)->frag0_len = 0;
|
|
|
|
|
|
|
|
if (skb_mac_header(skb) == skb_tail_pointer(skb) &&
|
|
|
|
pinfo->nr_frags &&
|
|
|
|
!PageHighMem(skb_frag_page(frag0))) {
|
|
|
|
NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0);
|
2017-01-11 03:52:43 +00:00
|
|
|
NAPI_GRO_CB(skb)->frag0_len = min_t(unsigned int,
|
|
|
|
skb_frag_size(frag0),
|
|
|
|
skb->end - skb->tail);
|
2012-12-10 13:28:16 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-03-30 04:28:21 +00:00
|
|
|
static void gro_pull_from_frag0(struct sk_buff *skb, int grow)
|
|
|
|
{
|
|
|
|
struct skb_shared_info *pinfo = skb_shinfo(skb);
|
|
|
|
|
|
|
|
BUG_ON(skb->end - skb->tail < grow);
|
|
|
|
|
|
|
|
memcpy(skb_tail_pointer(skb), NAPI_GRO_CB(skb)->frag0, grow);
|
|
|
|
|
|
|
|
skb->data_len -= grow;
|
|
|
|
skb->tail += grow;
|
|
|
|
|
|
|
|
pinfo->frags[0].page_offset += grow;
|
|
|
|
skb_frag_size_sub(&pinfo->frags[0], grow);
|
|
|
|
|
|
|
|
if (unlikely(!skb_frag_size(&pinfo->frags[0]))) {
|
|
|
|
skb_frag_unref(skb, 0);
|
|
|
|
memmove(pinfo->frags, pinfo->frags + 1,
|
|
|
|
--pinfo->nr_frags * sizeof(pinfo->frags[0]));
|
|
|
|
}
|
|
|
|
}
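The bookkeeping in gro_pull_from_frag0 (copy `grow` bytes from the first fragment into the linear area, shrink the fragment, and drop it from the array once it is empty) can be sketched in userspace as below. All names here (`struct frag`, `struct buf`, `pull_from_frag0`, `MAX_FRAGS`) are hypothetical, and the sketch returns -1 on a bad request where the kernel code uses `BUG_ON`.

```c
#include <stddef.h>
#include <string.h>

#define MAX_FRAGS 4

/* Hypothetical fragment descriptor: a window into some external buffer. */
struct frag {
	const unsigned char *base;
	size_t offset;
	size_t size;
};

/* Hypothetical buffer: a linear area followed by a fragment array. */
struct buf {
	unsigned char linear[64];
	size_t tail;			/* bytes used in linear[] */
	size_t data_len;		/* bytes still held in fragments */
	struct frag frags[MAX_FRAGS];
	unsigned int nr_frags;
};

/* Returns 0 on success, -1 if the request cannot be satisfied. */
static int pull_from_frag0(struct buf *b, size_t grow)
{
	struct frag *f0 = &b->frags[0];

	if (b->nr_frags == 0 || grow > f0->size ||
	    grow > sizeof(b->linear) - b->tail)
		return -1;

	/* Move the bytes to the end of the linear area. */
	memcpy(b->linear + b->tail, f0->base + f0->offset, grow);
	b->tail += grow;
	b->data_len -= grow;

	/* Shrink the fragment from the front. */
	f0->offset += grow;
	f0->size -= grow;

	/* An emptied fragment is removed by shifting the rest down one slot. */
	if (f0->size == 0) {
		b->nr_frags--;
		memmove(b->frags, b->frags + 1,
			b->nr_frags * sizeof(b->frags[0]));
	}
	return 0;
}
```

`memmove` rather than `memcpy` is needed for the shift because source and destination overlap within the same array.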
|
|
|
|
|
2012-11-28 21:55:25 +00:00
|
|
|
static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
|
2008-12-16 07:38:52 +00:00
|
|
|
{
|
|
|
|
struct sk_buff **pp = NULL;
|
2012-11-15 08:49:11 +00:00
|
|
|
struct packet_offload *ptype;
|
2008-12-16 07:38:52 +00:00
|
|
|
__be16 type = skb->protocol;
|
2012-11-15 08:49:11 +00:00
|
|
|
struct list_head *head = &offload_base;
|
2008-12-26 22:57:42 +00:00
|
|
|
int same_flow;
|
2009-10-29 07:17:09 +00:00
|
|
|
enum gro_result ret;
|
2014-03-30 04:28:21 +00:00
|
|
|
int grow;
|
net: Add Generic Receive Offload infrastructure
This patch adds the top-level GRO (Generic Receive Offload) infrastructure.
This is pretty similar to LRO except that this is protocol-independent.
Instead of holding packets in an lro_mgr structure, they're now held in
napi_struct.
For drivers that intend to use this, they can set the NETIF_F_GRO bit and
call napi_gro_receive instead of netif_receive_skb or just call netif_rx.
The latter will call napi_receive_skb automatically. When napi_gro_receive
is used, the driver must either call napi_complete/napi_rx_complete, or
call napi_gro_flush in softirq context if the driver uses the primitives
__napi_complete/__napi_rx_complete.
Protocols will set the gro_receive and gro_complete function pointers in
order to participate in this scheme.
In addition to the packet, gro_receive will get a list of currently held
packets. Each packet in the list has a same_flow field which is non-zero
if it is a potential match for the new packet. For each packet that may
match, they also have a flush field which is non-zero if the held packet
must not be merged with the new packet.
Once gro_receive has determined that the new skb matches a held packet,
the held packet may be processed immediately if the new skb cannot be
merged with it. In this case gro_receive should return the pointer to
the existing skb in gro_list. Otherwise the new skb should be merged into
the existing packet and NULL should be returned, unless the new skb makes
it impossible for any further merges to be made (e.g., FIN packet) where
the merged skb should be returned.
Whenever the skb is merged into an existing entry, the gro_receive
function should set NAPI_GRO_CB(skb)->same_flow. Note that if an skb
merely matches an existing entry but can't be merged with it, then
this shouldn't be set.
If gro_receive finds it pointless to hold the new skb for future merging,
it should set NAPI_GRO_CB(skb)->flush.
Held packets will be flushed by napi_gro_flush which is called by
napi_complete and napi_rx_complete.
Currently held packets are stored in a singly linked list just like LRO.
The list is limited to a maximum of 8 entries. In future, this may be
expanded to use a hash table to allow more flows to be held for merging.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
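
The 8-entry cap on the held list described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel implementation: `held_pkt`, `held_list`, `held_list_add` and `MAX_HELD` are hypothetical names, the packet is reduced to a flow id, and new entries are assumed to go at the head so that the tail is the oldest one.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HELD 8 /* mirrors the 8-entry limit in the commit message */

/* Hypothetical stand-in for a held packet; not the kernel's sk_buff. */
struct held_pkt {
	struct held_pkt *next;
	int flow_id;
};

/* Hypothetical stand-in for the per-NAPI list head and counter. */
struct held_list {
	struct held_pkt *head;
	int count;
};

/*
 * Hold pkt at the head of the list. If the list is already full,
 * unlink the tail (oldest) entry and return it so the caller can
 * complete it; otherwise return NULL.
 */
static struct held_pkt *held_list_add(struct held_list *l, struct held_pkt *pkt)
{
	struct held_pkt *evicted = NULL;

	assert(l != NULL && pkt != NULL);

	if (l->count >= MAX_HELD) {
		struct held_pkt **pp = &l->head;

		/* Walk to the link that points at the last entry. */
		while ((*pp)->next)
			pp = &(*pp)->next;
		evicted = *pp;
		*pp = NULL;	/* count is unchanged: one out, one in */
	} else {
		l->count++;
	}

	pkt->next = l->head;
	l->head = pkt;
	return evicted;
}
```

A caller would pass each new packet to `held_list_add` and, when it returns non-NULL, hand the evicted entry up the stack. Walking a pointer-to-pointer avoids a special case for a one-element list, at the cost of an O(n) scan that is cheap for n of at most 8.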
2008-12-16 07:38:52 +00:00
|
|
|
|
2017-04-18 19:36:58 +00:00
|
|
|
if (netif_elide_gro(skb->dev))
|
2008-12-16 07:38:52 +00:00
|
|
|
goto normal;
|
|
|
|
|
2012-12-10 13:28:16 +00:00
|
|
|
gro_list_prepare(napi, skb);
|
|
|
|
|
2008-12-16 07:38:52 +00:00
|
|
|
rcu_read_lock();
|
|
|
|
list_for_each_entry_rcu(ptype, head, list) {
|
2012-11-15 08:49:23 +00:00
|
|
|
if (ptype->type != type || !ptype->callbacks.gro_receive)
|
2008-12-16 07:38:52 +00:00
|
|
|
continue;
|
|
|
|
|
2009-01-29 14:19:50 +00:00
|
|
|
skb_set_network_header(skb, skb_gro_offset(skb));
|
2013-02-14 17:31:48 +00:00
|
|
|
skb_reset_mac_len(skb);
|
2008-12-16 07:38:52 +00:00
|
|
|
NAPI_GRO_CB(skb)->same_flow = 0;
|
2016-11-07 19:12:27 +00:00
|
|
|
NAPI_GRO_CB(skb)->flush = skb_is_gso(skb) || skb_has_frag_list(skb);
|
2009-01-05 00:13:40 +00:00
|
|
|
NAPI_GRO_CB(skb)->free = 0;
|
2016-03-19 16:32:01 +00:00
|
|
|
NAPI_GRO_CB(skb)->encap_mark = 0;
|
2016-10-20 13:58:02 +00:00
|
|
|
NAPI_GRO_CB(skb)->recursion_counter = 0;
|
2016-04-05 16:13:39 +00:00
|
|
|
NAPI_GRO_CB(skb)->is_fou = 0;
|
2016-04-11 01:44:57 +00:00
|
|
|
NAPI_GRO_CB(skb)->is_atomic = 1;
|
2015-02-11 00:30:31 +00:00
|
|
|
NAPI_GRO_CB(skb)->gro_remcsum_start = 0;
|
2008-12-16 07:38:52 +00:00
|
|
|
|
2014-08-28 04:26:56 +00:00
|
|
|
/* Setup for GRO checksum validation */
|
|
|
|
switch (skb->ip_summed) {
|
|
|
|
case CHECKSUM_COMPLETE:
|
|
|
|
NAPI_GRO_CB(skb)->csum = skb->csum;
|
|
|
|
NAPI_GRO_CB(skb)->csum_valid = 1;
|
|
|
|
NAPI_GRO_CB(skb)->csum_cnt = 0;
|
|
|
|
break;
|
|
|
|
case CHECKSUM_UNNECESSARY:
|
|
|
|
NAPI_GRO_CB(skb)->csum_cnt = skb->csum_level + 1;
|
|
|
|
NAPI_GRO_CB(skb)->csum_valid = 0;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
NAPI_GRO_CB(skb)->csum_cnt = 0;
|
|
|
|
NAPI_GRO_CB(skb)->csum_valid = 0;
|
|
|
|
}
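
The checksum setup in the switch above can be read as a pure function from the packet's checksum status to three per-packet fields. The following userspace sketch restates that mapping; `csum_state`, `rx_pkt`, `gro_csum` and `gro_csum_init` are hypothetical names introduced only for illustration and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's CHECKSUM_* states. */
enum csum_state { CSUM_NONE, CSUM_UNNECESSARY, CSUM_COMPLETE };

/* Hypothetical reduced view of the fields the switch reads. */
struct rx_pkt {
	enum csum_state ip_summed;
	uint32_t csum;		/* used only for CSUM_COMPLETE */
	uint8_t csum_level;	/* used only for CSUM_UNNECESSARY */
};

/* Hypothetical reduced view of the fields the switch writes. */
struct gro_csum {
	uint32_t csum;
	bool csum_valid;
	uint8_t csum_cnt;
};

static struct gro_csum gro_csum_init(const struct rx_pkt *pkt)
{
	struct gro_csum cb = { 0, false, 0 };

	assert(pkt != NULL);

	switch (pkt->ip_summed) {
	case CSUM_COMPLETE:
		/* A full sum is available: carry it along and mark it valid. */
		cb.csum = pkt->csum;
		cb.csum_valid = true;
		break;
	case CSUM_UNNECESSARY:
		/* The device vouches for csum_level + 1 checksums. */
		cb.csum_cnt = pkt->csum_level + 1;
		break;
	default:
		/* Nothing known: count 0, not valid. */
		break;
	}
	return cb;
}
```

Unlike the original, the sketch always zeroes `csum`, whereas the switch above leaves it untouched outside the `CHECKSUM_COMPLETE` case.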
|
2008-12-16 07:38:52 +00:00
|
|
|
|
2012-11-15 08:49:23 +00:00
|
|
|
pp = ptype->callbacks.gro_receive(&napi->gro_list, skb);
|
2008-12-16 07:38:52 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
|
|
|
|
if (&ptype->list == head)
|
|
|
|
goto normal;
|
|
|
|
|
2017-02-15 08:39:44 +00:00
|
|
|
if (IS_ERR(pp) && PTR_ERR(pp) == -EINPROGRESS) {
|
|
|
|
ret = GRO_CONSUMED;
|
|
|
|
goto ok;
|
|
|
|
}
|
|
|
|
|
2008-12-26 22:57:42 +00:00
|
|
|
same_flow = NAPI_GRO_CB(skb)->same_flow;
|
2009-01-29 14:19:48 +00:00
|
|
|
ret = NAPI_GRO_CB(skb)->free ? GRO_MERGED_FREE : GRO_MERGED;
|
2008-12-26 22:57:42 +00:00
|
|
|
|
2008-12-16 07:38:52 +00:00
|
|
|
if (pp) {
|
|
|
|
struct sk_buff *nskb = *pp;
|
|
|
|
|
|
|
|
*pp = nskb->next;
|
|
|
|
nskb->next = NULL;
|
|
|
|
napi_gro_complete(nskb);
|
2009-02-08 18:00:36 +00:00
|
|
|
napi->gro_count--;
|
2008-12-16 07:38:52 +00:00
|
|
|
}
|
|
|
|
|
2008-12-26 22:57:42 +00:00
|
|
|
if (same_flow)
|
2008-12-16 07:38:52 +00:00
|
|
|
goto ok;
|
|
|
|
|
2014-01-09 22:12:19 +00:00
|
|
|
if (NAPI_GRO_CB(skb)->flush)
|
2008-12-16 07:38:52 +00:00
|
|
|
goto normal;
|
|
|
|
|
2014-01-09 22:12:19 +00:00
|
|
|
if (unlikely(napi->gro_count >= MAX_GRO_SKBS)) {
|
|
|
|
struct sk_buff *nskb = napi->gro_list;
|
|
|
|
|
|
|
|
/* locate the end of the list to select the 'oldest' flow */
|
|
|
|
while (nskb->next) {
|
|
|
|
pp = &nskb->next;
|
|
|
|
nskb = *pp;
|
|
|
|
}
|
|
|
|
*pp = NULL;
|
|
|
|
nskb->next = NULL;
|
|
|
|
napi_gro_complete(nskb);
|
|
|
|
} else {
|
|
|
|
napi->gro_count++;
|
|
|
|
}
|
net: Add Generic Receive Offload infrastructure
This patch adds the top-level GRO (Generic Receive Offload) infrastructure.
This is pretty similar to LRO except that this is protocol-independent.
Instead of holding packets in an lro_mgr structure, they're now held in
napi_struct.
For drivers that intend to use this, they can set the NETIF_F_GRO bit and
call napi_gro_receive instead of netif_receive_skb or just call netif_rx.
The latter will call napi_receive_skb automatically. When napi_gro_receive
is used, the driver must either call napi_complete/napi_rx_complete, or
call napi_gro_flush in softirq context if the driver uses the primitives
__napi_complete/__napi_rx_complete.
Protocols will set the gro_receive and gro_complete function pointers in
order to participate in this scheme.
In addition to the packet, gro_receive will get a list of currently held
packets. Each packet in the list has a same_flow field which is non-zero
if it is a potential match for the new packet. For each packet that may
match, they also have a flush field which is non-zero if the held packet
must not be merged with the new packet.
Once gro_receive has determined that the new skb matches a held packet,
the held packet may be processed immediately if the new skb cannot be
merged with it. In this case gro_receive should return the pointer to
the existing skb in gro_list. Otherwise the new skb should be merged into
the existing packet and NULL should be returned, unless the new skb makes
it impossible for any further merges to be made (e.g., FIN packet) where
the merged skb should be returned.
Whenever the skb is merged into an existing entry, the gro_receive
function should set NAPI_GRO_CB(skb)->same_flow. Note that if an skb
merely matches an existing entry but can't be merged with it, then
this shouldn't be set.
If gro_receive finds it pointless to hold the new skb for future merging,
it should set NAPI_GRO_CB(skb)->flush.
Held packets will be flushed by napi_gro_flush which is called by
napi_complete and napi_rx_complete.
Currently held packets are stored in a singly linked list just like LRO.
The list is limited to a maximum of 8 entries. In future, this may be
expanded to use a hash table to allow more flows to be held for merging.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
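The held-list behaviour described in the commit message above (a singly linked list capped at 8 entries, with the oldest entry completed to make room for a new one) can be modelled in user space. This is a minimal sketch under stated assumptions, not the kernel code: `struct pkt`, `struct held_list`, `hold_packet()` and `MAX_HELD` are hypothetical names, and "completing" a packet is reduced to handing it back to the caller.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HELD 8	/* mirrors the 8-entry limit stated in the commit message */

/* Hypothetical stand-in for an sk_buff that is being held for merging. */
struct pkt {
	struct pkt *next;
	int id;
};

/* Hypothetical stand-in for the gro_list/gro_count pair in napi_struct. */
struct held_list {
	struct pkt *head;
	unsigned int count;
};

/*
 * Push p at the head of the list. If the list is already full, unlink the
 * tail (the oldest entry) first and return it so that the caller can
 * complete it; otherwise return NULL.
 */
static struct pkt *hold_packet(struct held_list *l, struct pkt *p)
{
	struct pkt *evicted = NULL;

	assert(l && p);

	if (l->count >= MAX_HELD) {
		struct pkt **pp = &l->head;

		/* Walk to the last node, tracking the link that points at it. */
		while ((*pp)->next)
			pp = &(*pp)->next;

		evicted = *pp;
		*pp = NULL;		/* detach the tail from the list */
		evicted->next = NULL;
	} else {
		l->count++;
	}

	p->next = l->head;
	l->head = p;
	return evicted;
}
```

Because new entries go in at the head, the tail is always the entry that has been held longest, so evicting it bounds how long any one flow can sit in the list.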
2008-12-16 07:38:52 +00:00
|
|
|
NAPI_GRO_CB(skb)->count = 1;
|
2012-10-06 08:08:49 +00:00
|
|
|
NAPI_GRO_CB(skb)->age = jiffies;
|
2014-05-16 18:34:37 +00:00
|
|
|
NAPI_GRO_CB(skb)->last = skb;
|
2009-01-29 14:19:50 +00:00
|
|
|
skb_shinfo(skb)->gso_size = skb_gro_len(skb);
|
2008-12-16 07:38:52 +00:00
|
|
|
skb->next = napi->gro_list;
|
|
|
|
napi->gro_list = skb;
|
2009-01-29 14:19:48 +00:00
|
|
|
ret = GRO_HELD;
|
2008-12-16 07:38:52 +00:00
|
|
|
|
2009-02-01 09:24:55 +00:00
|
|
|
pull:
|
2014-03-30 04:28:21 +00:00
|
|
|
grow = skb_gro_offset(skb) - skb_headlen(skb);
|
|
|
|
if (grow > 0)
|
|
|
|
gro_pull_from_frag0(skb, grow);
|
2008-12-16 07:38:52 +00:00
|
|
|
ok:
|
2009-01-29 14:19:48 +00:00
|
|
|
return ret;
|
2008-12-16 07:38:52 +00:00
|
|
|
|
|
|
|
normal:
|
2009-02-01 09:24:55 +00:00
|
|
|
ret = GRO_NORMAL;
|
|
|
|
goto pull;
|
2009-01-05 00:13:40 +00:00
|
|
|
}
|
2009-01-06 18:49:34 +00:00
|
|
|
|
net-gre-gro: Add GRE support to the GRO stack
This patch built on top of Commit 299603e8370a93dd5d8e8d800f0dff1ce2c53d36
("net-gro: Prepare GRO stack for the upcoming tunneling support") to add
the support of the standard GRE (RFC1701/RFC2784/RFC2890) to the GRO
stack. It also serves as an example for supporting other encapsulation
protocols in the GRO stack in the future.
The patch supports version 0 and all the flags (key, csum, seq#) but
will flush any pkt with the S (seq#) flag. This is because the S flag
is not support by GSO, and a GRO pkt may end up in the forwarding path,
thus requiring GSO support to break it up correctly.
Currently the "packet_offload" structure only contains L3 (ETH_P_IP/
ETH_P_IPV6) GRO offload support so the encapped pkts are limited to
IP pkts (i.e., w/o L2 hdr). But support for other protocol type can
be easily added, so is the support for GRE variations like NVGRE.
The patch also supports csum offload. Specifically if the csum flag is on
and the h/w is capable of checksumming the payload (CHECKSUM_COMPLETE),
the code will take advantage of the csum computed by the h/w when
validating the GRE csum.
Note that commit 60769a5dcd8755715c7143b4571d5c44f01796f1 "ipv4: gre:
add GRO capability" already introduces GRO capability to IPv4 GRE
tunnels, using the gro_cells infrastructure. But GRO is done after
GRE hdr has been removed (i.e., decapped). The following patch applies
GRO when pkts first come in (before hitting the GRE tunnel code). There
is some performance advantage for applying GRO as early as possible.
Also this approach is transparent to other subsystem like Open vSwitch
where GRE decap is handled outside of the IP stack hence making it
harder for the gro_cells stuff to apply. On the other hand, some NICs
are still not capable of hashing on the inner hdr of a GRE pkt (RSS).
In that case the GRO processing of pkts from the same remote host will
all happen on the same CPU and the performance may be suboptimal.
I'm including some rough preliminary performance numbers below. Note
that the performance will be highly dependent on traffic load, mix as
usual. Moreover it also depends on NIC offload features hence the
following is by no means a comprehensive study. Local testing and tuning
will be needed to decide the best setting.
All tests spawned 50 copies of netperf TCP_STREAM and ran for 30 secs.
(super_netperf 50 -H 192.168.1.18 -l 30)
An IP GRE tunnel with only the key flag on (e.g., ip tunnel add gre1
mode gre local 10.246.17.18 remote 10.246.17.17 ttl 255 key 123)
is configured.
The GRO support for pkts AFTER decap is controlled through the device
feature of the GRE device (e.g., ethtool -K gre1 gro on/off).
1.1 ethtool -K gre1 gro off; ethtool -K eth0 gro off
    thruput: 9.16Gbps
    CPU utilization: 19%
1.2 ethtool -K gre1 gro on; ethtool -K eth0 gro off
    thruput: 5.9Gbps
    CPU utilization: 15%
1.3 ethtool -K gre1 gro off; ethtool -K eth0 gro on
    thruput: 9.26Gbps
    CPU utilization: 12-13%
1.4 ethtool -K gre1 gro on; ethtool -K eth0 gro on
    thruput: 9.26Gbps
    CPU utilization: 10%
The following tests were performed on a different NIC that is capable of
csum offload. I.e., the h/w is capable of computing IP payload csum
(CHECKSUM_COMPLETE).
2.1 ethtool -K gre1 gro on (hence will use gro_cells)
  2.1.1 ethtool -K eth0 gro off; csum offload disabled
        thruput: 8.53Gbps
        CPU utilization: 9%
  2.1.2 ethtool -K eth0 gro off; csum offload enabled
        thruput: 8.97Gbps
        CPU utilization: 7-8%
  2.1.3 ethtool -K eth0 gro on; csum offload disabled
        thruput: 8.83Gbps
        CPU utilization: 5-6%
  2.1.4 ethtool -K eth0 gro on; csum offload enabled
        thruput: 8.98Gbps
        CPU utilization: 5%
2.2 ethtool -K gre1 gro off
  2.2.1 ethtool -K eth0 gro off; csum offload disabled
        thruput: 5.93Gbps
        CPU utilization: 9%
  2.2.2 ethtool -K eth0 gro off; csum offload enabled
        thruput: 5.62Gbps
        CPU utilization: 8%
  2.2.3 ethtool -K eth0 gro on; csum offload disabled
        thruput: 7.69Gbps
        CPU utilization: 8%
  2.2.4 ethtool -K eth0 gro on; csum offload enabled
        thruput: 8.96Gbps
        CPU utilization: 5-6%
Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 18:23:19 +00:00
|
|
|
struct packet_offload *gro_find_receive_by_type(__be16 type)
|
|
|
|
{
|
|
|
|
struct list_head *offload_head = &offload_base;
|
|
|
|
struct packet_offload *ptype;
|
|
|
|
|
|
|
|
list_for_each_entry_rcu(ptype, offload_head, list) {
|
|
|
|
if (ptype->type != type || !ptype->callbacks.gro_receive)
|
|
|
|
continue;
|
|
|
|
return ptype;
|
|
|
|
}
|
|
|
|
return NULL;
|
|
|
|
}
|
2014-01-20 11:59:20 +00:00
|
|
|
EXPORT_SYMBOL(gro_find_receive_by_type);
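The lookup above is a linear scan that skips entries of the wrong type or without a `gro_receive` hook and returns the first match. A minimal user-space sketch of that pattern follows, assuming a plain singly linked list in place of the RCU-protected `list_head`; `struct offload`, `dummy_receive()` and `find_receive_by_type()` are hypothetical names introduced only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct packet_offload; RCU and list_head omitted. */
struct offload {
	struct offload *next;
	uint16_t type;			/* protocol type, e.g. 0x0800 for IPv4 */
	int (*gro_receive)(void);	/* NULL when the handler has no receive hook */
};

/* Hypothetical placeholder hook so that entries can be marked as GRO capable. */
static int dummy_receive(void)
{
	return 0;
}

/* Return the first entry of the requested type that has a gro_receive hook. */
static struct offload *find_receive_by_type(struct offload *head, uint16_t type)
{
	struct offload *o;

	for (o = head; o; o = o->next) {
		if (o->type != type || !o->gro_receive)
			continue;
		return o;
	}
	return NULL;
}
```

An entry with the right type but no hook is skipped rather than ending the search, so a later entry of the same type can still be returned.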
|
2014-01-07 18:23:19 +00:00
|
|
|
|
|
|
|
struct packet_offload *gro_find_complete_by_type(__be16 type)
|
|
|
|
{
|
|
|
|
struct list_head *offload_head = &offload_base;
|
|
|
|
struct packet_offload *ptype;
|
|
|
|
|
|
|
|
list_for_each_entry_rcu(ptype, offload_head, list) {
|
|
|
|
if (ptype->type != type || !ptype->callbacks.gro_complete)
|
|
|
|
continue;
|
|
|
|
return ptype;
|
|
|
|
}
|
|
|
|
return NULL;
|
|
|
|
}
|
2014-01-20 11:59:20 +00:00
|
|
|
EXPORT_SYMBOL(gro_find_complete_by_type);
|
2009-01-05 00:13:40 +00:00
|
|
|
|
2017-06-29 09:13:36 +00:00
|
|
|
static void napi_skb_free_stolen_head(struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
skb_dst_drop(skb);
|
|
|
|
secpath_reset(skb);
|
|
|
|
kmem_cache_free(skbuff_head_cache, skb);
|
|
|
|
}
|
|
|
|
|
2012-11-28 21:55:25 +00:00
|
|
|
static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
|
2009-01-05 00:13:40 +00:00
|
|
|
{
|
2009-01-29 14:19:48 +00:00
|
|
|
switch (ret) {
|
|
|
|
case GRO_NORMAL:
|
2014-01-10 22:17:24 +00:00
|
|
|
if (netif_receive_skb_internal(skb))
|
2009-10-30 04:36:53 +00:00
|
|
|
ret = GRO_DROP;
|
|
|
|
break;
|
2009-01-05 00:13:40 +00:00
|
|
|
|
2009-01-29 14:19:48 +00:00
|
|
|
case GRO_DROP:
|
2009-01-05 00:13:40 +00:00
|
|
|
kfree_skb(skb);
|
|
|
|
break;
|
2009-10-29 07:17:09 +00:00
|
|
|
|
2012-04-19 07:07:40 +00:00
|
|
|
case GRO_MERGED_FREE:
|
2017-06-29 09:13:36 +00:00
|
|
|
if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
|
|
|
|
napi_skb_free_stolen_head(skb);
|
|
|
|
else
|
2012-04-30 08:10:34 +00:00
|
|
|
__kfree_skb(skb);
|
2012-04-19 07:07:40 +00:00
|
|
|
break;
|
|
|
|
|
2009-10-29 07:17:09 +00:00
|
|
|
case GRO_HELD:
|
|
|
|
case GRO_MERGED:
|
2017-02-15 08:39:44 +00:00
|
|
|
case GRO_CONSUMED:
|
2009-10-29 07:17:09 +00:00
|
|
|
break;
|
2009-01-05 00:13:40 +00:00
|
|
|
}
|
|
|
|
|
2009-10-30 04:36:53 +00:00
|
|
|
return ret;
|
2009-01-29 14:19:48 +00:00
|
|
|
}
|
|
|
|
|
2009-10-30 04:36:53 +00:00
|
|
|
gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
|
2009-01-29 14:19:48 +00:00
|
|
|
{
|
2015-11-18 14:30:59 +00:00
|
|
|
skb_mark_napi_id(skb, napi);
|
2014-01-10 22:17:24 +00:00
|
|
|
trace_napi_gro_receive_entry(skb);
|
2009-01-29 14:19:50 +00:00
|
|
|
|
2014-03-30 04:28:21 +00:00
|
|
|
skb_gro_reset_offset(skb);
|
|
|
|
|
2012-12-10 13:28:16 +00:00
|
|
|
return napi_skb_finish(dev_gro_receive(napi, skb), skb);
|
2008-12-16 07:38:52 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(napi_gro_receive);
|
|
|
|
|
2010-10-19 07:12:10 +00:00
|
|
|
static void napi_reuse_skb(struct napi_struct *napi, struct sk_buff *skb)
|
2009-01-06 18:49:34 +00:00
|
|
|
{
|
2014-10-23 13:30:30 +00:00
|
|
|
if (unlikely(skb->pfmemalloc)) {
|
|
|
|
consume_skb(skb);
|
|
|
|
return;
|
|
|
|
}
|
2009-01-06 18:49:34 +00:00
|
|
|
__skb_pull(skb, skb_headlen(skb));
|
2012-03-21 06:58:03 +00:00
|
|
|
/* restore the reserve we had after netdev_alloc_skb_ip_align() */
|
|
|
|
skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN - skb_headroom(skb));
|
2010-10-20 13:56:06 +00:00
|
|
|
skb->vlan_tci = 0;
|
2011-01-30 04:44:54 +00:00
|
|
|
skb->dev = napi->dev;
|
2011-02-02 22:53:25 +00:00
|
|
|
skb->skb_iif = 0;
|
2014-07-14 22:54:46 +00:00
|
|
|
skb->encapsulation = 0;
|
|
|
|
skb_shinfo(skb)->gso_type = 0;
|
2014-04-03 16:28:10 +00:00
|
|
|
skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
|
2017-01-30 05:45:38 +00:00
|
|
|
secpath_reset(skb);
|
2009-01-06 18:49:34 +00:00
|
|
|
|
|
|
|
napi->skb = skb;
|
|
|
|
}
|
|
|
|
|
2009-04-16 09:02:07 +00:00
|
|
|
struct sk_buff *napi_get_frags(struct napi_struct *napi)
|
2009-01-05 00:13:40 +00:00
|
|
|
{
|
|
|
|
struct sk_buff *skb = napi->skb;
|
|
|
|
|
|
|
|
if (!skb) {
|
2014-12-10 03:40:49 +00:00
|
|
|
skb = napi_alloc_skb(napi, GRO_MAX_HEAD);
|
2015-11-19 20:11:23 +00:00
|
|
|
if (skb) {
|
|
|
|
napi->skb = skb;
|
|
|
|
skb_mark_napi_id(skb, napi);
|
|
|
|
}
|
2009-01-29 14:19:52 +00:00
|
|
|
}
|
2009-01-06 18:49:34 +00:00
|
|
|
return skb;
|
|
|
|
}
|
2009-04-16 09:02:07 +00:00
|
|
|
EXPORT_SYMBOL(napi_get_frags);
|
2009-01-06 18:49:34 +00:00
|
|
|
|
2014-03-30 04:28:21 +00:00
|
|
|
static gro_result_t napi_frags_finish(struct napi_struct *napi,
|
|
|
|
struct sk_buff *skb,
|
|
|
|
gro_result_t ret)
|
2009-01-06 18:49:34 +00:00
|
|
|
{
|
2009-01-29 14:19:48 +00:00
|
|
|
switch (ret) {
|
|
|
|
case GRO_NORMAL:
|
2014-03-30 04:28:21 +00:00
|
|
|
case GRO_HELD:
|
|
|
|
__skb_push(skb, ETH_HLEN);
|
|
|
|
skb->protocol = eth_type_trans(skb, skb->dev);
|
|
|
|
if (ret == GRO_NORMAL && netif_receive_skb_internal(skb))
|
2009-10-30 04:36:53 +00:00
|
|
|
ret = GRO_DROP;
|
2009-01-29 14:19:50 +00:00
|
|
|
break;
|
2009-01-05 00:13:40 +00:00
|
|
|
|
2009-01-29 14:19:48 +00:00
|
|
|
case GRO_DROP:
|
|
|
|
napi_reuse_skb(napi, skb);
|
|
|
|
break;
|
2009-10-29 07:17:09 +00:00
|
|
|
|
2017-06-29 09:13:36 +00:00
|
|
|
case GRO_MERGED_FREE:
|
|
|
|
if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
|
|
|
|
napi_skb_free_stolen_head(skb);
|
|
|
|
else
|
|
|
|
napi_reuse_skb(napi, skb);
|
|
|
|
break;
|
|
|
|
|
2009-10-29 07:17:09 +00:00
|
|
|
case GRO_MERGED:
|
2017-02-15 08:39:44 +00:00
|
|
|
case GRO_CONSUMED:
|
2009-10-29 07:17:09 +00:00
|
|
|
break;
|
2009-01-29 14:19:48 +00:00
|
|
|
}
|
2009-01-05 00:13:40 +00:00
|
|
|
|
2009-10-30 04:36:53 +00:00
|
|
|
return ret;
|
2009-01-05 00:13:40 +00:00
|
|
|
}
|
2009-01-29 14:19:48 +00:00
|
|
|
|
2014-03-30 04:28:21 +00:00
|
|
|
/* Upper GRO stack assumes network header starts at gro_offset=0
|
|
|
|
* Drivers could call both napi_gro_frags() and napi_gro_receive()
|
|
|
|
* We copy ethernet header into skb->data to have a common layout.
|
|
|
|
*/
|
2012-05-18 20:49:06 +00:00
|
|
|
static struct sk_buff *napi_frags_skb(struct napi_struct *napi)
|
2009-04-16 09:02:07 +00:00
|
|
|
{
|
|
|
|
struct sk_buff *skb = napi->skb;
|
2014-03-30 04:28:21 +00:00
|
|
|
const struct ethhdr *eth;
|
|
|
|
unsigned int hlen = sizeof(*eth);
|
2009-04-16 09:02:07 +00:00
|
|
|
|
|
|
|
napi->skb = NULL;
|
|
|
|
|
2014-03-30 04:28:21 +00:00
|
|
|
skb_reset_mac_header(skb);
|
|
|
|
skb_gro_reset_offset(skb);
|
|
|
|
|
|
|
|
eth = skb_gro_header_fast(skb, 0);
|
|
|
|
if (unlikely(skb_gro_header_hard(skb, hlen))) {
|
|
|
|
eth = skb_gro_header_slow(skb, hlen, 0);
|
|
|
|
if (unlikely(!eth)) {
|
2016-04-02 19:26:43 +00:00
|
|
|
net_warn_ratelimited("%s: dropping impossible skb from %s\n",
|
|
|
|
__func__, napi->dev->name);
|
2014-03-30 04:28:21 +00:00
|
|
|
napi_reuse_skb(napi, skb);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
gro_pull_from_frag0(skb, hlen);
|
|
|
|
NAPI_GRO_CB(skb)->frag0 += hlen;
|
|
|
|
NAPI_GRO_CB(skb)->frag0_len -= hlen;
|
2009-04-16 09:02:07 +00:00
|
|
|
}
|
2014-03-30 04:28:21 +00:00
|
|
|
__skb_pull(skb, hlen);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This works because the only protocols we care about don't require
|
|
|
|
* special handling.
|
|
|
|
* We'll fix it up properly in napi_frags_finish()
|
|
|
|
*/
|
|
|
|
skb->protocol = eth->h_proto;
|
2009-04-16 09:02:07 +00:00
|
|
|
|
|
|
|
return skb;
|
|
|
|
}
|
|
|
|
|
2009-10-30 04:36:53 +00:00
|
|
|
gro_result_t napi_gro_frags(struct napi_struct *napi)
|
2009-01-29 14:19:48 +00:00
|
|
|
{
|
2009-04-16 09:02:07 +00:00
|
|
|
struct sk_buff *skb = napi_frags_skb(napi);
|
2009-01-29 14:19:48 +00:00
|
|
|
|
|
|
|
if (!skb)
|
2009-10-30 04:36:53 +00:00
|
|
|
return GRO_DROP;
|
2009-01-29 14:19:48 +00:00
|
|
|
|
2014-01-10 22:17:24 +00:00
|
|
|
trace_napi_gro_frags_entry(skb);
|
|
|
|
|
2012-12-10 13:28:16 +00:00
|
|
|
return napi_frags_finish(napi, skb, dev_gro_receive(napi, skb));
|
2009-01-29 14:19:48 +00:00
|
|
|
}
|
2009-01-05 00:13:40 +00:00
|
|
|
EXPORT_SYMBOL(napi_gro_frags);
|
|
|
|
|
2014-08-22 20:33:47 +00:00
|
|
|
/* Compute the checksum from gro_offset and return the folded value
|
|
|
|
* after adding in any pseudo checksum.
|
|
|
|
*/
|
|
|
|
__sum16 __skb_gro_checksum_complete(struct sk_buff *skb)
|
|
|
|
{
|
|
|
|
__wsum wsum;
|
|
|
|
__sum16 sum;
|
|
|
|
|
|
|
|
wsum = skb_checksum(skb, skb_gro_offset(skb), skb_gro_len(skb), 0);
|
|
|
|
|
|
|
|
/* NAPI_GRO_CB(skb)->csum holds pseudo checksum */
|
|
|
|
sum = csum_fold(csum_add(NAPI_GRO_CB(skb)->csum, wsum));
|
|
|
|
if (likely(!sum)) {
|
|
|
|
if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
|
|
|
|
!skb->csum_complete_sw)
|
|
|
|
netdev_rx_csum_fault(skb->dev);
|
|
|
|
}
|
|
|
|
|
|
|
|
NAPI_GRO_CB(skb)->csum = wsum;
|
|
|
|
NAPI_GRO_CB(skb)->csum_valid = 1;
|
|
|
|
|
|
|
|
return sum;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(__skb_gro_checksum_complete);
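The function above adds a partial sum over the payload to a previously stored pseudo checksum, folds the result to 16 bits, and treats a folded value of zero as "checksum valid". The arithmetic can be sketched in portable C as below. This is an illustrative model of ones-complement summation, not the kernel's architecture-specific `csum_add()`, `csum_fold()` or `skb_checksum()`; the names `csum_add32()`, `csum_fold16()` and `csum_partial_be()` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Ones-complement addition of two 32-bit partial sums (end-around carry). */
static uint32_t csum_add32(uint32_t a, uint32_t b)
{
	uint32_t s = a + b;

	return s + (s < b);	/* add the carry back in if the sum wrapped */
}

/* Fold a 32-bit partial sum to 16 bits and complement it. */
static uint16_t csum_fold16(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);	/* absorb the carry of the first fold */
	return (uint16_t)~sum;
}

/*
 * Accumulate buf into sum as big-endian 16-bit words. A trailing odd byte
 * is treated as the high byte of a word padded with zero.
 */
static uint32_t csum_partial_be(const uint8_t *buf, size_t len, uint32_t sum)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum = csum_add32(sum, ((uint32_t)buf[i] << 8) | buf[i + 1]);
	if (len & 1)
		sum = csum_add32(sum, (uint32_t)buf[len - 1] << 8);
	return sum;
}
```

Summing a buffer that already contains a correct checksum field yields a folded result of 0, which corresponds to the `!sum` test above. The partial sums of two even-length pieces can also be combined with `csum_add32()` before folding, which is how a separately computed pseudo checksum is merged with the payload sum.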
|
|
|
|
|
2017-06-09 08:54:58 +00:00
|
|
|
static void net_rps_send_ipi(struct softnet_data *remsd)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_RPS
|
|
|
|
while (remsd) {
|
|
|
|
struct softnet_data *next = remsd->rps_ipi_next;
|
|
|
|
|
|
|
|
if (cpu_online(remsd->cpu))
|
|
|
|
smp_call_function_single_async(remsd->cpu, &remsd->csd);
|
|
|
|
remsd = next;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
}
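The loop above reads `rps_ipi_next` into a local before triggering the remote CPU, presumably because the entry may be reused once the IPI has been sent. A minimal user-space sketch of that "save next, then hand off" traversal follows; `struct remote`, `kick()` and `send_all()` are hypothetical names, and the handler that clears the link only models a remote side that is free to reuse the node.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-CPU entries chained on rps_ipi_list. */
struct remote {
	struct remote *next;
	int online;	/* models cpu_online() */
	int kicked;	/* set once the "IPI" has been delivered */
};

/* Models the remote handler, which is free to reuse the node's link. */
static void kick(struct remote *r)
{
	r->kicked = 1;
	r->next = NULL;
}

/* Kick every online entry and return how many were kicked. */
static unsigned int send_all(struct remote *r)
{
	unsigned int n = 0;

	while (r) {
		struct remote *next = r->next;	/* read before kick() can clobber it */

		if (r->online) {
			kick(r);
			n++;
		}
		r = next;
	}
	return n;
}
```

Reading `r->next` after `kick()` instead would stop the walk at the first online entry in this model, which is the failure the local copy avoids.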
|
|
|
|
|
2010-04-22 07:22:45 +00:00
|
|
|
/*
|
2013-12-31 20:34:50 +00:00
|
|
|
* net_rps_action_and_irq_enable sends any pending IPI's for rps.
|
2010-04-22 07:22:45 +00:00
|
|
|
* Note: called with local irq disabled, but exits with local irq enabled.
|
|
|
|
*/
|
|
|
|
static void net_rps_action_and_irq_enable(struct softnet_data *sd)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_RPS
|
|
|
|
struct softnet_data *remsd = sd->rps_ipi_list;
|
|
|
|
|
|
|
|
if (remsd) {
|
|
|
|
sd->rps_ipi_list = NULL;
|
|
|
|
|
|
|
|
local_irq_enable();
|
|
|
|
|
|
|
|
/* Send pending IPI's to kick RPS processing on remote cpus. */
|
2017-06-09 08:54:58 +00:00
|
|
|
net_rps_send_ipi(remsd);
|
2010-04-22 07:22:45 +00:00
|
|
|
} else
|
|
|
|
#endif
|
|
|
|
local_irq_enable();
|
|
|
|
}
|
|
|
|
|
net: less interrupt masking in NAPI
net_rx_action() can mask irqs a single time to transfer sd->poll_list
into a private list, for a very short duration.
Then, napi_complete() can avoid masking irqs again,
and net_rx_action() only needs to mask irq again in slow path.
This patch removes 2 couples of irq mask/unmask per typical NAPI run,
more if multiple napi were triggered.
Note this also allows giving control back to the caller (do_softirq())
more often, so that other softirq handlers can be called a bit earlier,
or ksoftirqd can be woken up earlier under pressure.
This was developed while testing an alternative to RX interrupt
mitigation to reduce latencies while keeping or improving GRO
aggregation on fast NIC.
Idea is to test napi->gro_list at the end of a napi->poll() and
reschedule one NAPI poll, but after servicing a full round of
softirqs (timers, TX, rcu, ...). This will be allowed only if softirq
is currently serviced by idle task or ksoftirqd, and resched not needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-02 14:19:33 +00:00
|
|
|
static bool sd_has_rps_ipi_waiting(struct softnet_data *sd)
{
#ifdef CONFIG_RPS
	return sd->rps_ipi_list != NULL;
#else
	return false;
#endif
}
|
|
|
|
|
[NET]: Make NAPI polling independent of struct net_device objects.
Several devices have multiple independent RX queues per net
device, and some have a single interrupt doorbell for several
queues.
In either case, it's easier to support layouts like that if the
structure representing the poll is independent from the net
device itself.
The signature of the ->poll() call back goes from:
int foo_poll(struct net_device *dev, int *budget)
to
int foo_poll(struct napi_struct *napi, int budget)
The caller is returned the number of RX packets processed (or
the number of "NAPI credits" consumed if you want to get
abstract). The callee no longer messes around bumping
dev->quota, *budget, etc. because that is all handled in the
caller upon return.
The napi_struct is to be embedded in the device driver private data
structures.
Furthermore, it is the driver's responsibility to disable all NAPI
instances in its ->stop() device close handler. Since the
napi_struct is privatized into the driver's private data structures,
only the driver knows how to get at all of the napi_struct instances
it may have per-device.
With lots of help and suggestions from Rusty Russell, Roland Dreier,
Michael Chan, Jeff Garzik, and Jamal Hadi Salim.
Bug fixes from Thomas Graf, Roland Dreier, Peter Zijlstra,
Joseph Fannin, Scott Wood, Hans J. Koch, and Michael Chan.
[ Ported to current tree and all drivers converted. Integrated
Stephen's follow-on kerneldoc additions, and restored poll_list
handling to the old style to fix mutual exclusion issues. -DaveM ]
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-03 23:41:36 +00:00
|
|
|
static int process_backlog(struct napi_struct *napi, int quota)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2010-05-07 05:07:48 +00:00
|
|
|
struct softnet_data *sd = container_of(napi, struct softnet_data, backlog);
|
2016-08-25 13:58:44 +00:00
|
|
|
bool again = true;
|
|
|
|
int work = 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2010-04-22 07:22:45 +00:00
|
|
|
/* Check if we have pending IPIs; it's better to send them now
* rather than waiting for net_rx_action() to end.
|
|
|
|
*/
|
2014-11-02 14:19:33 +00:00
|
|
|
if (sd_has_rps_ipi_waiting(sd)) {
|
2010-04-22 07:22:45 +00:00
|
|
|
local_irq_disable();
|
|
|
|
net_rps_action_and_irq_enable(sd);
|
|
|
|
}
|
2014-11-02 14:19:33 +00:00
|
|
|
|
2016-12-29 20:37:21 +00:00
|
|
|
napi->weight = dev_rx_weight;
|
2016-08-25 13:58:44 +00:00
|
|
|
while (again) {
|
2005-04-16 22:20:36 +00:00
|
|
|
struct sk_buff *skb;
|
2010-04-27 22:07:33 +00:00
|
|
|
|
|
|
|
while ((skb = __skb_dequeue(&sd->process_queue))) {
|
2015-07-09 06:59:10 +00:00
|
|
|
rcu_read_lock();
|
2010-04-27 22:07:33 +00:00
|
|
|
__netif_receive_skb(skb);
|
2015-07-09 06:59:10 +00:00
|
|
|
rcu_read_unlock();
|
2010-05-20 18:37:59 +00:00
|
|
|
input_queue_head_incr(sd);
|
2016-08-25 13:58:44 +00:00
|
|
|
if (++work >= quota)
|
2010-05-20 18:37:59 +00:00
|
|
|
return work;
|
2016-08-25 13:58:44 +00:00
|
|
|
|
2010-04-27 22:07:33 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2016-08-25 13:58:44 +00:00
|
|
|
local_irq_disable();
|
2010-04-19 21:17:14 +00:00
|
|
|
rps_lock(sd);
|
2014-06-30 16:50:40 +00:00
|
|
|
if (skb_queue_empty(&sd->input_pkt_queue)) {
|
2010-05-07 05:07:48 +00:00
|
|
|
/*
|
|
|
|
* Inline a custom version of __napi_complete().
|
|
|
|
* only current cpu owns and manipulates this napi,
|
2014-06-30 16:50:40 +00:00
|
|
|
* and NAPI_STATE_SCHED is the only possible flag set
|
|
|
|
* on backlog.
|
|
|
|
* We can use a plain write instead of clear_bit(),
|
2010-05-07 05:07:48 +00:00
|
|
|
* and we don't need an smp_mb() memory barrier.
|
|
|
|
*/
|
|
|
|
napi->state = 0;
|
2016-08-25 13:58:44 +00:00
|
|
|
again = false;
|
|
|
|
} else {
|
|
|
|
skb_queue_splice_tail_init(&sd->input_pkt_queue,
|
|
|
|
&sd->process_queue);
|
2007-10-03 23:41:36 +00:00
|
|
|
}
|
2010-04-19 21:17:14 +00:00
|
|
|
rps_unlock(sd);
|
2016-08-25 13:58:44 +00:00
|
|
|
local_irq_enable();
|
2010-04-27 22:07:33 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2007-10-03 23:41:36 +00:00
|
|
|
return work;
|
|
|
|
}
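
The two-queue pattern used by process_backlog() can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel code: struct backlog, backlog_enqueue() and backlog_poll() are hypothetical names, a C11 atomic_flag spinlock stands in for local_irq_disable() plus rps_lock(), and a `scheduled` flag stands in for NAPI_STATE_SCHED. Producers only touch the locked input queue. The consumer drains a private queue without the lock and takes it only to refill or to mark itself idle.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct pkt { struct pkt *next; };
struct pkt_queue { struct pkt *head, *tail; };

struct backlog {			/* hypothetical stand-in for softnet_data */
	atomic_flag lock;		/* protects input and scheduled */
	struct pkt_queue input;		/* filled by producers */
	struct pkt_queue process;	/* private to the consumer */
	bool scheduled;			/* plays the role of NAPI_STATE_SCHED */
	int handled;
};

static void queue_push(struct pkt_queue *q, struct pkt *p)
{
	p->next = NULL;
	if (q->tail)
		q->tail->next = p;
	else
		q->head = p;
	q->tail = p;
}

static struct pkt *queue_pop(struct pkt_queue *q)
{
	struct pkt *p = q->head;

	if (p && !(q->head = p->next))
		q->tail = NULL;
	return p;
}

/* Move all of src to the tail of dst and leave src empty. */
static void queue_splice_tail_init(struct pkt_queue *src, struct pkt_queue *dst)
{
	if (!src->head)
		return;
	if (dst->tail)
		dst->tail->next = src->head;
	else
		dst->head = src->head;
	dst->tail = src->tail;
	src->head = src->tail = NULL;
}

static void backlog_lock(struct backlog *b)
{
	while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
		;
}

static void backlog_unlock(struct backlog *b)
{
	atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

static void backlog_enqueue(struct backlog *b, struct pkt *p)
{
	backlog_lock(b);
	queue_push(&b->input, p);
	b->scheduled = true;
	backlog_unlock(b);
}

/* Handle up to quota packets; returns the number handled. */
static int backlog_poll(struct backlog *b, int quota)
{
	bool again = true;
	int work = 0;

	while (again) {
		while (queue_pop(&b->process)) {	/* lock-free drain */
			b->handled++;
			if (++work >= quota)
				return work;	/* budget used, stay scheduled */
		}
		backlog_lock(b);
		if (!b->input.head) {
			b->scheduled = false;	/* nothing left: go idle */
			again = false;
		} else {
			queue_splice_tail_init(&b->input, &b->process);
		}
		backlog_unlock(b);
	}
	return work;
}
```

Returning early when the budget is exhausted leaves the remaining packets on the private queue and the `scheduled` flag set, so the next poll picks up where this one stopped.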
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2007-10-03 23:41:36 +00:00
|
|
|
/**
|
|
|
|
* __napi_schedule - schedule for receive
|
2007-10-13 04:17:49 +00:00
|
|
|
* @n: entry to schedule
|
2007-10-03 23:41:36 +00:00
|
|
|
*
|
2014-10-29 01:05:13 +00:00
|
|
|
* The entry's receive function will be scheduled to run.
|
|
|
|
* Consider using __napi_schedule_irqoff() if hard irqs are masked.
|
2007-10-03 23:41:36 +00:00
|
|
|
*/
|
2008-02-13 23:03:16 +00:00
|
|
|
void __napi_schedule(struct napi_struct *n)
|
2007-10-03 23:41:36 +00:00
|
|
|
{
|
|
|
|
unsigned long flags;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2007-10-03 23:41:36 +00:00
|
|
|
local_irq_save(flags);
|
2014-08-17 17:30:35 +00:00
|
|
|
____napi_schedule(this_cpu_ptr(&softnet_data), n);
|
2007-10-03 23:41:36 +00:00
|
|
|
local_irq_restore(flags);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2007-10-03 23:41:36 +00:00
|
|
|
EXPORT_SYMBOL(__napi_schedule);
|
|
|
|
|
net: solve a NAPI race
While playing with mlx4 hardware timestamping of RX packets, I found
that some packets were received by TCP stack with a ~200 ms delay...
Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in BH handler, I was confident it was not
a sender issue, or a drop in the network.
This would happen with a very low probability, but hurting RPC
workloads.
A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.
Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
while IRQ are not disabled, we might have later an IRQ firing and
finding this bit set, right before napi_complete_done() clears it.
This can happen with busy polling users, or if gro_flush_timeout is
used. But some other uses of napi_schedule() in drivers can cause this
as well.
thread 1                                 thread 2 (could be on same cpu, or not)

// busy polling or napi_watchdog()
napi_schedule();
...
napi->poll()

device polling:
read 2 packets from ring buffer
                                         Additional 3rd packet is
                                         available.
                                         device hard irq

                                         // does nothing because
                                         // NAPI_STATE_SCHED bit is owned
                                         // by thread 1
                                         napi_schedule();

napi_complete_done(napi, 2);
rearm_irq();
Note that rearm_irq() will not force the device to send an additional
IRQ for the packet it already signaled (3rd packet in my example)
This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
can set if it could not grab NAPI_STATE_SCHED
Then napi_complete_done() properly reschedules the napi to make sure
we do not miss something.
Since we manipulate multiple bits at once, use cmpxchg() like in
sk_busy_loop() to provide proper transactions.
In v2, I changed napi_watchdog() to use a relaxed variant of
napi_schedule_prep() : No need to set NAPI_STATE_MISSED from this point.
In v3, I added more details in the changelog and clears
NAPI_STATE_MISSED in busy_poll_stop()
In v4, I added the ideas given by Alexander Duyck in v3 review
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-28 18:34:50 +00:00
|
|
|
/**
 * napi_schedule_prep - check if napi can be scheduled
 * @n: napi context
 *
 * Test if NAPI routine is already running, and if not mark
 * it as running. This is used as a condition variable to
 * ensure only one NAPI poll instance runs. We also make
 * sure there is no pending NAPI disable.
 */
bool napi_schedule_prep(struct napi_struct *n)
{
	unsigned long val, new;

	do {
		val = READ_ONCE(n->state);
		if (unlikely(val & NAPIF_STATE_DISABLE))
			return false;
		new = val | NAPIF_STATE_SCHED;

		/* Sets STATE_MISSED bit if STATE_SCHED was already set.
		 * This was suggested by Alexander Duyck, as compiler
		 * emits better code than :
		 * if (val & NAPIF_STATE_SCHED)
		 *     new |= NAPIF_STATE_MISSED;
		 */
		new |= (val & NAPIF_STATE_SCHED) / NAPIF_STATE_SCHED *
						   NAPIF_STATE_MISSED;
	} while (cmpxchg(&n->state, val, new) != val);

	return !(val & NAPIF_STATE_SCHED);
}
|
|
|
|
EXPORT_SYMBOL(napi_schedule_prep);
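
The state transition performed by napi_schedule_prep() can be exercised in userspace with C11 atomics. This is a minimal sketch, not the kernel code: schedule_prep() and the STATE_* bit values are hypothetical, and atomic_compare_exchange_weak() stands in for the kernel's cmpxchg(). It shows the branch-free expression that sets the MISSED bit only when SCHED was already owned by someone else.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed bit layout, for illustration only. */
#define STATE_SCHED	(1UL << 0)
#define STATE_MISSED	(1UL << 1)
#define STATE_DISABLE	(1UL << 2)

/* Returns true if the caller acquired STATE_SCHED and may schedule a poll. */
static bool schedule_prep(_Atomic unsigned long *state)
{
	unsigned long val, new;

	val = atomic_load(state);
	do {
		if (val & STATE_DISABLE)
			return false;
		new = val | STATE_SCHED;

		/* (val & SCHED) / SCHED is 1 when SCHED was set, else 0,
		 * so MISSED is added only if SCHED was already taken.
		 */
		new |= (val & STATE_SCHED) / STATE_SCHED * STATE_MISSED;

		/* On failure val is reloaded and the loop recomputes new. */
	} while (!atomic_compare_exchange_weak(state, &val, new));

	return !(val & STATE_SCHED);
}
```

A caller that gets false back while the state is not disabled has recorded its request in STATE_MISSED, which is what lets the completion path notice that it must reschedule instead of going idle.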
|
|
|
|
|
2014-10-29 01:05:13 +00:00
|
|
|
/**
 * __napi_schedule_irqoff - schedule for receive
 * @n: entry to schedule
 *
 * Variant of __napi_schedule() assuming hard irqs are masked
 */
void __napi_schedule_irqoff(struct napi_struct *n)
{
	____napi_schedule(this_cpu_ptr(&softnet_data), n);
}
EXPORT_SYMBOL(__napi_schedule_irqoff);
|
|
|
|
|
2016-11-15 18:15:13 +00:00
|
|
|
bool napi_complete_done(struct napi_struct *n, int work_done)
|
net: Add Generic Receive Offload infrastructure
This patch adds the top-level GRO (Generic Receive Offload) infrastructure.
This is pretty similar to LRO except that this is protocol-independent.
Instead of holding packets in an lro_mgr structure, they're now held in
napi_struct.
For drivers that intend to use this, they can set the NETIF_F_GRO bit and
call napi_gro_receive instead of netif_receive_skb or just call netif_rx.
The latter will call napi_receive_skb automatically. When napi_gro_receive
is used, the driver must either call napi_complete/napi_rx_complete, or
call napi_gro_flush in softirq context if the driver uses the primitives
__napi_complete/__napi_rx_complete.
Protocols will set the gro_receive and gro_complete function pointers in
order to participate in this scheme.
In addition to the packet, gro_receive will get a list of currently held
packets. Each packet in the list has a same_flow field which is non-zero
if it is a potential match for the new packet. For each packet that may
match, they also have a flush field which is non-zero if the held packet
must not be merged with the new packet.
Once gro_receive has determined that the new skb matches a held packet,
the held packet may be processed immediately if the new skb cannot be
merged with it. In this case gro_receive should return the pointer to
the existing skb in gro_list. Otherwise the new skb should be merged into
the existing packet and NULL should be returned, unless the new skb makes
it impossible for any further merges to be made (e.g., FIN packet) where
the merged skb should be returned.
Whenever the skb is merged into an existing entry, the gro_receive
function should set NAPI_GRO_CB(skb)->same_flow. Note that if an skb
merely matches an existing entry but can't be merged with it, then
this shouldn't be set.
If gro_receive finds it pointless to hold the new skb for future merging,
it should set NAPI_GRO_CB(skb)->flush.
Held packets will be flushed by napi_gro_flush which is called by
napi_complete and napi_rx_complete.
Currently held packets are stored in a singly linked list just like LRO.
The list is limited to a maximum of 8 entries. In future, this may be
expanded to use a hash table to allow more flows to be held for merging.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-16 07:38:52 +00:00
|
|
|
{
|
2017-02-28 18:34:50 +00:00
|
|
|
unsigned long flags, val, new;
|
net: Add Generic Receive Offload infrastructure
This patch adds the top-level GRO (Generic Receive Offload) infrastructure.
This is pretty similar to LRO except that this is protocol-independent.
Instead of holding packets in an lro_mgr structure, they're now held in
napi_struct.
Drivers that intend to use this can set the NETIF_F_GRO bit and
call napi_gro_receive instead of netif_receive_skb or just call netif_rx.
The latter will call napi_receive_skb automatically. When napi_gro_receive
is used, the driver must either call napi_complete/napi_rx_complete, or
call napi_gro_flush in softirq context if the driver uses the primitives
__napi_complete/__napi_rx_complete.
Protocols will set the gro_receive and gro_complete function pointers in
order to participate in this scheme.
In addition to the packet, gro_receive will get a list of currently held
packets. Each packet in the list has a same_flow field which is non-zero
if it is a potential match for the new packet. Each packet that may
match also has a flush field which is non-zero if the held packet
must not be merged with the new packet.
Once gro_receive has determined that the new skb matches a held packet,
the held packet may be processed immediately if the new skb cannot be
merged with it. In this case gro_receive should return the pointer to
the existing skb in gro_list. Otherwise the new skb should be merged into
the existing packet and NULL should be returned, unless the new skb makes
it impossible for any further merges to be made (e.g., FIN packet) where
the merged skb should be returned.
Whenever the skb is merged into an existing entry, the gro_receive
function should set NAPI_GRO_CB(skb)->same_flow. Note that if an skb
merely matches an existing entry but can't be merged with it, then
this shouldn't be set.
If gro_receive finds it pointless to hold the new skb for future merging,
it should set NAPI_GRO_CB(skb)->flush.
Held packets will be flushed by napi_gro_flush which is called by
napi_complete and napi_rx_complete.
Currently held packets are stored in a singly linked list just like LRO.
The list is limited to a maximum of 8 entries. In the future, this may be
expanded to use a hash table to allow more flows to be held for merging.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
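The hold-and-merge behaviour described above can be illustrated with a toy model. Everything here is an assumption made for illustration: the `sk_` names, the integer flow key and the `fin` flag are hypothetical, and the real gro_receive contract (same_flow, flush, the returned skb pointer) is richer than this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

#define SK_MAX_HELD 8	/* the changelog's limit on held packets */

struct sk_pkt {
	unsigned int flow;	/* hypothetical flow key */
	unsigned int segs;	/* segments merged into this packet so far */
	bool fin;		/* no further merges are possible after this one */
	struct sk_pkt *next;
};

struct sk_gro {
	struct sk_pkt *held;	/* singly linked list of held packets */
	unsigned int count;
};

enum sk_verdict { SK_MERGED, SK_HELD, SK_NORMAL };

/*
 * Offer packet p to the held list. When a merge ends a flow, the merged
 * packet is unlinked and handed back through *flushed for delivery.
 */
static enum sk_verdict sk_gro_receive(struct sk_gro *g, struct sk_pkt *p,
				      struct sk_pkt **flushed)
{
	struct sk_pkt **pp;

	*flushed = NULL;
	for (pp = &g->held; *pp; pp = &(*pp)->next) {
		struct sk_pkt *h = *pp;

		if (h->flow != p->flow)
			continue;
		h->segs += p->segs;	/* merge into the held packet */
		if (p->fin) {
			*pp = h->next;	/* unlink: nothing more can be merged */
			h->next = NULL;
			g->count--;
			*flushed = h;
		}
		return SK_MERGED;
	}

	/* Pointless to hold: the flow ends here, or the list is full. */
	if (p->fin || g->count >= SK_MAX_HELD)
		return SK_NORMAL;

	p->next = g->held;
	g->held = p;
	g->count++;
	return SK_HELD;
}
```

A caller would deliver `*flushed` up the stack when it is non-NULL, and deliver `p` directly on SK_NORMAL.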
2008-12-16 07:38:52 +00:00
|
|
|
|
|
|
|
/*
|
2016-11-15 18:15:11 +00:00
|
|
|
* 1) Don't let napi dequeue from the cpu poll list
|
|
|
|
* just in case it's running on a different cpu.
|
|
|
|
* 2) If we are busy polling, do nothing here, we have
|
|
|
|
* the guarantee we will be called later.
|
2008-12-16 07:38:52 +00:00
|
|
|
*/
|
2016-11-15 18:15:11 +00:00
|
|
|
if (unlikely(n->state & (NAPIF_STATE_NPSVC |
|
|
|
|
NAPIF_STATE_IN_BUSY_POLL)))
|
2016-11-15 18:15:13 +00:00
|
|
|
return false;
|
2008-12-16 07:38:52 +00:00
|
|
|
|
net: gro: add a per device gro flush timer
Tuning coalescing parameters on NIC can be really hard.
Servers can handle both bulk and RPC like traffic, with conflicting
goals : bulk flows want as big GRO packets as possible, RPC want minimal
latencies.
To reach big GRO packets on 10Gbe NIC, one can use :
ethtool -C eth0 rx-usecs 4 rx-frames 44
But this penalizes rpc sessions, with an increase of latencies, up to
50% in some cases, as NICs generally do not force an interrupt when
a packet with TCP Push flag is received.
Some NICs do not have an absolute timer, only a timer rearmed for every
incoming packet.
This patch uses a different strategy: let the GRO stack decide what to do,
based on traffic pattern.
Packets with Push flag won't be delayed.
Packets without Push flag might be held in GRO engine, if we keep
receiving data.
This new mechanism is off by default, and shall be enabled by setting
/sys/class/net/ethX/gro_flush_timeout to a value in nanoseconds.
To fully enable this mechanism, drivers should use napi_complete_done()
instead of napi_complete().
Tested:
Ran 200 netperf TCP_STREAM from A to B (10Gbe mlx4 link, 8 RX queues)
Without this feature, we send back about 305,000 ACK per second.
GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet)
Setting a timer of 2000 nsec is enough to increase GRO packet sizes
and reduce number of ACK packets. (811/19.2 = 42)
Receiver performs fewer calls to upper stacks and fewer wakeups.
This also reduces cpu usage on the sender, as it receives fewer ACK
packets.
Note that reducing the number of wakeups increases cpu efficiency, but can
decrease QPS, as applications won't have the chance to warm up cpu caches
doing a partial read of RPC requests/answers if they fit in one skb.
B:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average: eth0 811269.80 305732.30 1199462.57 19705.72 0.00 0.00 0.50
B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
B:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average: eth0 811577.30 19230.80 1199916.51 1239.80 0.00 0.00 0.50
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
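Two pieces of this changelog lend themselves to a small sketch: the choice between flushing immediately and arming the flush timer, and the aggregation arithmetic in the Tested section. The `sk_` names are hypothetical, and the ratio follows the changelog's own approximation that each delivered GRO packet produces one ACK.

```c
#include <stdbool.h>

enum sk_action { SK_FLUSH_NOW, SK_ARM_TIMER };

/*
 * Decide what to do with held GRO packets when a poll completes.
 * The timer is only armed if the poll did some work and a non-zero
 * timeout (in nanoseconds) is configured; otherwise flush right away.
 */
static enum sk_action sk_on_poll_complete(bool work_done,
					  unsigned long gro_flush_timeout_ns)
{
	unsigned long timeout = work_done ? gro_flush_timeout_ns : 0;

	return timeout ? SK_ARM_TIMER : SK_FLUSH_NOW;
}

/* Approximate segments per GRO packet from received and ACK packet rates. */
static double sk_segs_per_gro_packet(double rx_pps, double ack_pps)
{
	return rx_pps / ack_pps;
}
```

With the sar figures quoted above, 811269.80 / 305732.30 gives roughly 2.65 segments per GRO packet without the timer, and 811577.30 / 19230.80 gives roughly 42 with a 2000 ns timeout.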
2014-11-07 05:09:44 +00:00
|
|
|
if (n->gro_list) {
|
|
|
|
unsigned long timeout = 0;
|
net: less interrupt masking in NAPI
net_rx_action() can mask irqs a single time to transfer sd->poll_list
into a private list, for a very short duration.
Then, napi_complete() can avoid masking irqs again,
and net_rx_action() only needs to mask irqs again in the slow path.
This patch removes 2 pairs of irq mask/unmask per typical NAPI run,
more if multiple napi were triggered.
Note this also allows giving control back to the caller (do_softirq())
more often, so that other softirq handlers can be called a bit earlier,
or ksoftirqd can be woken up earlier under pressure.
This was developed while testing an alternative to RX interrupt
mitigation to reduce latencies while keeping or improving GRO
aggregation on fast NIC.
Idea is to test napi->gro_list at the end of a napi->poll() and
reschedule one NAPI poll, but after servicing a full round of
softirqs (timers, TX, rcu, ...). This will be allowed only if softirq
is currently serviced by idle task or ksoftirqd, and resched not needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-02 14:19:33 +00:00
|
|
|
|
2014-11-07 05:09:44 +00:00
|
|
|
if (work_done)
|
|
|
|
timeout = n->dev->gro_flush_timeout;
|
|
|
|
|
|
|
|
if (timeout)
|
|
|
|
hrtimer_start(&n->timer, ns_to_ktime(timeout),
|
|
|
|
HRTIMER_MODE_REL_PINNED);
|
|
|
|
else
|
|
|
|
napi_gro_flush(n, false);
|
|
|
|
}
|
2017-02-04 23:25:02 +00:00
|
|
|
if (unlikely(!list_empty(&n->poll_list))) {
|
2014-11-02 14:19:33 +00:00
|
|
|
/* If n->poll_list is not empty, we need to mask irqs */
|
|
|
|
local_irq_save(flags);
|
2017-02-04 23:25:02 +00:00
|
|
|
list_del_init(&n->poll_list);
|
2014-11-02 14:19:33 +00:00
|
|
|
local_irq_restore(flags);
|
|
|
|
}
|
2017-02-28 18:34:50 +00:00
|
|
|
|
|
|
|
do {
|
|
|
|
val = READ_ONCE(n->state);
|
|
|
|
|
|
|
|
WARN_ON_ONCE(!(val & NAPIF_STATE_SCHED));
|
|
|
|
|
|
|
|
new = val & ~(NAPIF_STATE_MISSED | NAPIF_STATE_SCHED);
|
|
|
|
|
|
|
|
/* If STATE_MISSED was set, leave STATE_SCHED set,
|
|
|
|
* because we will call napi->poll() one more time.
|
|
|
|
* This C code was suggested by Alexander Duyck to help gcc.
|
|
|
|
*/
|
|
|
|
new |= (val & NAPIF_STATE_MISSED) / NAPIF_STATE_MISSED *
|
|
|
|
NAPIF_STATE_SCHED;
|
|
|
|
} while (cmpxchg(&n->state, val, new) != val);
|
|
|
|
|
|
|
|
if (unlikely(val & NAPIF_STATE_MISSED)) {
|
|
|
|
__napi_schedule(n);
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2016-11-15 18:15:13 +00:00
|
|
|
return true;
|
2008-12-16 07:38:52 +00:00
|
|
|
}
|
2014-11-07 05:09:44 +00:00
|
|
|
EXPORT_SYMBOL(napi_complete_done);
|
2008-12-16 07:38:52 +00:00
|
|
|
|
2013-06-10 08:39:41 +00:00
|
|
|
/* must be called under rcu_read_lock(), as we don't take a reference */
|
2015-11-18 14:30:52 +00:00
|
|
|
static struct napi_struct *napi_by_id(unsigned int napi_id)
|
2013-06-10 08:39:41 +00:00
|
|
|
{
|
|
|
|
unsigned int hash = napi_id % HASH_SIZE(napi_hash);
|
|
|
|
struct napi_struct *napi;
|
|
|
|
|
|
|
|
hlist_for_each_entry_rcu(napi, &napi_hash[hash], napi_hash_node)
|
|
|
|
if (napi->napi_id == napi_id)
|
|
|
|
return napi;
|
|
|
|
|
|
|
|
return NULL;
|
|
|
|
}
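The bucket lookup in napi_by_id() follows a common chained-hash pattern. A minimal userspace sketch of the same idea is shown below; it assumes a plain singly linked chain in place of the kernel's RCU-protected hlist, and the `sk_` names and table size are hypothetical.

```c
#include <stddef.h>

#define SK_HASH_SIZE 256U	/* illustrative table size */

struct sk_entry {
	unsigned int id;
	struct sk_entry *next;	/* chain within one bucket */
};

static struct sk_entry *sk_table[SK_HASH_SIZE];

/* Insert at the head of the bucket selected by id modulo table size. */
static void sk_entry_add(struct sk_entry *e)
{
	unsigned int hash = e->id % SK_HASH_SIZE;

	e->next = sk_table[hash];
	sk_table[hash] = e;
}

/* Walk one bucket; returns NULL when the id is not present. */
static struct sk_entry *sk_entry_by_id(unsigned int id)
{
	struct sk_entry *e;

	for (e = sk_table[id % SK_HASH_SIZE]; e; e = e->next)
		if (e->id == id)
			return e;

	return NULL;
}
```

Unlike this sketch, the kernel version returns a pointer without taking a reference, which is why its callers must hold rcu_read_lock() for as long as they use the result.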
|
2015-11-18 14:30:52 +00:00
|
|
|
|
|
|
|
#if defined(CONFIG_NET_RX_BUSY_POLL)
|
2016-11-15 18:15:11 +00:00
|
|
|
|
2015-11-18 14:30:54 +00:00
|
|
|
#define BUSY_POLL_BUDGET 8
|
2016-11-15 18:15:11 +00:00
|
|
|
|
|
|
|
static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock)
|
|
|
|
{
|
|
|
|
int rc;
|
|
|
|
|
2017-02-28 18:34:50 +00:00
|
|
|
/* Busy polling means there is a high chance device driver hard irq
|
|
|
|
* could not grab NAPI_STATE_SCHED, and that NAPI_STATE_MISSED was
|
|
|
|
* set in napi_schedule_prep().
|
|
|
|
* Since we are about to call napi->poll() once more, we can safely
|
|
|
|
* clear NAPI_STATE_MISSED.
|
|
|
|
*
|
|
|
|
* Note: x86 could use a single "lock and ..." instruction
|
|
|
|
* to perform these two clear_bit()
|
|
|
|
*/
|
|
|
|
clear_bit(NAPI_STATE_MISSED, &napi->state);
|
2016-11-15 18:15:11 +00:00
|
|
|
clear_bit(NAPI_STATE_IN_BUSY_POLL, &napi->state);
|
|
|
|
|
|
|
|
local_bh_disable();
|
|
|
|
|
|
|
|
/* All we really want here is to re-enable device interrupts.
|
|
|
|
* Ideally, a new ndo_busy_poll_stop() could avoid another round.
|
|
|
|
*/
|
|
|
|
rc = napi->poll(napi, BUSY_POLL_BUDGET);
|
2017-08-25 13:04:32 +00:00
|
|
|
trace_napi_poll(napi, rc, BUSY_POLL_BUDGET);
|
2016-11-15 18:15:11 +00:00
|
|
|
netpoll_poll_unlock(have_poll_lock);
|
|
|
|
if (rc == BUSY_POLL_BUDGET)
|
|
|
|
__napi_schedule(napi);
|
|
|
|
local_bh_enable();
|
|
|
|
}
|
|
|
|
|
2017-03-24 17:08:24 +00:00
|
|
|
void napi_busy_loop(unsigned int napi_id,
|
|
|
|
bool (*loop_end)(void *, unsigned long),
|
|
|
|
void *loop_end_arg)
|
2015-11-18 14:30:52 +00:00
|
|
|
{
|
2017-03-24 17:08:24 +00:00
|
|
|
unsigned long start_time = loop_end ? busy_loop_current_time() : 0;
|
2016-11-15 18:15:11 +00:00
|
|
|
int (*napi_poll)(struct napi_struct *napi, int budget);
|
|
|
|
void *have_poll_lock = NULL;
|
2015-11-18 14:30:52 +00:00
|
|
|
struct napi_struct *napi;
|
2016-11-15 18:15:11 +00:00
|
|
|
|
|
|
|
restart:
|
|
|
|
napi_poll = NULL;
|
2015-11-18 14:30:52 +00:00
|
|
|
|
2015-11-18 14:30:53 +00:00
|
|
|
rcu_read_lock();
|
2015-11-18 14:30:52 +00:00
|
|
|
|
2017-03-24 17:07:53 +00:00
|
|
|
napi = napi_by_id(napi_id);
|
2015-11-18 14:30:52 +00:00
|
|
|
if (!napi)
|
|
|
|
goto out;
|
|
|
|
|
2016-11-15 18:15:11 +00:00
|
|
|
preempt_disable();
|
|
|
|
for (;;) {
|
2017-03-24 17:08:12 +00:00
|
|
|
int work = 0;
|
|
|
|
|
2015-11-18 14:30:53 +00:00
|
|
|
local_bh_disable();
|
2016-11-15 18:15:11 +00:00
|
|
|
if (!napi_poll) {
|
|
|
|
unsigned long val = READ_ONCE(napi->state);
|
|
|
|
|
|
|
|
/* If multiple threads are competing for this napi,
|
|
|
|
* we avoid dirtying napi->state as much as we can.
|
|
|
|
*/
|
|
|
|
if (val & (NAPIF_STATE_DISABLE | NAPIF_STATE_SCHED |
|
|
|
|
NAPIF_STATE_IN_BUSY_POLL))
|
|
|
|
goto count;
|
|
|
|
if (cmpxchg(&napi->state, val,
|
|
|
|
val | NAPIF_STATE_IN_BUSY_POLL |
|
|
|
|
NAPIF_STATE_SCHED) != val)
|
|
|
|
goto count;
|
|
|
|
have_poll_lock = netpoll_poll_lock(napi);
|
|
|
|
napi_poll = napi->poll;
|
|
|
|
}
|
2017-03-24 17:08:12 +00:00
|
|
|
work = napi_poll(napi, BUSY_POLL_BUDGET);
|
|
|
|
trace_napi_poll(napi, work, BUSY_POLL_BUDGET);
|
2016-11-15 18:15:11 +00:00
|
|
|
count:
|
2017-03-24 17:08:12 +00:00
|
|
|
if (work > 0)
|
2017-03-24 17:08:24 +00:00
|
|
|
__NET_ADD_STATS(dev_net(napi->dev),
|
2017-03-24 17:08:12 +00:00
|
|
|
LINUX_MIB_BUSYPOLLRXPACKETS, work);
|
2015-11-18 14:30:53 +00:00
|
|
|
local_bh_enable();
|
2015-11-18 14:30:52 +00:00
|
|
|
|
2017-03-24 17:08:24 +00:00
|
|
|
if (!loop_end || loop_end(loop_end_arg, start_time))
|
2016-11-15 18:15:11 +00:00
|
|
|
break;
|
2015-11-18 14:30:52 +00:00
|
|
|
|
2016-11-15 18:15:11 +00:00
|
|
|
if (unlikely(need_resched())) {
|
|
|
|
if (napi_poll)
|
|
|
|
busy_poll_stop(napi, have_poll_lock);
|
|
|
|
preempt_enable();
|
|
|
|
rcu_read_unlock();
|
|
|
|
cond_resched();
|
2017-03-24 17:08:24 +00:00
|
|
|
if (loop_end(loop_end_arg, start_time))
|
2017-03-24 17:08:12 +00:00
|
|
|
return;
|
2016-11-15 18:15:11 +00:00
|
|
|
goto restart;
|
|
|
|
}
|
Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"The tree got pretty big in this development cycle, but the net effect
is pretty good:
115 files changed, 673 insertions(+), 1522 deletions(-)
The main changes were:
- Rework and generalize the mutex code to remove per arch mutex
primitives. (Peter Zijlstra)
- Add vCPU preemption support: add an interface to query the
preemption status of vCPUs and use it in locking primitives - this
optimizes paravirt performance. (Pan Xinhui, Juergen Gross,
Christian Borntraeger)
- Introduce cpu_relax_yield() and remove cpu_relax_lowlatency() to
clean up and improve the s390 lock yielding machinery and its core
kernel impact. (Christian Borntraeger)
- Micro-optimize mutexes some more. (Waiman Long)
- Reluctantly add the to-be-deprecated mutex_trylock_recursive()
interface on a temporary basis, to give the DRM code more time to
get rid of its locking hacks. Any other users will be NAK-ed on
sight. (We turned off the deprecation warning for the time being to
not pollute the build log.) (Peter Zijlstra)
- Improve the rtmutex code a bit, in light of recent long lived
bugs/races. (Thomas Gleixner)
- Misc fixes, cleanups"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
x86/paravirt: Fix bool return type for PVOP_CALL()
x86/paravirt: Fix native_patch()
locking/ww_mutex: Use relaxed atomics
locking/rtmutex: Explain locking rules for rt_mutex_proxy_unlock()/init_proxy_locked()
locking/rtmutex: Get rid of RT_MUTEX_OWNER_MASKALL
x86/paravirt: Optimize native pv_lock_ops.vcpu_is_preempted()
locking/mutex: Break out of expensive busy-loop on {mutex,rwsem}_spin_on_owner() when owner vCPU is preempted
locking/osq: Break out of spin-wait busy waiting loop for a preempted vCPU in osq_lock()
Documentation/virtual/kvm: Support the vCPU preemption check
x86/xen: Support the vCPU preemption check
x86/kvm: Support the vCPU preemption check
x86/kvm: Support the vCPU preemption check
kvm: Introduce kvm_write_guest_offset_cached()
locking/core, x86/paravirt: Implement vcpu_is_preempted(cpu) for KVM and Xen guests
locking/spinlocks, s390: Implement vcpu_is_preempted(cpu)
locking/core, powerpc: Implement vcpu_is_preempted(cpu)
sched/core: Introduce the vcpu_is_preempted(cpu) interface
sched/wake_q: Rename WAKE_Q to DEFINE_WAKE_Q
locking/core: Provide common cpu_relax_yield() definition
locking/mutex: Don't mark mutex_trylock_recursive() as deprecated, temporarily
...
2016-12-12 18:48:02 +00:00
|
|
|
cpu_relax();
|
2016-11-15 18:15:11 +00:00
|
|
|
}
|
|
|
|
if (napi_poll)
|
|
|
|
busy_poll_stop(napi, have_poll_lock);
|
|
|
|
preempt_enable();
|
2015-11-18 14:30:52 +00:00
|
|
|
out:
|
2015-11-18 14:30:53 +00:00
|
|
|
rcu_read_unlock();
|
2015-11-18 14:30:52 +00:00
|
|
|
}
|
2017-03-24 17:08:24 +00:00
|
|
|
EXPORT_SYMBOL(napi_busy_loop);
|
2015-11-18 14:30:52 +00:00
|
|
|
|
|
|
|
#endif /* CONFIG_NET_RX_BUSY_POLL */
|
2013-06-10 08:39:41 +00:00
|
|
|
|
2016-11-08 19:07:28 +00:00
|
|
|
static void napi_hash_add(struct napi_struct *napi)
|
2013-06-10 08:39:41 +00:00
|
|
|
{
|
2015-11-18 14:31:00 +00:00
|
|
|
if (test_bit(NAPI_STATE_NO_BUSY_POLL, &napi->state) ||
|
|
|
|
test_and_set_bit(NAPI_STATE_HASHED, &napi->state))
|
2015-11-18 14:30:50 +00:00
|
|
|
return;
|
2013-06-10 08:39:41 +00:00
|
|
|
|
2015-11-18 14:30:50 +00:00
|
|
|
spin_lock(&napi_hash_lock);
|
2013-06-10 08:39:41 +00:00
|
|
|
|
2017-03-24 17:07:53 +00:00
|
|
|
/* 0..NR_CPUS range is reserved for sender_cpu use */
|
2015-11-18 14:30:50 +00:00
|
|
|
do {
|
2017-03-24 17:07:53 +00:00
|
|
|
if (unlikely(++napi_gen_id < MIN_NAPI_ID))
|
|
|
|
napi_gen_id = MIN_NAPI_ID;
|
2015-11-18 14:30:50 +00:00
|
|
|
} while (napi_by_id(napi_gen_id));
|
|
|
|
napi->napi_id = napi_gen_id;
|
2013-06-10 08:39:41 +00:00
|
|
|
|
2015-11-18 14:30:50 +00:00
|
|
|
hlist_add_head_rcu(&napi->napi_hash_node,
|
|
|
|
&napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
|
2013-06-10 08:39:41 +00:00
|
|
|
|
2015-11-18 14:30:50 +00:00
|
|
|
spin_unlock(&napi_hash_lock);
|
2013-06-10 08:39:41 +00:00
|
|
|
}
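The ID allocation loop in napi_hash_add() above can be sketched in user space as follows. This is a minimal, single-threaded illustration under stated assumptions, not the kernel code: every `sk_` name is hypothetical, a small fixed table stands in for the napi_by_id() hash lookup, the reserved bound of 4 is arbitrary, and the spinlock that serializes the real function is omitted.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_MIN_ID   4u	/* hypothetical: IDs 0..3 are reserved */
#define SK_MAX_USED 8u	/* hypothetical table size for the sketch */

static unsigned int sk_used[SK_MAX_USED];	/* IDs currently handed out */
static unsigned int sk_nused;
static unsigned int sk_gen_id = SK_MIN_ID;	/* last ID that was generated */

/* Stand-in for the lookup that tells whether an ID is already taken. */
static bool sk_id_in_use(unsigned int id)
{
	for (unsigned int i = 0; i < sk_nused; i++)
		if (sk_used[i] == id)
			return true;
	return false;
}

/* Returns a fresh ID >= SK_MIN_ID, or 0 if the sketch table is full. */
static unsigned int sk_alloc_id(void)
{
	if (sk_nused == SK_MAX_USED)
		return 0;

	do {
		/* Unsigned overflow wraps to 0, which falls in the
		 * reserved range, so restart from SK_MIN_ID.
		 */
		if (++sk_gen_id < SK_MIN_ID)
			sk_gen_id = SK_MIN_ID;
	} while (sk_id_in_use(sk_gen_id));

	sk_used[sk_nused++] = sk_gen_id;
	return sk_gen_id;
}
```

The counter is incremented before it is tested, so a wrapped counter is pulled back to the lower bound and any ID that is still registered is skipped rather than reused.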
|
|
|
|
|
|
|
|
/* Warning : caller is responsible to make sure rcu grace period
|
|
|
|
* is respected before freeing memory containing @napi
|
|
|
|
*/
|
2015-11-18 14:31:02 +00:00
|
|
|
bool napi_hash_del(struct napi_struct *napi)
|
2013-06-10 08:39:41 +00:00
|
|
|
{
|
2015-11-18 14:31:02 +00:00
|
|
|
bool rcu_sync_needed = false;
|
|
|
|
|
2013-06-10 08:39:41 +00:00
|
|
|
spin_lock(&napi_hash_lock);
|
|
|
|
|
2015-11-18 14:31:02 +00:00
|
|
|
if (test_and_clear_bit(NAPI_STATE_HASHED, &napi->state)) {
|
|
|
|
rcu_sync_needed = true;
|
2013-06-10 08:39:41 +00:00
|
|
|
hlist_del_rcu(&napi->napi_hash_node);
|
2015-11-18 14:31:02 +00:00
|
|
|
}
|
2013-06-10 08:39:41 +00:00
|
|
|
spin_unlock(&napi_hash_lock);
|
2015-11-18 14:31:02 +00:00
|
|
|
return rcu_sync_needed;
|
2013-06-10 08:39:41 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(napi_hash_del);
|
|
|
|
|
net: gro: add a per device gro flush timer
Tuning coalescing parameters on NIC can be really hard.
Servers can handle both bulk and RPC like traffic, with conflicting
goals : bulk flows want as big GRO packets as possible, RPC want minimal
latencies.
To reach big GRO packets on 10Gbe NIC, one can use :
ethtool -C eth0 rx-usecs 4 rx-frames 44
But this penalizes rpc sessions, with an increase of latencies, up to
50% in some cases, as NICs generally do not force an interrupt when
a packet with TCP Push flag is received.
Some NICs do not have an absolute timer, only a timer rearmed for every
incoming packet.
This patch uses a different strategy: let the GRO stack decide what to do,
based on the traffic pattern.
Packets with the Push flag won't be delayed.
Packets without the Push flag might be held in the GRO engine, if we keep
receiving data.
This new mechanism is off by default, and is enabled by setting
/sys/class/net/ethX/gro_flush_timeout to a value in nanoseconds.
To fully enable this mechanism, drivers should use napi_complete_done()
instead of napi_complete().
Tested:
Ran 200 netperf TCP_STREAM from A to B (10Gbe mlx4 link, 8 RX queues)
Without this feature, we send back about 305,000 ACK per second.
GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet)
Setting a timer of 2000 nsec is enough to increase GRO packet sizes
and reduce number of ACK packets. (811/19.2 = 42)
Receiver performs fewer calls to upper stacks and fewer wakeups.
This also reduces cpu usage on the sender, as it receives fewer ACK
packets.
Note that reducing the number of wakeups increases cpu efficiency, but can
decrease QPS, as applications won't have the chance to warm up cpu caches
doing a partial read of RPC requests/answers if they fit in one skb.
B:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average: eth0 811269.80 305732.30 1199462.57 19705.72 0.00
0.00 0.50
B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
B:~# sar -n DEV 1 10 | grep eth0 | tail -1
Average: eth0 811577.30 19230.80 1199916.51 1239.80 0.00
0.00 0.50
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
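The aggregation ratios quoted in the message above can be reproduced from the two sar samples. The sketch below assumes, as the message's own arithmetic appears to, that the receiver sends roughly one ACK per delivered GRO packet, so received packets divided by transmitted packets approximates segments per GRO packet. The helper name `sk_gro_ratio` is hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical helper: approximate segments per GRO packet on the
 * receiver, assuming about one ACK is sent per delivered GRO packet.
 */
static double sk_gro_ratio(double rx_pkts_per_sec, double acks_per_sec)
{
	return rx_pkts_per_sec / acks_per_sec;
}
```

Fed with the rxpck/s and txpck/s columns of the two samples, this gives about 2.65 before the timer is set and about 42 with a 2000 ns timer, matching the figures in the message.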
2014-11-07 05:09:44 +00:00
|
|
|
static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
|
|
|
|
{
|
|
|
|
struct napi_struct *napi;
|
|
|
|
|
|
|
|
napi = container_of(timer, struct napi_struct, timer);
|
net: solve a NAPI race
While playing with mlx4 hardware timestamping of RX packets, I found
that some packets were received by TCP stack with a ~200 ms delay...
Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in BH handler, I was confident it was not
a sender issue, or a drop in the network.
This would happen with a very low probability, but hurting RPC
workloads.
A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.
Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
while IRQ are not disabled, we might have later an IRQ firing and
finding this bit set, right before napi_complete_done() clears it.
This can happen with busy polling users, or if gro_flush_timeout is
used. But some other uses of napi_schedule() in drivers can cause this
as well.
thread 1                                 thread 2 (could be on same cpu, or not)
// busy polling or napi_watchdog()
napi_schedule();
...
napi->poll()
device polling:
read 2 packets from ring buffer
                                         Additional 3rd packet is available.
                                         device hard irq
                                         // does nothing because
                                         // NAPI_STATE_SCHED bit is owned by thread 1
                                         napi_schedule();
napi_complete_done(napi, 2);
rearm_irq();
Note that rearm_irq() will not force the device to send an additional
IRQ for the packet it already signaled (3rd packet in my example)
This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
can set if it could not grab NAPI_STATE_SCHED
Then napi_complete_done() properly reschedules the napi to make sure
we do not miss something.
Since we manipulate multiple bits at once, use cmpxchg() like in
sk_busy_loop() to provide proper transactions.
In v2, I changed napi_watchdog() to use a relaxed variant of
napi_schedule_prep() : No need to set NAPI_STATE_MISSED from this point.
In v3, I added more details in the changelog and clears
NAPI_STATE_MISSED in busy_poll_stop()
In v4, I added the ideas given by Alexander Duyck in v3 review
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
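The handshake described in the message above can be sketched in user space with C11 atomics. This is an illustration under stated assumptions, not the kernel implementation: the `sk_`/`SK_` names and bit values are hypothetical, and atomic_compare_exchange_weak() stands in for the kernel's cmpxchg(). A caller that fails to grab the scheduled bit records a missed event, and the completion path keeps the instance scheduled when it finds that record.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum {
	SK_SCHED   = 1u << 0,	/* someone owns the poll */
	SK_DISABLE = 1u << 1,	/* instance is being disabled */
	SK_MISSED  = 1u << 2,	/* a schedule attempt lost the race */
};

struct sk_napi {
	atomic_uint state;
};

/* Returns true if the caller acquired SK_SCHED and should schedule a poll. */
static bool sk_schedule_prep(struct sk_napi *n)
{
	unsigned int val = atomic_load(&n->state);
	unsigned int next;

	do {
		if (val & SK_DISABLE)
			return false;
		next = val | SK_SCHED;
		/* Already owned: leave a note for the current owner. */
		if (val & SK_SCHED)
			next |= SK_MISSED;
		/* On failure, val is refreshed and the loop retries. */
	} while (!atomic_compare_exchange_weak(&n->state, &val, next));

	return !(val & SK_SCHED);
}

/* Returns true if polling is really finished, false if the caller
 * must poll again because an event was missed in the meantime.
 */
static bool sk_complete(struct sk_napi *n)
{
	unsigned int val = atomic_load(&n->state);
	unsigned int next;

	do {
		next = val & ~(SK_MISSED | SK_SCHED);
		/* Keep ownership if a schedule attempt was missed. */
		if (val & SK_MISSED)
			next |= SK_SCHED;
	} while (!atomic_compare_exchange_weak(&n->state, &val, next));

	return !(val & SK_MISSED);
}
```

Both bits are updated in one compare-and-swap, so no window exists in which the owner clears the scheduled bit without also observing a missed request.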
2017-02-28 18:34:50 +00:00
|
|
|
|
|
|
|
/* Note : we use a relaxed variant of napi_schedule_prep() not setting
|
|
|
|
* NAPI_STATE_MISSED, since we do not react to a device IRQ.
|
|
|
|
*/
|
|
|
|
if (napi->gro_list && !napi_disable_pending(napi) &&
|
|
|
|
!test_and_set_bit(NAPI_STATE_SCHED, &napi->state))
|
|
|
|
__napi_schedule_irqoff(napi);
|
2014-11-07 05:09:44 +00:00
|
|
|
|
|
|
|
return HRTIMER_NORESTART;
|
|
|
|
}
|
|
|
|
|
net: Add Generic Receive Offload infrastructure
This patch adds the top-level GRO (Generic Receive Offload) infrastructure.
This is pretty similar to LRO except that this is protocol-independent.
Instead of holding packets in an lro_mgr structure, they're now held in
napi_struct.
For drivers that intend to use this, they can set the NETIF_F_GRO bit and
call napi_gro_receive instead of netif_receive_skb or just call netif_rx.
The latter will call napi_receive_skb automatically. When napi_gro_receive
is used, the driver must either call napi_complete/napi_rx_complete, or
call napi_gro_flush in softirq context if the driver uses the primitives
__napi_complete/__napi_rx_complete.
Protocols will set the gro_receive and gro_complete function pointers in
order to participate in this scheme.
In addition to the packet, gro_receive will get a list of currently held
packets. Each packet in the list has a same_flow field which is non-zero
if it is a potential match for the new packet. For each packet that may
match, they also have a flush field which is non-zero if the held packet
must not be merged with the new packet.
Once gro_receive has determined that the new skb matches a held packet,
the held packet may be processed immediately if the new skb cannot be
merged with it. In this case gro_receive should return the pointer to
the existing skb in gro_list. Otherwise the new skb should be merged into
the existing packet and NULL should be returned, unless the new skb makes
it impossible for any further merges to be made (e.g., FIN packet) where
the merged skb should be returned.
Whenever the skb is merged into an existing entry, the gro_receive
function should set NAPI_GRO_CB(skb)->same_flow. Note that if an skb
merely matches an existing entry but can't be merged with it, then
this shouldn't be set.
If gro_receive finds it pointless to hold the new skb for future merging,
it should set NAPI_GRO_CB(skb)->flush.
Held packets will be flushed by napi_gro_flush which is called by
napi_complete and napi_rx_complete.
Currently held packets are stored in a singly linked list just like LRO.
The list is limited to a maximum of 8 entries. In future, this may be
expanded to use a hash table to allow more flows to be held for merging.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-16 07:38:52 +00:00
|
|
|
void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
|
|
|
|
int (*poll)(struct napi_struct *, int), int weight)
|
|
|
|
{
|
|
|
|
INIT_LIST_HEAD(&napi->poll_list);
|
2014-11-07 05:09:44 +00:00
|
|
|
hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
|
|
|
|
napi->timer.function = napi_watchdog;
|
2009-02-08 18:00:36 +00:00
|
|
|
napi->gro_count = 0;
|
2008-12-16 07:38:52 +00:00
|
|
|
napi->gro_list = NULL;
|
2009-01-05 00:13:40 +00:00
|
|
|
napi->skb = NULL;
|
2008-12-16 07:38:52 +00:00
|
|
|
napi->poll = poll;
|
2013-03-05 15:57:22 +00:00
|
|
|
if (weight > NAPI_POLL_WEIGHT)
|
|
|
|
pr_err_once("netif_napi_add() called with weight %d on device %s\n",
|
|
|
|
weight, dev->name);
|
2008-12-16 07:38:52 +00:00
|
|
|
napi->weight = weight;
|
|
|
|
list_add(&napi->dev_list, &dev->napi_list);
|
|
|
|
napi->dev = dev;
|
2009-01-05 00:13:40 +00:00
|
|
|
#ifdef CONFIG_NETPOLL
|
2008-12-16 07:38:52 +00:00
|
|
|
napi->poll_owner = -1;
|
|
|
|
#endif
|
|
|
|
set_bit(NAPI_STATE_SCHED, &napi->state);
|
2015-11-18 14:31:03 +00:00
|
|
|
napi_hash_add(napi);
|
2008-12-16 07:38:52 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netif_napi_add);
|
|
|
|
|
2014-11-07 05:09:44 +00:00
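The aggregation ratios quoted in the commit message above (about 2.65 and 42 segments per GRO packet) can be rechecked from the two sar samples. The sketch below is a userspace illustration only: the helper names are hypothetical, and it follows the commit message in treating ACKs sent per second as a proxy for GRO packets delivered per second.

```c
#include <stdio.h>

/* Hypothetical helper (not a kernel API): received segments per ACK sent,
 * used in the commit message as a proxy for segments per GRO packet.
 */
static double gro_segs_per_ack(double rx_pkts_per_sec, double tx_acks_per_sec)
{
	return rx_pkts_per_sec / tx_acks_per_sec;
}

/* Print both ratios using the rxpck/s and txpck/s averages quoted above. */
void print_gro_ratios(void)
{
	printf("timer off:     %.2f\n", gro_segs_per_ack(811269.80, 305732.30));
	printf("timer 2000 ns: %.2f\n", gro_segs_per_ack(811577.30, 19230.80));
}
```

Calling print_gro_ratios() from a small main() is enough to reproduce the two figures; the inputs are the sar averages copied from the message, nothing is measured here.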
void napi_disable(struct napi_struct *n)
{
	might_sleep();
	set_bit(NAPI_STATE_DISABLE, &n->state);

	while (test_and_set_bit(NAPI_STATE_SCHED, &n->state))
		msleep(1);
netpoll: Close race condition between poll_one_napi and napi_disable
Drivers might call napi_disable while not holding the napi instance poll_lock.
In those instances, it's possible for a race condition to exist between
poll_one_napi and napi_disable. That is to say, poll_one_napi only tests the
NAPI_STATE_SCHED bit to see if there is work to do during a poll, and as such
the following may happen:
CPU0                            CPU1
ndo_tx_timeout                  napi_poll_dev
 napi_disable                    poll_one_napi
  test_and_set_bit (ret 0)
                                  test_bit (ret 1)
   reset adapter                   napi_poll_routine
If the adapter gets a tx timeout without a napi instance scheduled, it's possible
for the adapter to think it has exclusive access to the hardware (as the napi
instance is now scheduled via the napi_disable call), while the netpoll code
thinks there is simply work to do. The result is parallel hardware access
leading to corrupt data structures in the driver, and a crash.
Additionally, there is another, more critical race between netpoll and
napi_disable. The disabled napi state is actually identical to the scheduled
state for a given napi instance. The implication being that, if a napi instance
is disabled, a netconsole instance would see the napi state of the device as
having been scheduled, and poll it, likely while the driver was doing something
requiring exclusive access. In the case above, it's fairly clear that not having
the rings in a state ready to be polled will cause any number of crashes.
The fix should be pretty easy. netpoll uses its own bit to indicate that
the napi instance is in a state of being serviced by netpoll (NAPI_STATE_NPSVC).
We can just gate disabling on that bit as well as the sched bit. That should
prevent netpoll from conducting a napi poll if we convert its set bit to a
test_and_set_bit operation to provide mutual exclusion.
Change notes:
V2)
Remove a trailing whitespace
Resubmit with proper subject prefix
V3)
Clean up spacing nits
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: jmaxwell@redhat.com
Tested-by: jmaxwell@redhat.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 18:57:58 +00:00
	while (test_and_set_bit(NAPI_STATE_NPSVC, &n->state))
		msleep(1);
2014-11-07 05:09:44 +00:00

	hrtimer_cancel(&n->timer);

	clear_bit(NAPI_STATE_DISABLE, &n->state);
}
EXPORT_SYMBOL(napi_disable);
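napi_disable() above waits on both NAPI_STATE_SCHED and NAPI_STATE_NPSVC with test_and_set_bit(), and the netpoll commit message describes converting netpoll's plain set of NPSVC into a test_and_set_bit() so that only one side can win. The following is a minimal userspace sketch of that mutual exclusion using C11 atomics. The demo_* names and bit numbers are hypothetical stand-ins, not the kernel's definitions, and the sleeping retry loop is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Bit numbers local to this sketch; the kernel's NAPI_STATE_* values differ. */
enum { DEMO_STATE_SCHED = 0, DEMO_STATE_NPSVC = 1 };

/* Userspace stand-in for test_and_set_bit(): atomically set bit nr and
 * report whether it was already set.
 */
static bool demo_test_and_set_bit(int nr, atomic_ulong *state)
{
	unsigned long mask = 1UL << nr;

	return (atomic_fetch_or(state, mask) & mask) != 0;
}

/* True only for the caller that flipped the bit from 0 to 1. */
bool demo_try_claim(int nr, atomic_ulong *state)
{
	return !demo_test_and_set_bit(nr, state);
}

/* Userspace stand-in for clear_bit(): give the bit back. */
void demo_release(int nr, atomic_ulong *state)
{
	atomic_fetch_and(state, ~(1UL << nr));
}
```

If a disabling path and a poller both call demo_try_claim(DEMO_STATE_NPSVC, &state), exactly one sees true; the other has to wait or skip, which is the property the plain test_bit() check in the race diagram lacked.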

2015-11-18 14:31:03 +00:00
/* Must be called in process context */
net: Add Generic Receive Offload infrastructure
This patch adds the top-level GRO (Generic Receive Offload) infrastructure.
This is pretty similar to LRO except that this is protocol-independent.
Instead of holding packets in an lro_mgr structure, they're now held in
napi_struct.
For drivers that intend to use this, they can set the NETIF_F_GRO bit and
call napi_gro_receive instead of netif_receive_skb or just call netif_rx.
The latter will call napi_receive_skb automatically. When napi_gro_receive
is used, the driver must either call napi_complete/napi_rx_complete, or
call napi_gro_flush in softirq context if the driver uses the primitives
__napi_complete/__napi_rx_complete.
Protocols will set the gro_receive and gro_complete function pointers in
order to participate in this scheme.
In addition to the packet, gro_receive will get a list of currently held
packets. Each packet in the list has a same_flow field which is non-zero
if it is a potential match for the new packet. For each packet that may
match, they also have a flush field which is non-zero if the held packet
must not be merged with the new packet.
Once gro_receive has determined that the new skb matches a held packet,
the held packet may be processed immediately if the new skb cannot be
merged with it. In this case gro_receive should return the pointer to
the existing skb in gro_list. Otherwise the new skb should be merged into
the existing packet and NULL should be returned, unless the new skb makes
it impossible for any further merges to be made (e.g., FIN packet) where
the merged skb should be returned.
Whenever the skb is merged into an existing entry, the gro_receive
function should set NAPI_GRO_CB(skb)->same_flow. Note that if an skb
merely matches an existing entry but can't be merged with it, then
this shouldn't be set.
If gro_receive finds it pointless to hold the new skb for future merging,
it should set NAPI_GRO_CB(skb)->flush.
Held packets will be flushed by napi_gro_flush which is called by
napi_complete and napi_rx_complete.
Currently held packets are stored in a singly linked list just like LRO.
The list is limited to a maximum of 8 entries. In future, this may be
expanded to use a hash table to allow more flows to be held for merging.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-16 07:38:52 +00:00
void netif_napi_del(struct napi_struct *napi)
{
2015-11-18 14:31:03 +00:00
	might_sleep();
	if (napi_hash_del(napi))
		synchronize_net();
2008-12-26 09:35:35 +00:00
	list_del_init(&napi->dev_list);
2009-04-16 09:02:07 +00:00
	napi_free_frags(napi);
2008-12-16 07:38:52 +00:00

2013-12-20 22:29:08 +00:00
	kfree_skb_list(napi->gro_list);
2008-12-16 07:38:52 +00:00
	napi->gro_list = NULL;
2009-02-08 18:00:36 +00:00
	napi->gro_count = 0;
2008-12-16 07:38:52 +00:00
}
EXPORT_SYMBOL(netif_napi_del);

2014-12-20 20:16:21 +00:00
static int napi_poll(struct napi_struct *n, struct list_head *repoll)
{
	void *have;
	int work, weight;

	list_del_init(&n->poll_list);

	have = netpoll_poll_lock(n);

	weight = n->weight;

	/* This NAPI_STATE_SCHED test is for avoiding a race
	 * with netpoll's poll_napi(). Only the entity which
	 * obtains the lock and sees NAPI_STATE_SCHED set will
	 * actually make the ->poll() call. Therefore we avoid
	 * accidentally calling ->poll() when NAPI is not scheduled.
	 */
	work = 0;
	if (test_bit(NAPI_STATE_SCHED, &n->state)) {
		work = n->poll(n, weight);
2016-07-07 16:01:32 +00:00
		trace_napi_poll(n, work, weight);
2014-12-20 20:16:21 +00:00
	}

	WARN_ON_ONCE(work > weight);

	if (likely(work < weight))
		goto out_unlock;

	/* Drivers must not modify the NAPI state if they
	 * consume the entire weight. In such cases this code
	 * still "owns" the NAPI instance and therefore can
	 * move the instance around on the list at-will.
	 */
	if (unlikely(napi_disable_pending(n))) {
		napi_complete(n);
		goto out_unlock;
	}

	if (n->gro_list) {
		/* flush too old packets
		 * If HZ < 1000, flush all packets.
		 */
		napi_gro_flush(n, HZ >= 1000);
	}

2014-12-20 20:16:22 +00:00
	/* Some drivers may have called napi_schedule
	 * prior to exhausting their budget.
	 */
	if (unlikely(!list_empty(&n->poll_list))) {
		pr_warn_once("%s: Budget exhausted after napi rescheduled\n",
			     n->dev ? n->dev->name : "backlog");
		goto out_unlock;
	}

2014-12-20 20:16:21 +00:00
	list_add_tail(&n->poll_list, repoll);

out_unlock:
	netpoll_poll_unlock(have);

	return work;
}

2016-06-20 18:42:34 +00:00
static __latent_entropy void net_rx_action(struct softirq_action *h)
2005-04-16 22:20:36 +00:00
{
2014-08-17 17:30:35 +00:00
	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
2017-04-19 16:37:10 +00:00
	unsigned long time_limit = jiffies +
		usecs_to_jiffies(netdev_budget_usecs);
2005-06-24 03:14:40 +00:00
	int budget = netdev_budget;
net: less interrupt masking in NAPI
net_rx_action() can mask irqs a single time to transfer sd->poll_list
into a private list, for a very short duration.
Then, napi_complete() can avoid masking irqs again,
and net_rx_action() only needs to mask irq again in slow path.
This patch removes 2 pairs of irq mask/unmask per typical NAPI run,
more if multiple napi were triggered.
Note this also allows giving control back to caller (do_softirq())
more often, so that other softirq handlers can be called a bit earlier,
or ksoftirqd can be woken up earlier under pressure.
This was developed while testing an alternative to RX interrupt
mitigation to reduce latencies while keeping or improving GRO
aggregation on fast NIC.
Idea is to test napi->gro_list at the end of a napi->poll() and
reschedule one NAPI poll, but after servicing a full round of
softirqs (timers, TX, rcu, ...). This will be allowed only if softirq
is currently serviced by idle task or ksoftirqd, and resched not needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-02 14:19:33 +00:00
	LIST_HEAD(list);
	LIST_HEAD(repoll);
2005-08-12 02:27:43 +00:00

2005-04-16 22:20:36 +00:00
	local_irq_disable();
2014-11-02 14:19:33 +00:00
	list_splice_init(&sd->poll_list, &list);
	local_irq_enable();
2005-04-16 22:20:36 +00:00

2014-12-20 20:16:25 +00:00
	for (;;) {
[NET]: Make NAPI polling independent of struct net_device objects.
Several devices have multiple independent RX queues per net
device, and some have a single interrupt doorbell for several
queues.
In either case, it's easier to support layouts like that if the
structure representing the poll is independent from the net
device itself.
The signature of the ->poll() call back goes from:
int foo_poll(struct net_device *dev, int *budget)
to
int foo_poll(struct napi_struct *napi, int budget)
The caller is returned the number of RX packets processed (or
the number of "NAPI credits" consumed if you want to get
abstract). The callee no longer messes around bumping
dev->quota, *budget, etc. because that is all handled in the
caller upon return.
The napi_struct is to be embedded in the device driver private data
structures.
Furthermore, it is the driver's responsibility to disable all NAPI
instances in its ->stop() device close handler. Since the
napi_struct is privatized into the driver's private data structures,
only the driver knows how to get at all of the napi_struct instances
it may have per-device.
With lots of help and suggestions from Rusty Russell, Roland Dreier,
Michael Chan, Jeff Garzik, and Jamal Hadi Salim.
Bug fixes from Thomas Graf, Roland Dreier, Peter Zijlstra,
Joseph Fannin, Scott Wood, Hans J. Koch, and Michael Chan.
[ Ported to current tree and all drivers converted. Integrated
Stephen's follow-on kerneldoc additions, and restored poll_list
handling to the old style to fix mutual exclusion issues. -DaveM ]
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-03 23:41:36 +00:00
		struct napi_struct *n;
2005-04-16 22:20:36 +00:00

2014-12-20 20:16:25 +00:00
		if (list_empty(&list)) {
			if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll))
2016-11-23 16:44:56 +00:00
				goto out;
2014-12-20 20:16:25 +00:00
			break;
		}

2014-12-20 20:16:24 +00:00
		n = list_first_entry(&list, struct napi_struct, poll_list);
		budget -= napi_poll(n, &repoll);

2014-11-02 14:19:33 +00:00
		/* If softirq window is exhausted then punt.
2008-11-04 01:14:38 +00:00
		 * Allow this to run for 2 jiffies since which will allow
		 * an average latency of 1.5/HZ.
2007-10-03 23:41:36 +00:00
		 */
2014-12-20 20:16:25 +00:00
		if (unlikely(budget <= 0 ||
			     time_after_eq(jiffies, time_limit))) {
			sd->time_squeeze++;
			break;
		}
2005-04-16 22:20:36 +00:00
	}
2014-11-02 14:19:33 +00:00

	local_irq_disable();

	list_splice_tail_init(&sd->poll_list, &list);
	list_splice_tail(&repoll, &list);
	list_splice(&list, &sd->poll_list);
	if (!list_empty(&sd->poll_list))
		__raise_softirq_irqoff(NET_RX_SOFTIRQ);

2010-04-22 07:22:45 +00:00
	net_rps_action_and_irq_enable(sd);
2016-11-23 16:44:56 +00:00
out:
	__kfree_skb_flush();
2005-04-16 22:20:36 +00:00
}
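The loop in net_rx_action() above stops when the poll list is empty, when the packet budget is used up, or when the time limit passes, and only the last two count as a squeeze. A simplified userspace model of that control flow, assuming a single NAPI instance and one simulated tick per poll round, might look like this. All demo_* names, the result struct and the tick model are hypothetical; the kernel works on jiffies and a list of instances.

```c
#include <stdbool.h>

/* Hypothetical result record for this sketch; the kernel itself only
 * increments sd->time_squeeze.
 */
struct demo_rx_result {
	int polls;		/* number of poll rounds executed */
	int budget_left;	/* remaining budget when the loop stopped */
	bool squeezed;		/* true if stopped on budget or time */
};

/* Poll up to 'weight' packets per round until the pending work, the budget
 * or the simulated tick limit runs out. One tick elapses per round.
 */
struct demo_rx_result demo_rx_action(int pending, int weight, int budget,
				     int tick_limit)
{
	struct demo_rx_result res = { 0, budget, false };
	int now = 0;

	for (;;) {
		int work;

		/* Mirrors the list_empty(&list) check at the top of the loop. */
		if (pending == 0)
			break;

		work = pending < weight ? pending : weight;
		pending -= work;
		res.budget_left -= work;
		res.polls++;
		now++;

		/* Mirrors the budget / time_limit check at the bottom. */
		if (res.budget_left <= 0 || now >= tick_limit) {
			res.squeezed = true;
			break;
		}
	}
	return res;
}
```

As in the original, the squeeze check runs after each poll, so a round that consumes the last of the budget is counted as a squeeze even when no work remains.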

2013-08-28 21:25:04 +00:00
struct netdev_adjacent {
2013-01-03 22:48:49 +00:00
	struct net_device *dev;
net: add lower_dev_list to net_device and make a full mesh
This patch adds lower_dev_list list_head to net_device, which is the same
as upper_dev_list, only for lower devices, and begins to use it in the same
way as the upper list.
It also changes the way the whole adjacent device lists work - now they
contain *all* of upper/lower devices, not only the first level. The first
level devices are distinguished by the bool neighbour field in
netdev_adjacent, also added by this patch.
There are cases when a device can be added several times to the adjacent
list, the simplest would be:
/---- eth0.10 ---\
eth0- --- bond0
\---- eth0.20 ---/
where both bond0 and eth0 'see' each other in the adjacent lists two times.
To avoid duplication of netdev_adjacent structures ref_nr is being kept as
the number of times the device was added to the list.
The 'full view' is achieved by adding, on link creation, all of the
upper_dev's upper_dev_list devices as upper devices to all of the
lower_dev's lower_dev_list devices (and to the lower_dev itself), and vice
versa. On unlink they are removed using the same logic.
I've tested it with thousands of vlans/bonds/bridges, everything works ok and
no observable lags even on a huge number of interfaces.
Memory footprint for 128 devices interconnected with each other via both
upper and lower (which is impossible, but for the comparison) lists would be:
128*128*2*sizeof(netdev_adjacent) = 1.5MB
but in the real world we usually have at most several devices with slaves
and a lot of vlans, so the footprint will be much lower.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-28 21:25:05 +00:00

	/* upper master flag, there can only be one master device per list */
2013-01-03 22:48:49 +00:00
	bool master;
2013-08-28 21:25:05 +00:00

	/* counter for the number of times this device was added to us */
	u16 ref_nr;

net: add netdev_adjacent->private and allow to use it
Currently, even though we can access any linked device, we can't attach
anything to it, which is vital to properly manage them.
To fix this, add a new void *private to netdev_adjacent and functions
setting/getting it (per link), so that we can save, for example, bonding's
slave structures there, per slave device.
netdev_master_upper_dev_link_private(dev, upper_dev, private) links dev to
upper dev and populates the neighbour link only with private.
netdev_lower_dev_get_private{,_rcu}() returns the private, if found.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-25 07:20:09 +00:00
|
|
|
/* private field for the users */
|
|
|
|
void *private;
|
|
|
|
|
2013-01-03 22:48:49 +00:00
|
|
|
struct list_head list;
|
|
|
|
struct rcu_head rcu;
|
|
|
|
};
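The ref_nr field above counts how many times the same adjacent device was added, so one entry can stand for several paths (for example eth0 reaching bond0 through both eth0.10 and eth0.20). The following is a minimal userspace sketch of that counting idea only. The names adj_find, adj_insert, adj_remove and ref_nr_example are hypothetical, and a singly linked list replaces the kernel's list_head for brevity; it is not the kernel's insert/remove code.

```c
#include <assert.h>
#include <stdlib.h>

struct net_device { const char *name; };

/* Trimmed-down entry: one per (device, adjacent device) pair. */
struct adj {
	struct net_device *dev;
	unsigned short ref_nr;	/* times this device was added */
	struct adj *next;
};

static struct adj *adj_find(struct adj *head, struct net_device *dev)
{
	for (; head; head = head->next)
		if (head->dev == dev)
			return head;
	return NULL;
}

/* Add dev to the list, or bump ref_nr if it is already present.
 * Returns 0 on success, -1 on allocation failure. */
static int adj_insert(struct adj **head, struct net_device *dev)
{
	struct adj *a = adj_find(*head, dev);

	if (a) {
		a->ref_nr++;
		return 0;
	}
	a = malloc(sizeof(*a));
	if (!a)
		return -1;
	a->dev = dev;
	a->ref_nr = 1;
	a->next = *head;
	*head = a;
	return 0;
}

/* Drop one reference; free the entry when the last one goes. */
static void adj_remove(struct adj **head, struct net_device *dev)
{
	struct adj **pp;

	for (pp = head; *pp; pp = &(*pp)->next) {
		struct adj *a = *pp;

		if (a->dev != dev)
			continue;
		if (--a->ref_nr == 0) {
			*pp = a->next;
			free(a);
		}
		return;
	}
}

/* Returns 1 if two inserts share one entry and two removes free it. */
static int ref_nr_example(void)
{
	struct net_device bond0 = { "bond0" };
	struct adj *uppers = NULL;	/* eth0's view of its upper devices */
	int ok;

	/* bond0 is reachable via eth0.10 and via eth0.20 */
	if (adj_insert(&uppers, &bond0) || adj_insert(&uppers, &bond0))
		return 0;
	ok = uppers && uppers->ref_nr == 2 && uppers->next == NULL;
	adj_remove(&uppers, &bond0);
	ok = ok && uppers && uppers->ref_nr == 1;
	adj_remove(&uppers, &bond0);
	return ok && uppers == NULL;
}
```

Counting references this way keeps the list free of duplicate entries while still letting each unlink undo exactly one earlier link.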
|
|
|
|
|
2015-09-24 08:59:05 +00:00
|
|
|
static struct netdev_adjacent *__netdev_find_adj(struct net_device *adj_dev,
|
net: add adj_list to save only neighbours
Currently, we distinguish neighbours (first-level linked devices) from
non-neighbours by the neighbour bool in the netdev_adjacent. This could be
quite time-consuming in case we would like to traverse *only* through
neighbours - cause we'd have to traverse through all devices and check for
this flag, and in a (quite common) scenario where we have lots of vlans on
top of bridge, which is on top of a bond - the bonding would have to go
through all those vlans to get its upper neighbour linked devices.
This situation is really unpleasant, cause there are already a lot of cases
when a device with slaves needs to go through them in hot path.
To fix this, introduce a new upper/lower device lists structure -
adj_list, which contains only the neighbours. It works always in
pair with the all_adj_list structure (renamed from upper/lower_dev_list),
i.e. both of them contain the same links, only that all_adj_list contains
also non-neighbour device links. It's really a small change visible,
currently, only for __netdev_adjacent_dev_insert/remove(), and doesn't
change the main linked logic at all.
Also, add some comments and fix a name collision in
netdev_for_each_upper_dev_rcu() and rework the naming by the following
rules:
netdev_(all_)(upper|lower)_*
If "all_" is present, then we work with the whole list of upper/lower
devices, otherwise - only with direct neighbours. Uninline functions - to
get better stack traces.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-25 07:20:07 +00:00
|
|
|
struct list_head *adj_list)
|
2013-01-03 22:48:49 +00:00
|
|
|
{
|
net: add lower_dev_list to net_device and make a full mesh
This patch adds lower_dev_list list_head to net_device, which is the same
as upper_dev_list, only for lower devices, and begins to use it in the same
way as the upper list.
It also changes the way the whole adjacent device lists work - now they
contain *all* of upper/lower devices, not only the first level. The first
level devices are distinguished by the bool neighbour field in
netdev_adjacent, also added by this patch.
There are cases when a device can be added several times to the adjacent
list, the simplest would be:
/---- eth0.10 ---\
eth0- --- bond0
\---- eth0.20 ---/
where both bond0 and eth0 'see' each other in the adjacent lists two times.
To avoid duplication of netdev_adjacent structures ref_nr is being kept as
the number of times the device was added to the list.
The 'full view' is achieved by adding, on link creation, all of the
upper_dev's upper_dev_list devices as upper devices to all of the
lower_dev's lower_dev_list devices (and to the lower_dev itself), and vice
versa. On unlink they are removed using the same logic.
I've tested it with thousands of vlans/bonds/bridges; everything works ok, with
no observable lags even on a huge number of interfaces.
Memory footprint for 128 devices interconnected with each other via both
upper and lower (which is impossible, but for the comparison) lists would be:
128*128*2*sizeof(netdev_adjacent) = 1.5MB
but in the real world we usually have at most several devices with slaves
and a lot of vlans, so the footprint will be much lower.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-28 21:25:05 +00:00
|
|
|
struct netdev_adjacent *adj;
|
|
|
|
|
2013-09-25 07:20:07 +00:00
|
|
|
list_for_each_entry(adj, adj_list, list) {
|
2013-08-28 21:25:05 +00:00
|
|
|
if (adj->dev == adj_dev)
|
|
|
|
return adj;
|
2013-01-03 22:48:49 +00:00
|
|
|
}
|
|
|
|
return NULL;
|
|
|
|
}
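__netdev_find_adj() above is a plain linear search over a circular doubly linked list. A minimal userspace sketch of the same search follows, assuming stand-in definitions for list_head and container_of; the helpers list_init, list_add_tail, find_adj and find_adj_example are written here for illustration and are not the kernel's implementations.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; names mirror the code above. */
struct list_head { struct list_head *next, *prev; };

struct net_device { const char *name; };

struct netdev_adjacent {
	struct net_device *dev;
	unsigned short ref_nr;
	struct list_head list;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void list_init(struct list_head *h)
{
	h->next = h->prev = h;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Linear search, as in __netdev_find_adj(): NULL when adj_dev is absent. */
static struct netdev_adjacent *find_adj(struct net_device *adj_dev,
					struct list_head *adj_list)
{
	struct list_head *pos;

	for (pos = adj_list->next; pos != adj_list; pos = pos->next) {
		struct netdev_adjacent *adj =
			container_of(pos, struct netdev_adjacent, list);

		if (adj->dev == adj_dev)
			return adj;
	}
	return NULL;
}

/* Returns 1 if a linked device is found and an unlinked one is not. */
static int find_adj_example(void)
{
	struct net_device eth0 = { "eth0" }, eth1 = { "eth1" };
	struct netdev_adjacent a = { &eth0, 1, { NULL, NULL } };
	struct list_head head;

	list_init(&head);
	list_add_tail(&a.list, &head);
	return find_adj(&eth0, &head) == &a && find_adj(&eth1, &head) == NULL;
}
```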
|
|
|
|
|
2016-10-18 02:15:51 +00:00
|
|
|
static int __netdev_has_upper_dev(struct net_device *upper_dev, void *data)
|
|
|
|
{
|
|
|
|
struct net_device *dev = data;
|
|
|
|
|
|
|
|
return upper_dev == dev;
|
|
|
|
}
|
|
|
|
|
2013-01-03 22:48:49 +00:00
|
|
|
/**
|
|
|
|
* netdev_has_upper_dev - Check if device is linked to an upper device
|
|
|
|
* @dev: device
|
|
|
|
* @upper_dev: upper device to check
|
|
|
|
*
|
|
|
|
* Find out if a device is linked to specified upper device and return true
|
|
|
|
* in case it is. Note that this walks the entire upper device chain,
|
|
|
|
* not only the immediate upper device. The caller must hold the RTNL lock.
|
|
|
|
*/
|
|
|
|
bool netdev_has_upper_dev(struct net_device *dev,
|
|
|
|
struct net_device *upper_dev)
|
|
|
|
{
|
|
|
|
ASSERT_RTNL();
|
|
|
|
|
2016-10-18 02:15:51 +00:00
|
|
|
return netdev_walk_all_upper_dev_rcu(dev, __netdev_has_upper_dev,
|
|
|
|
upper_dev);
|
2013-01-03 22:48:49 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_has_upper_dev);
|
|
|
|
|
2016-10-18 02:15:44 +00:00
|
|
|
/**
|
|
|
|
* netdev_has_upper_dev_all_rcu - Check if device is linked to an upper device
|
|
|
|
* @dev: device
|
|
|
|
* @upper_dev: upper device to check
|
|
|
|
*
|
|
|
|
* Find out if a device is linked to specified upper device and return true
|
|
|
|
* in case it is. Note that this checks the entire upper device chain.
|
|
|
|
* The caller must hold the RCU read lock.
|
|
|
|
*/
|
|
|
|
|
|
|
|
bool netdev_has_upper_dev_all_rcu(struct net_device *dev,
|
|
|
|
struct net_device *upper_dev)
|
|
|
|
{
|
|
|
|
return !!netdev_walk_all_upper_dev_rcu(dev, __netdev_has_upper_dev,
|
|
|
|
upper_dev);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_has_upper_dev_all_rcu);
|
|
|
|
|
2013-01-03 22:48:49 +00:00
|
|
|
/**
|
|
|
|
* netdev_has_any_upper_dev - Check if device is linked to some device
|
|
|
|
* @dev: device
|
|
|
|
*
|
|
|
|
* Find out if a device is linked to an upper device and return true in case
|
|
|
|
* it is. The caller must hold the RTNL lock.
|
|
|
|
*/
|
2017-09-01 08:52:31 +00:00
|
|
|
bool netdev_has_any_upper_dev(struct net_device *dev)
|
2013-01-03 22:48:49 +00:00
|
|
|
{
|
|
|
|
ASSERT_RTNL();
|
|
|
|
|
2016-10-18 02:15:51 +00:00
|
|
|
return !list_empty(&dev->adj_list.upper);
|
2013-01-03 22:48:49 +00:00
|
|
|
}
|
2017-09-01 08:52:31 +00:00
|
|
|
EXPORT_SYMBOL(netdev_has_any_upper_dev);
|
2013-01-03 22:48:49 +00:00
|
|
|
|
|
|
|
/**
|
|
|
|
* netdev_master_upper_dev_get - Get master upper device
|
|
|
|
* @dev: device
|
|
|
|
*
|
|
|
|
* Find a master upper device and return pointer to it or NULL in case
|
|
|
|
* it's not there. The caller must hold the RTNL lock.
|
|
|
|
*/
|
|
|
|
struct net_device *netdev_master_upper_dev_get(struct net_device *dev)
|
|
|
|
{
|
2013-08-28 21:25:04 +00:00
|
|
|
struct netdev_adjacent *upper;
|
2013-01-03 22:48:49 +00:00
|
|
|
|
|
|
|
ASSERT_RTNL();
|
|
|
|
|
2013-09-25 07:20:07 +00:00
|
|
|
if (list_empty(&dev->adj_list.upper))
|
2013-01-03 22:48:49 +00:00
|
|
|
return NULL;
|
|
|
|
|
2013-09-25 07:20:07 +00:00
|
|
|
upper = list_first_entry(&dev->adj_list.upper,
|
2013-08-28 21:25:04 +00:00
|
|
|
struct netdev_adjacent, list);
|
2013-01-03 22:48:49 +00:00
|
|
|
if (likely(upper->master))
|
|
|
|
return upper->dev;
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_master_upper_dev_get);
|
|
|
|
|
2016-10-18 02:15:52 +00:00
|
|
|
/**
|
|
|
|
* netdev_has_any_lower_dev - Check if device is linked to some device
|
|
|
|
* @dev: device
|
|
|
|
*
|
|
|
|
* Find out if a device is linked to a lower device and return true in case
|
|
|
|
* it is. The caller must hold the RTNL lock.
|
|
|
|
*/
|
|
|
|
static bool netdev_has_any_lower_dev(struct net_device *dev)
|
|
|
|
{
|
|
|
|
ASSERT_RTNL();
|
|
|
|
|
|
|
|
return !list_empty(&dev->adj_list.lower);
|
|
|
|
}
|
|
|
|
|
2013-09-25 07:20:23 +00:00
|
|
|
void *netdev_adjacent_get_private(struct list_head *adj_list)
|
|
|
|
{
|
|
|
|
struct netdev_adjacent *adj;
|
|
|
|
|
|
|
|
adj = list_entry(adj_list, struct netdev_adjacent, list);
|
|
|
|
|
|
|
|
return adj->private;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_adjacent_get_private);
|
|
|
|
|
2014-05-16 21:20:38 +00:00
|
|
|
/**
|
|
|
|
* netdev_upper_get_next_dev_rcu - Get the next dev from upper list
|
|
|
|
* @dev: device
|
|
|
|
* @iter: list_head ** of the current position
|
|
|
|
*
|
|
|
|
* Gets the next device from the dev's upper list, starting from iter
|
|
|
|
* position. The caller must hold RCU read lock.
|
|
|
|
*/
|
|
|
|
struct net_device *netdev_upper_get_next_dev_rcu(struct net_device *dev,
|
|
|
|
struct list_head **iter)
|
|
|
|
{
|
|
|
|
struct netdev_adjacent *upper;
|
|
|
|
|
|
|
|
WARN_ON_ONCE(!rcu_read_lock_held() && !lockdep_rtnl_is_held());
|
|
|
|
|
|
|
|
upper = list_entry_rcu((*iter)->next, struct netdev_adjacent, list);
|
|
|
|
|
|
|
|
if (&upper->list == &dev->adj_list.upper)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
*iter = &upper->list;
|
|
|
|
|
|
|
|
return upper->dev;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_upper_get_next_dev_rcu);
|
|
|
|
|
2016-10-18 02:15:44 +00:00
|
|
|
static struct net_device *netdev_next_upper_dev_rcu(struct net_device *dev,
|
|
|
|
struct list_head **iter)
|
|
|
|
{
|
|
|
|
struct netdev_adjacent *upper;
|
|
|
|
|
|
|
|
WARN_ON_ONCE(!rcu_read_lock_held() && !lockdep_rtnl_is_held());
|
|
|
|
|
|
|
|
upper = list_entry_rcu((*iter)->next, struct netdev_adjacent, list);
|
|
|
|
|
|
|
|
if (&upper->list == &dev->adj_list.upper)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
*iter = &upper->list;
|
|
|
|
|
|
|
|
return upper->dev;
|
|
|
|
}
|
|
|
|
|
|
|
|
int netdev_walk_all_upper_dev_rcu(struct net_device *dev,
|
|
|
|
int (*fn)(struct net_device *dev,
|
|
|
|
void *data),
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
struct net_device *udev;
|
|
|
|
struct list_head *iter;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
for (iter = &dev->adj_list.upper,
|
|
|
|
udev = netdev_next_upper_dev_rcu(dev, &iter);
|
|
|
|
udev;
|
|
|
|
udev = netdev_next_upper_dev_rcu(dev, &iter)) {
|
|
|
|
/* first is the upper device itself */
|
|
|
|
ret = fn(udev, data);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
/* then look at all of its upper devices */
|
|
|
|
ret = netdev_walk_all_upper_dev_rcu(udev, fn, data);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(netdev_walk_all_upper_dev_rcu);
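The walker above is a depth-first traversal: it calls fn on each direct upper device, recurses into that device's own upper devices, and stops as soon as fn returns non-zero. A simplified userspace sketch of this pattern follows, assuming a small fixed array of neighbours in place of the RCU list. The names struct dev, walk_all_upper, match_dev, has_upper and walk_example are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_UPPER 4

/* Simplified device: direct upper neighbours kept in a small array. */
struct dev {
	const char *name;
	struct dev *upper[MAX_UPPER];
	int n_upper;
};

typedef int (*walk_fn)(struct dev *d, void *data);

/* Depth-first walk: visit each direct upper device, then its own uppers.
 * A non-zero return from fn stops the walk and is propagated. */
static int walk_all_upper(struct dev *d, walk_fn fn, void *data)
{
	int i, ret;

	for (i = 0; i < d->n_upper; i++) {
		ret = fn(d->upper[i], data);
		if (ret)
			return ret;
		ret = walk_all_upper(d->upper[i], fn, data);
		if (ret)
			return ret;
	}
	return 0;
}

/* Callback in the style of __netdev_has_upper_dev(): match by pointer. */
static int match_dev(struct dev *upper, void *data)
{
	return upper == data;
}

static int has_upper(struct dev *d, struct dev *upper)
{
	return walk_all_upper(d, match_dev, upper) != 0;
}

/* Returns 1 if reachability through the stack behaves as expected. */
static int walk_example(void)
{
	struct dev vlan = { "bond0.10", { NULL }, 0 };
	struct dev bond = { "bond0", { &vlan }, 1 };
	struct dev eth = { "eth0", { &bond }, 1 };

	/* vlan is reachable from eth through bond; eth is not above vlan */
	return has_upper(&eth, &vlan) && has_upper(&eth, &bond) &&
	       !has_upper(&vlan, &eth);
}
```

Because the callback's return value doubles as the stop signal, a membership test needs no extra state: the first match ends the walk.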
|
|
|
|
|
2013-09-25 07:20:12 +00:00
|
|
|
/**
|
|
|
|
* netdev_lower_get_next_private - Get the next ->private from the
|
|
|
|
* lower neighbour list
|
|
|
|
* @dev: device
|
|
|
|
* @iter: list_head ** of the current position
|
|
|
|
*
|
|
|
|
* Gets the next netdev_adjacent->private from the dev's lower neighbour
|
|
|
|
* list, starting from iter position. The caller must either hold the
|
|
|
|
* RTNL lock or its own locking that guarantees that the neighbour lower
|
2015-07-24 03:03:29 +00:00
|
|
|
* list will remain unchanged.
|
2013-09-25 07:20:12 +00:00
|
|
|
*/
|
|
|
|
void *netdev_lower_get_next_private(struct net_device *dev,
|
|
|
|
struct list_head **iter)
|
|
|
|
{
|
|
|
|
struct netdev_adjacent *lower;
|
|
|
|
|
|
|
|
lower = list_entry(*iter, struct netdev_adjacent, list);
|
|
|
|
|
|
|
|
if (&lower->list == &dev->adj_list.lower)
|
|
|
|
return NULL;
|
|
|
|
|
2014-04-07 09:25:12 +00:00
|
|
|
*iter = lower->list.next;
|
2013-09-25 07:20:12 +00:00
|
|
|
|
|
|
|
return lower->private;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_lower_get_next_private);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* netdev_lower_get_next_private_rcu - Get the next ->private from the
|
|
|
|
* lower neighbour list, RCU
|
|
|
|
* variant
|
|
|
|
* @dev: device
|
|
|
|
* @iter: list_head ** of the current position
|
|
|
|
*
|
|
|
|
* Gets the next netdev_adjacent->private from the dev's lower neighbour
|
|
|
|
* list, starting from iter position. The caller must hold RCU read lock.
|
|
|
|
*/
|
|
|
|
void *netdev_lower_get_next_private_rcu(struct net_device *dev,
|
|
|
|
struct list_head **iter)
|
|
|
|
{
|
|
|
|
struct netdev_adjacent *lower;
|
|
|
|
|
|
|
|
WARN_ON_ONCE(!rcu_read_lock_held());
|
|
|
|
|
|
|
|
lower = list_entry_rcu((*iter)->next, struct netdev_adjacent, list);
|
|
|
|
|
|
|
|
if (&lower->list == &dev->adj_list.lower)
|
|
|
|
return NULL;
|
|
|
|
|
2014-04-07 09:25:12 +00:00
|
|
|
*iter = &lower->list;
|
2013-09-25 07:20:12 +00:00
|
|
|
|
|
|
|
return lower->private;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_lower_get_next_private_rcu);
|
|
|
|
|
2014-05-16 21:04:53 +00:00
|
|
|
/**
|
|
|
|
* netdev_lower_get_next - Get the next device from the lower neighbour
|
|
|
|
* list
|
|
|
|
* @dev: device
|
|
|
|
* @iter: list_head ** of the current position
|
|
|
|
*
|
|
|
|
* Gets the next device from the dev's lower neighbour
|
|
|
|
* list, starting from iter position. The caller must hold RTNL lock or
|
|
|
|
* its own locking that guarantees that the neighbour lower
|
2015-07-24 03:03:29 +00:00
|
|
|
* list will remain unchanged.
|
2014-05-16 21:04:53 +00:00
|
|
|
*/
|
|
|
|
void *netdev_lower_get_next(struct net_device *dev, struct list_head **iter)
|
|
|
|
{
|
|
|
|
struct netdev_adjacent *lower;
|
|
|
|
|
2016-02-17 17:00:31 +00:00
|
|
|
lower = list_entry(*iter, struct netdev_adjacent, list);
|
2014-05-16 21:04:53 +00:00
|
|
|
|
|
|
|
if (&lower->list == &dev->adj_list.lower)
|
|
|
|
return NULL;
|
|
|
|
|
2016-02-17 17:00:31 +00:00
|
|
|
*iter = lower->list.next;
|
2014-05-16 21:04:53 +00:00
|
|
|
|
|
|
|
return lower->dev;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_lower_get_next);
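The iterator helpers in this file use two different conventions for *iter. netdev_lower_get_next() and netdev_lower_get_next_private() treat *iter as the entry to return now and advance it afterwards, while the _rcu helpers treat *iter as the previously returned position and look at its ->next. The sketch below contrasts the two under simplified assumptions: struct adj, get_next, get_next_after and iter_example are hypothetical names, and the end-of-list check is done before container_of rather than after it.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

struct adj {
	int id;			/* stands in for ->dev or ->private */
	struct list_head list;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Convention 1: *iter is the entry to return now; it is advanced to
 * the following entry before returning. Start with iter = head->next. */
static struct adj *get_next(struct list_head *head, struct list_head **iter)
{
	struct adj *a;

	if (*iter == head)
		return NULL;
	a = container_of(*iter, struct adj, list);
	*iter = a->list.next;
	return a;
}

/* Convention 2: *iter is the entry returned last time (or the head);
 * the entry after it is returned. Start with iter = head. */
static struct adj *get_next_after(struct list_head *head,
				  struct list_head **iter)
{
	struct adj *a;

	if ((*iter)->next == head)
		return NULL;
	a = container_of((*iter)->next, struct adj, list);
	*iter = &a->list;
	return a;
}

/* Returns 1 if both conventions visit entries 1 then 2. */
static int iter_example(void)
{
	struct list_head head;
	struct adj a = { 1, { NULL, NULL } }, b = { 2, { NULL, NULL } };
	struct list_head *iter;
	struct adj *p;
	int seq1 = 0, seq2 = 0;

	/* head <-> a <-> b <-> head */
	head.next = &a.list;
	a.list.prev = &head;
	a.list.next = &b.list;
	b.list.prev = &a.list;
	b.list.next = &head;
	head.prev = &b.list;

	for (iter = head.next; (p = get_next(&head, &iter)); )
		seq1 = seq1 * 10 + p->id;
	for (iter = &head; (p = get_next_after(&head, &iter)); )
		seq2 = seq2 * 10 + p->id;

	return seq1 == 12 && seq2 == 12;
}
```

Both loops visit the same entries in the same order; only the starting value of iter differs, so mixing an initialiser from one convention with a helper from the other would skip or repeat an entry.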
|
|
|
|
|
2016-10-18 02:15:44 +00:00
|
|
|
static struct net_device *netdev_next_lower_dev(struct net_device *dev,
|
|
|
|
struct list_head **iter)
|
|
|
|
{
|
|
|
|
struct netdev_adjacent *lower;
|
|
|
|
|
2016-10-26 20:21:33 +00:00
|
|
|
lower = list_entry((*iter)->next, struct netdev_adjacent, list);
|
2016-10-18 02:15:44 +00:00
|
|
|
|
|
|
|
if (&lower->list == &dev->adj_list.lower)
|
|
|
|
return NULL;
|
|
|
|
|
2016-10-26 20:21:33 +00:00
|
|
|
*iter = &lower->list;
|
2016-10-18 02:15:44 +00:00
|
|
|
|
|
|
|
return lower->dev;
|
|
|
|
}
|
|
|
|
|
|
|
|
int netdev_walk_all_lower_dev(struct net_device *dev,
|
|
|
|
int (*fn)(struct net_device *dev,
|
|
|
|
void *data),
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
struct net_device *ldev;
|
|
|
|
struct list_head *iter;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
for (iter = &dev->adj_list.lower,
|
|
|
|
ldev = netdev_next_lower_dev(dev, &iter);
|
|
|
|
ldev;
|
|
|
|
ldev = netdev_next_lower_dev(dev, &iter)) {
|
|
|
|
/* first is the lower device itself */
|
|
|
|
ret = fn(ldev, data);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
/* then look at all of its lower devices */
|
|
|
|
ret = netdev_walk_all_lower_dev(ldev, fn, data);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(netdev_walk_all_lower_dev);
|
|
|
|
|
|
|
|
static struct net_device *netdev_next_lower_dev_rcu(struct net_device *dev,
|
|
|
|
struct list_head **iter)
|
|
|
|
{
|
|
|
|
struct netdev_adjacent *lower;
|
|
|
|
|
|
|
|
lower = list_entry_rcu((*iter)->next, struct netdev_adjacent, list);
|
|
|
|
if (&lower->list == &dev->adj_list.lower)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
*iter = &lower->list;
|
|
|
|
|
|
|
|
return lower->dev;
|
|
|
|
}
|
|
|
|
|
|
|
|
int netdev_walk_all_lower_dev_rcu(struct net_device *dev,
|
|
|
|
int (*fn)(struct net_device *dev,
|
|
|
|
void *data),
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
struct net_device *ldev;
|
|
|
|
struct list_head *iter;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
for (iter = &dev->adj_list.lower,
|
|
|
|
ldev = netdev_next_lower_dev_rcu(dev, &iter);
|
|
|
|
ldev;
|
|
|
|
ldev = netdev_next_lower_dev_rcu(dev, &iter)) {
|
|
|
|
/* first is the lower device itself */
|
|
|
|
ret = fn(ldev, data);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
/* then look at all of its lower devices */
|
|
|
|
ret = netdev_walk_all_lower_dev_rcu(ldev, fn, data);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(netdev_walk_all_lower_dev_rcu);
|
|
|
|
|
2013-12-13 02:19:55 +00:00
|
|
|
/**
|
|
|
|
* netdev_lower_get_first_private_rcu - Get the first ->private from the
|
|
|
|
* lower neighbour list, RCU
|
|
|
|
* variant
|
|
|
|
* @dev: device
|
|
|
|
*
|
|
|
|
* Gets the first netdev_adjacent->private from the dev's lower neighbour
|
|
|
|
* list. The caller must hold RCU read lock.
|
|
|
|
*/
|
|
|
|
void *netdev_lower_get_first_private_rcu(struct net_device *dev)
|
|
|
|
{
|
|
|
|
struct netdev_adjacent *lower;
|
|
|
|
|
|
|
|
lower = list_first_or_null_rcu(&dev->adj_list.lower,
|
|
|
|
struct netdev_adjacent, list);
|
|
|
|
if (lower)
|
|
|
|
return lower->private;
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_lower_get_first_private_rcu);
|
|
|
|
|
2013-01-03 22:48:49 +00:00
|
|
|
/**
|
|
|
|
* netdev_master_upper_dev_get_rcu - Get master upper device
|
|
|
|
* @dev: device
|
|
|
|
*
|
|
|
|
* Find a master upper device and return pointer to it or NULL in case
|
|
|
|
* it's not there. The caller must hold the RCU read lock.
|
|
|
|
*/
|
|
|
|
struct net_device *netdev_master_upper_dev_get_rcu(struct net_device *dev)
|
|
|
|
{
|
2013-08-28 21:25:04 +00:00
|
|
|
struct netdev_adjacent *upper;
|
2013-01-03 22:48:49 +00:00
|
|
|
|
2013-09-25 07:20:07 +00:00
|
|
|
upper = list_first_or_null_rcu(&dev->adj_list.upper,
|
2013-08-28 21:25:04 +00:00
|
|
|
struct netdev_adjacent, list);
|
2013-01-03 22:48:49 +00:00
|
|
|
if (upper && likely(upper->master))
|
|
|
|
return upper->dev;
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_master_upper_dev_get_rcu);
|
|
|
|
|
2014-02-09 14:56:25 +00:00
|
|
|
static int netdev_adjacent_sysfs_add(struct net_device *dev,
|
2014-01-14 20:58:50 +00:00
|
|
|
struct net_device *adj_dev,
|
|
|
|
struct list_head *dev_list)
|
|
|
|
{
|
|
|
|
char linkname[IFNAMSIZ+7];
|
2017-02-09 06:56:07 +00:00
|
|
|
|
2014-01-14 20:58:50 +00:00
|
|
|
sprintf(linkname, dev_list == &dev->adj_list.upper ?
|
|
|
|
"upper_%s" : "lower_%s", adj_dev->name);
|
|
|
|
return sysfs_create_link(&(dev->dev.kobj), &(adj_dev->dev.kobj),
|
|
|
|
linkname);
|
|
|
|
}
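The linkname buffer above is sized IFNAMSIZ+7: the "upper_" or "lower_" prefix takes 6 bytes, an interface name takes at most IFNAMSIZ - 1 bytes, and one byte holds the terminating NUL, so IFNAMSIZ + 6 bytes are actually needed. A userspace sketch of the same formatting with an explicit bound check follows. make_linkname and linkname_example are hypothetical helpers, and the use of snprintf is a choice made for this sketch, not what the kernel code does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define IFNAMSIZ 16	/* matches the Linux UAPI value */

/* Build "upper_<name>" or "lower_<name>". Returns 0 on success,
 * -1 if the result would not fit in buf. */
static int make_linkname(char *buf, size_t len, int is_upper,
			 const char *name)
{
	int n;

	if (is_upper)
		n = snprintf(buf, len, "upper_%s", name);
	else
		n = snprintf(buf, len, "lower_%s", name);

	return (n >= 0 && (size_t)n < len) ? 0 : -1;
}

/* Returns 1 if a short name and a maximum-length name both fit. */
static int linkname_example(void)
{
	char linkname[IFNAMSIZ + 7];
	const char *longest = "abcdefghijklmno";	/* 15 characters */

	if (make_linkname(linkname, sizeof(linkname), 1, "bond0"))
		return 0;
	if (strcmp(linkname, "upper_bond0"))
		return 0;
	if (make_linkname(linkname, sizeof(linkname), 0, longest))
		return 0;
	return strlen(linkname) == 6 + IFNAMSIZ - 1;
}
```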
|
2014-02-09 14:56:25 +00:00
|
|
|
static void netdev_adjacent_sysfs_del(struct net_device *dev,
|
2014-01-14 20:58:50 +00:00
|
|
|
char *name,
|
|
|
|
struct list_head *dev_list)
|
|
|
|
{
|
|
|
|
char linkname[IFNAMSIZ+7];
|
2017-02-09 06:56:07 +00:00
|
|
|
|
2014-01-14 20:58:50 +00:00
|
|
|
sprintf(linkname, dev_list == &dev->adj_list.upper ?
|
|
|
|
"upper_%s" : "lower_%s", name);
|
|
|
|
sysfs_remove_link(&(dev->dev.kobj), linkname);
|
|
|
|
}
|
|
|
|
|
2014-09-15 10:22:35 +00:00
|
|
|
static inline bool netdev_adjacent_is_neigh_list(struct net_device *dev,
|
|
|
|
struct net_device *adj_dev,
|
|
|
|
struct list_head *dev_list)
|
|
|
|
{
|
|
|
|
return (dev_list == &dev->adj_list.upper ||
|
|
|
|
dev_list == &dev->adj_list.lower) &&
|
|
|
|
net_eq(dev_net(dev), dev_net(adj_dev));
|
|
|
|
}
|
2014-01-14 20:58:50 +00:00
|
|
|
|
2013-08-28 21:25:05 +00:00
|
|
|
static int __netdev_adjacent_dev_insert(struct net_device *dev,
|
|
|
|
struct net_device *adj_dev,
|
2013-09-25 07:20:06 +00:00
|
|
|
struct list_head *dev_list,
|
2013-09-25 07:20:09 +00:00
|
|
|
void *private, bool master)
|
net: add lower_dev_list to net_device and make a full mesh
This patch adds lower_dev_list list_head to net_device, which is the same
as upper_dev_list, only for lower devices, and begins to use it in the same
way as the upper list.
It also changes the way the whole adjacent device lists work - now they
contain *all* of upper/lower devices, not only the first level. The first
level devices are distinguished by the bool neighbour field in
netdev_adjacent, also added by this patch.
There are cases when a device can be added several times to the adjacent
list, the simplest would be:
/---- eth0.10 ---\
eth0- --- bond0
\---- eth0.20 ---/
where both bond0 and eth0 'see' each other in the adjacent lists two times.
To avoid duplication of netdev_adjacent structures ref_nr is being kept as
the number of times the device was added to the list.
The 'full view' is achieved by adding, on link creation, all of the
upper_dev's upper_dev_list devices as upper devices to all of the
lower_dev's lower_dev_list devices (and to the lower_dev itself), and vice
versa. On unlink they are removed using the same logic.
I've tested it with thousands of vlans/bonds/bridges; everything works OK, with
no observable lags even on a huge number of interfaces.
Memory footprint for 128 devices interconnected with each other via both
upper and lower (which is impossible, but for the comparison) lists would be:
128*128*2*sizeof(netdev_adjacent) = 1.5MB
but in the real world we usually have at most several devices with slaves
and a lot of vlans, so the footprint will be much lower.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
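The footprint estimate in the message above can be checked with a short
calculation. This is only a sketch: the 48-byte entry size is an assumption
inferred from the quoted 1.5MB total (1.5MB / (128*128*2)), not a measured
sizeof(struct netdev_adjacent), and adj_mesh_footprint() is a hypothetical
helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed entry size in bytes, implied by the 1.5MB figure; not measured. */
#define ASSUMED_ADJ_ENTRY_SIZE 48u

/*
 * Worst-case footprint when every device lists every device in both its
 * upper and its lower list: ndevs * ndevs * 2 entries.
 */
static size_t adj_mesh_footprint(size_t ndevs, size_t entry_size)
{
	return ndevs * ndevs * 2 * entry_size;
}
```

With 128 devices and the assumed entry size this gives 1572864 bytes, i.e.
1.5 MiB, which matches the number quoted in the message.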
2013-08-28 21:25:05 +00:00
|
|
|
{
|
|
|
|
struct netdev_adjacent *adj;
|
2013-09-25 07:20:31 +00:00
|
|
|
int ret;
|
2013-08-28 21:25:05 +00:00
|
|
|
|
2015-09-24 08:59:05 +00:00
|
|
|
adj = __netdev_find_adj(adj_dev, dev_list);
|
2013-08-28 21:25:05 +00:00
|
|
|
|
|
|
|
if (adj) {
|
2016-10-18 02:15:43 +00:00
|
|
|
adj->ref_nr += 1;
|
2016-10-18 02:15:53 +00:00
|
|
|
pr_debug("Insert adjacency: dev %s adj_dev %s adj->ref_nr %d\n",
|
|
|
|
dev->name, adj_dev->name, adj->ref_nr);
|
|
|
|
|
2013-08-28 21:25:05 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
adj = kmalloc(sizeof(*adj), GFP_KERNEL);
|
|
|
|
if (!adj)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
adj->dev = adj_dev;
|
|
|
|
adj->master = master;
|
2016-10-18 02:15:43 +00:00
|
|
|
adj->ref_nr = 1;
|
2013-09-25 07:20:09 +00:00
|
|
|
adj->private = private;
|
2013-08-28 21:25:05 +00:00
|
|
|
dev_hold(adj_dev);
|
net: add adj_list to save only neighbours
Currently, we distinguish neighbours (first-level linked devices) from
non-neighbours by the neighbour bool in the netdev_adjacent. This could be
quite time-consuming in case we would like to traverse *only* through
neighbours - cause we'd have to traverse through all devices and check for
this flag, and in a (quite common) scenario where we have lots of vlans on
top of bridge, which is on top of a bond - the bonding would have to go
through all those vlans to get its upper neighbour linked devices.
This situation is really unpleasant, cause there are already a lot of cases
when a device with slaves needs to go through them in hot path.
To fix this, introduce a new upper/lower device lists structure -
adj_list, which contains only the neighbours. It works always in
pair with the all_adj_list structure (renamed from upper/lower_dev_list),
i.e. both of them contain the same links, only that all_adj_list contains
also non-neighbour device links. It's really a small change visible,
currently, only for __netdev_adjacent_dev_insert/remove(), and doesn't
change the main linked logic at all.
Also, add some comments and fix a name collision in
netdev_for_each_upper_dev_rcu() and rework the naming by the following
rules:
netdev_(all_)(upper|lower)_*
If "all_" is present, then we work with the whole list of upper/lower
devices, otherwise - only with direct neighbours. Uninline functions - to
get better stack traces.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-25 07:20:07 +00:00
|
|
|
|
2016-10-18 02:15:53 +00:00
|
|
|
pr_debug("Insert adjacency: dev %s adj_dev %s adj->ref_nr %d; dev_hold on %s\n",
|
|
|
|
dev->name, adj_dev->name, adj->ref_nr, adj_dev->name);
|
2013-08-28 21:25:05 +00:00
|
|
|
|
2014-09-15 10:22:35 +00:00
|
|
|
if (netdev_adjacent_is_neigh_list(dev, adj_dev, dev_list)) {
|
2014-01-14 20:58:50 +00:00
|
|
|
ret = netdev_adjacent_sysfs_add(dev, adj_dev, dev_list);
|
2013-09-25 07:20:32 +00:00
|
|
|
if (ret)
|
|
|
|
goto free_adj;
|
|
|
|
}
|
|
|
|
|
2013-09-25 07:20:06 +00:00
|
|
|
/* Ensure that master link is always the first item in list. */
|
2013-09-25 07:20:31 +00:00
|
|
|
if (master) {
|
|
|
|
ret = sysfs_create_link(&(dev->dev.kobj),
|
|
|
|
&(adj_dev->dev.kobj), "master");
|
|
|
|
if (ret)
|
2013-09-25 07:20:32 +00:00
|
|
|
goto remove_symlinks;
|
2013-09-25 07:20:31 +00:00
|
|
|
|
2013-09-25 07:20:06 +00:00
|
|
|
list_add_rcu(&adj->list, dev_list);
|
2013-09-25 07:20:31 +00:00
|
|
|
} else {
|
2013-09-25 07:20:06 +00:00
|
|
|
list_add_tail_rcu(&adj->list, dev_list);
|
2013-09-25 07:20:31 +00:00
|
|
|
}
|
2013-08-28 21:25:05 +00:00
|
|
|
|
|
|
|
return 0;
|
2013-09-25 07:20:31 +00:00
|
|
|
|
2013-09-25 07:20:32 +00:00
|
|
|
remove_symlinks:
|
2014-09-15 10:22:35 +00:00
|
|
|
if (netdev_adjacent_is_neigh_list(dev, adj_dev, dev_list))
|
2014-01-14 20:58:50 +00:00
|
|
|
netdev_adjacent_sysfs_del(dev, adj_dev->name, dev_list);
|
2013-09-25 07:20:31 +00:00
|
|
|
free_adj:
|
|
|
|
kfree(adj);
|
2013-10-23 13:28:56 +00:00
|
|
|
dev_put(adj_dev);
|
2013-09-25 07:20:31 +00:00
|
|
|
|
|
|
|
return ret;
|
2013-08-28 21:25:05 +00:00
|
|
|
}
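The core bookkeeping of the insert path above (bump ref_nr when the adjacency
already exists, otherwise allocate a new entry and keep a master link at the
front of the list) can be sketched in user space as follows. This is a
simplified illustration, not the kernel code: struct adj_entry, struct
adj_list, adj_find() and adj_insert() are hypothetical stand-ins, a plain
singly linked list replaces the RCU-protected list_head, and the dev_hold()
and sysfs steps are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct netdev_adjacent. */
struct adj_entry {
	const char *dev_name;	/* stands in for adj->dev */
	bool master;		/* true if this is the master link */
	uint16_t ref_nr;	/* how many times this adjacency was added */
	struct adj_entry *next;
};

/* Hypothetical list head; the kernel uses a struct list_head under RCU. */
struct adj_list {
	struct adj_entry *head;
	struct adj_entry *tail;
};

/* Look up an existing adjacency by device name. */
static struct adj_entry *adj_find(struct adj_list *list, const char *dev_name)
{
	struct adj_entry *adj;

	for (adj = list->head; adj; adj = adj->next)
		if (strcmp(adj->dev_name, dev_name) == 0)
			return adj;
	return NULL;
}

/* Add an adjacency, or take another reference if it is already listed. */
static int adj_insert(struct adj_list *list, const char *dev_name, bool master)
{
	struct adj_entry *adj = adj_find(list, dev_name);

	if (adj) {
		/* Already present: count the extra path, do not duplicate. */
		adj->ref_nr += 1;
		return 0;
	}

	adj = malloc(sizeof(*adj));
	if (!adj)
		return -ENOMEM;

	adj->dev_name = dev_name;
	adj->master = master;
	adj->ref_nr = 1;
	adj->next = NULL;

	if (master) {
		/* Master link always goes first in the list. */
		adj->next = list->head;
		list->head = adj;
		if (!list->tail)
			list->tail = adj;
	} else {
		/* Everything else is appended at the tail. */
		if (list->tail)
			list->tail->next = adj;
		else
			list->head = adj;
		list->tail = adj;
	}
	return 0;
}
```

Inserting "eth0.10", then "bond0" as master, then "eth0.10" again leaves
"bond0" at the head of the list and a single "eth0.10" entry with ref_nr 2.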
|
|
|
|
|
2013-12-29 22:01:29 +00:00
|
|
|
static void __netdev_adjacent_dev_remove(struct net_device *dev,
|
|
|
|
struct net_device *adj_dev,
|
2016-10-03 19:43:02 +00:00
|
|
|
u16 ref_nr,
|
2013-12-29 22:01:29 +00:00
|
|
|
struct list_head *dev_list)
|
2013-08-28 21:25:05 +00:00
|
|
|
{
|
|
|
|
struct netdev_adjacent *adj;
|
|
|
|
|
2016-10-18 02:15:53 +00:00
|
|
|
pr_debug("Remove adjacency: dev %s adj_dev %s ref_nr %d\n",
|
|
|
|
dev->name, adj_dev->name, ref_nr);
|
|
|
|
|
2015-09-24 08:59:05 +00:00
|
|
|
adj = __netdev_find_adj(adj_dev, dev_list);
|
2013-08-28 21:25:05 +00:00
|
|
|
|
2013-09-25 07:20:07 +00:00
|
|
|
if (!adj) {
|
2016-10-18 02:15:53 +00:00
|
|
|
pr_err("Adjacency does not exist for device %s from %s\n",
|
2013-09-25 07:20:07 +00:00
|
|
|
dev->name, adj_dev->name);
|
2016-10-18 02:15:53 +00:00
|
|
|
WARN_ON(1);
|
|
|
|
return;
|
2013-09-25 07:20:07 +00:00
|
|
|
}
|
2013-08-28 21:25:05 +00:00
|
|
|
|
2016-10-03 19:43:02 +00:00
|
|
|
if (adj->ref_nr > ref_nr) {
|
2016-10-18 02:15:53 +00:00
|
|
|
pr_debug("adjacency: %s to %s ref_nr - %d = %d\n",
|
|
|
|
dev->name, adj_dev->name, ref_nr,
|
|
|
|
adj->ref_nr - ref_nr);
|
2016-10-03 19:43:02 +00:00
|
|
|
adj->ref_nr -= ref_nr;
|
net: add lower_dev_list to net_device and make a full mesh
This patch adds lower_dev_list list_head to net_device, which is the same
as upper_dev_list, only for lower devices, and begins to use it in the same
way as the upper list.
It also changes the way the whole adjacent device lists work - now they
contain *all* of upper/lower devices, not only the first level. The first
level devices are distinguished by the bool neighbour field in
netdev_adjacent, also added by this patch.
There are cases when a device can be added several times to the adjacent
list, the simplest would be:
/---- eth0.10 ---\
eth0- --- bond0
\---- eth0.20 ---/
where both bond0 and eth0 'see' each other in the adjacent lists two times.
To avoid duplication of netdev_adjacent structures ref_nr is being kept as
the number of times the device was added to the list.
The 'full view' is achieved by adding, on link creation, all of the
upper_dev's upper_dev_list devices as upper devices to all of the
lower_dev's lower_dev_list devices (and to the lower_dev itself), and vice
versa. On unlink they are removed using the same logic.
I've tested it with thousands of vlans/bonds/bridges; everything works and
there are no observable lags even with a huge number of interfaces.
Memory footprint for 128 devices interconnected with each other via both
upper and lower (which is impossible, but for the comparison) lists would be:
128*128*2*sizeof(netdev_adjacent) = 1.5MB
but in the real world we usually have at most several devices with slaves
and a lot of vlans, so the footprint will be much lower.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
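The ref_nr bookkeeping described in the commit message above can be modelled in user space. The following is a minimal sketch, not the kernel implementation: `struct adj_entry` and the `adj_*` helpers are hypothetical stand-ins for `struct netdev_adjacent` and the `__netdev_adjacent_dev_insert/remove()` pair, and a plain singly linked list replaces the kernel's RCU-protected `list_head`.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simplified user-space stand-in for struct netdev_adjacent. */
struct adj_entry {
	char name[16];
	unsigned int ref_nr;	/* how many times this device was added */
	struct adj_entry *next;
};

static struct adj_entry *adj_find(struct adj_entry *head, const char *name)
{
	for (; head; head = head->next)
		if (strcmp(head->name, name) == 0)
			return head;
	return NULL;
}

/*
 * Add a device: bump ref_nr if it is already listed, otherwise allocate
 * a new entry. Returns 0 on success, -1 on allocation failure.
 */
static int adj_insert(struct adj_entry **head, const char *name)
{
	struct adj_entry *adj = adj_find(*head, name);

	if (adj) {
		adj->ref_nr++;
		return 0;
	}
	adj = calloc(1, sizeof(*adj));
	if (!adj)
		return -1;
	snprintf(adj->name, sizeof(adj->name), "%s", name);
	adj->ref_nr = 1;
	adj->next = *head;
	*head = adj;
	return 0;
}

/* Drop ref_nr references; unlink and free the entry once none remain. */
static void adj_remove(struct adj_entry **head, const char *name,
		       unsigned int ref_nr)
{
	struct adj_entry **pp;

	for (pp = head; *pp; pp = &(*pp)->next) {
		struct adj_entry *adj = *pp;

		if (strcmp(adj->name, name) != 0)
			continue;
		if (adj->ref_nr > ref_nr) {
			adj->ref_nr -= ref_nr;
			return;
		}
		*pp = adj->next;
		free(adj);
		return;
	}
}
```

In the eth0/bond0 diagram, inserting "bond0" into eth0's upper list twice leaves a single entry with `ref_nr == 2`; the entry is freed only after both references have been removed.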
2013-08-28 21:25:05 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2013-09-25 07:20:31 +00:00
|
|
|
if (adj->master)
|
|
|
|
sysfs_remove_link(&(dev->dev.kobj), "master");
|
|
|
|
|
2014-09-15 10:22:35 +00:00
|
|
|
if (netdev_adjacent_is_neigh_list(dev, adj_dev, dev_list))
|
2014-01-14 20:58:50 +00:00
|
|
|
netdev_adjacent_sysfs_del(dev, adj_dev->name, dev_list);
|
2013-09-25 07:20:32 +00:00
|
|
|
|
2013-08-28 21:25:05 +00:00
|
|
|
list_del_rcu(&adj->list);
|
2016-10-18 02:15:53 +00:00
|
|
|
pr_debug("adjacency: dev_put for %s, because link removed from %s to %s\n",
|
net: add adj_list to save only neighbours
Currently, we distinguish neighbours (first-level linked devices) from
non-neighbours by the neighbour bool in the netdev_adjacent. This could be
quite time-consuming in case we would like to traverse *only* through
neighbours - cause we'd have to traverse through all devices and check for
this flag, and in a (quite common) scenario where we have lots of vlans on
top of bridge, which is on top of a bond - the bonding would have to go
through all those vlans to get its upper neighbour linked devices.
This situation is really unpleasant, cause there are already a lot of cases
when a device with slaves needs to go through them in hot path.
To fix this, introduce a new upper/lower device lists structure -
adj_list, which contains only the neighbours. It works always in
pair with the all_adj_list structure (renamed from upper/lower_dev_list),
i.e. both of them contain the same links, only that all_adj_list contains
also non-neighbour device links. It's really a small change visible,
currently, only for __netdev_adjacent_dev_insert/remove(), and doesn't
change the main linked logic at all.
Also, add some comments and fix a name collision in
netdev_for_each_upper_dev_rcu() and rework the naming by the following
rules:
netdev_(all_)(upper|lower)_*
If "all_" is present, then we work with the whole list of upper/lower
devices, otherwise - only with direct neighbours. Uninline functions - to
get better stack traces.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
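The split between adj_list and all_adj_list described in the commit message above can be illustrated with a small user-space model. This is a sketch under stated assumptions: `struct dev_lists`, `link_add()` and `MAX_LINKS` are hypothetical names, and fixed-size arrays stand in for the kernel's linked lists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LINKS 8	/* arbitrary capacity for this sketch */

/* Hypothetical model of one direction (upper or lower) of a device. */
struct dev_lists {
	const char *adj[MAX_LINKS];	/* direct neighbours only */
	size_t n_adj;
	const char *all_adj[MAX_LINKS];	/* neighbours and everything beyond */
	size_t n_all;
};

/*
 * Record a link. Every linked device goes into all_adj; only first-level
 * devices are also stored in adj, so walking neighbours no longer means
 * filtering the complete list by a flag.
 * Returns 0 on success, -1 if the lists are full.
 */
static int link_add(struct dev_lists *d, const char *name, bool neighbour)
{
	if (d->n_all == MAX_LINKS)
		return -1;
	d->all_adj[d->n_all++] = name;
	if (neighbour)
		d->adj[d->n_adj++] = name;
	return 0;
}
```

For a bond below a bridge that carries three vlans, the bond's upper `adj` array holds one entry (the bridge) while `all_adj` holds four, so a neighbour-only walk touches one element instead of four.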
2013-09-25 07:20:07 +00:00
|
|
|
adj_dev->name, dev->name, adj_dev->name);
|
2013-08-28 21:25:05 +00:00
|
|
|
dev_put(adj_dev);
|
|
|
|
kfree_rcu(adj, rcu);
|
|
|
|
}
|
|
|
|
|
2013-12-29 22:01:29 +00:00
|
|
|
static int __netdev_adjacent_dev_link_lists(struct net_device *dev,
|
|
|
|
struct net_device *upper_dev,
|
|
|
|
struct list_head *up_list,
|
|
|
|
struct list_head *down_list,
|
|
|
|
void *private, bool master)
|
2013-08-28 21:25:05 +00:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
2016-10-18 02:15:43 +00:00
|
|
|
ret = __netdev_adjacent_dev_insert(dev, upper_dev, up_list,
|
2016-10-03 19:43:02 +00:00
|
|
|
private, master);
|
2013-08-28 21:25:05 +00:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2016-10-18 02:15:43 +00:00
|
|
|
ret = __netdev_adjacent_dev_insert(upper_dev, dev, down_list,
|
2016-10-03 19:43:02 +00:00
|
|
|
private, false);
|
2013-08-28 21:25:05 +00:00
|
|
|
if (ret) {
|
2016-10-18 02:15:43 +00:00
|
|
|
__netdev_adjacent_dev_remove(dev, upper_dev, 1, up_list);
|
2013-08-28 21:25:05 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
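The insert-then-roll-back pattern used by __netdev_adjacent_dev_link_lists() can be sketched in user space as follows. The types and helpers here (`struct adj_list`, `list_insert()`, `list_remove()`, `link_lists()`, `LIST_CAP`) are hypothetical simplifications: a fixed-capacity array replaces the kernel list, and running out of capacity stands in for an allocation failure.

```c
#include <assert.h>
#include <errno.h>

#define LIST_CAP 4	/* arbitrary capacity for this sketch */

struct adj_list {
	const void *entries[LIST_CAP];
	int count;
};

/* Returns 0 on success or -ENOMEM when the list is full. */
static int list_insert(struct adj_list *l, const void *dev)
{
	if (l->count == LIST_CAP)
		return -ENOMEM;
	l->entries[l->count++] = dev;
	return 0;
}

/* Remove dev if present by moving the last entry into its slot. */
static void list_remove(struct adj_list *l, const void *dev)
{
	int i;

	for (i = 0; i < l->count; i++) {
		if (l->entries[i] == dev) {
			l->entries[i] = l->entries[--l->count];
			return;
		}
	}
}

/*
 * Link dev and upper in both directions. If the second insert fails,
 * the first one is undone so neither list keeps a half-made link.
 */
static int link_lists(const void *dev, const void *upper,
		      struct adj_list *up_list, struct adj_list *down_list)
{
	int ret;

	ret = list_insert(up_list, upper);
	if (ret)
		return ret;

	ret = list_insert(down_list, dev);
	if (ret) {
		list_remove(up_list, upper);
		return ret;
	}

	return 0;
}
```

Undoing the first insert on failure keeps the two lists symmetric: either both directions of the link exist, or neither does.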
|
|
|
|
|
2013-12-29 22:01:29 +00:00
|
|
|
static void __netdev_adjacent_dev_unlink_lists(struct net_device *dev,
|
|
|
|
struct net_device *upper_dev,
|
2016-10-03 19:43:02 +00:00
|
|
|
u16 ref_nr,
|
2013-12-29 22:01:29 +00:00
|
|
|
struct list_head *up_list,
|
|
|
|
struct list_head *down_list)
|
2013-08-28 21:25:05 +00:00
|
|
|
{
|
2016-10-03 19:43:02 +00:00
|
|
|
__netdev_adjacent_dev_remove(dev, upper_dev, ref_nr, up_list);
|
|
|
|
__netdev_adjacent_dev_remove(upper_dev, dev, ref_nr, down_list);
|
2013-08-28 21:25:05 +00:00
|
|
|
}
|
|
|
|
|
2013-12-29 22:01:29 +00:00
|
|
|
static int __netdev_adjacent_dev_link_neighbour(struct net_device *dev,
|
|
|
|
struct net_device *upper_dev,
|
|
|
|
void *private, bool master)
|
2013-09-25 07:20:07 +00:00
|
|
|
{
|
2016-10-18 02:15:51 +00:00
|
|
|
return __netdev_adjacent_dev_link_lists(dev, upper_dev,
|
|
|
|
&dev->adj_list.upper,
|
|
|
|
&upper_dev->adj_list.lower,
|
|
|
|
private, master);
|
2013-08-28 21:25:05 +00:00
|
|
|
}
|
|
|
|
|
2013-12-29 22:01:29 +00:00
|
|
|
static void __netdev_adjacent_dev_unlink_neighbour(struct net_device *dev,
|
|
|
|
struct net_device *upper_dev)
|
2013-09-25 07:20:07 +00:00
|
|
|
{
|
2016-10-03 19:43:02 +00:00
|
|
|
__netdev_adjacent_dev_unlink_lists(dev, upper_dev, 1,
|
2013-09-25 07:20:07 +00:00
|
|
|
&dev->adj_list.upper,
|
|
|
|
&upper_dev->adj_list.lower);
|
|
|
|
}
|
2013-08-28 21:25:05 +00:00
|
|
|
|
2013-01-03 22:48:49 +00:00
|
|
|
static int __netdev_upper_dev_link(struct net_device *dev,
|
net: add netdev_adjacent->private and allow to use it
Currently, even though we can access any linked device, we can't attach
anything to it, which is vital to properly manage them.
To fix this, add a new void *private to netdev_adjacent and functions
setting/getting it (per link), so that we can save, for example, bonding's
slave structures there, per slave device.
netdev_master_upper_dev_link_private(dev, upper_dev, private) links dev to
upper dev and populates the neighbour link only with private.
netdev_lower_dev_get_private{,_rcu}() returns the private, if found.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
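The idea of a per-link private pointer described in the commit message above can be sketched in user space. `struct lower_link` and `lower_get_private()` are hypothetical names chosen for this illustration; the kernel function named in the message, netdev_lower_dev_get_private(), operates on net_device structures and adjacency lists, not on the array used here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one link from an upper device to a lower device. */
struct lower_link {
	const char *lower_name;
	void *private;	/* owner-defined data attached to this link */
};

/*
 * Return the private pointer stored on the link to lower_name, or NULL
 * if the upper device has no such lower device.
 */
static void *lower_get_private(const struct lower_link *links, size_t n,
			       const char *lower_name)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(links[i].lower_name, lower_name) == 0)
			return links[i].private;
	return NULL;
}
```

A bonding-style owner could store a pointer to its per-slave structure in `private` when the link is created and fetch it back later by looking up the slave device.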
2013-09-25 07:20:09 +00:00
|
|
|
struct net_device *upper_dev, bool master,
|
2015-12-03 11:12:11 +00:00
|
|
|
void *upper_priv, void *upper_info)
|
2013-01-03 22:48:49 +00:00
|
|
|
{
|
2015-08-27 07:31:18 +00:00
|
|
|
struct netdev_notifier_changeupper_info changeupper_info;
|
net: add lower_dev_list to net_device and make a full mesh
This patch adds lower_dev_list list_head to net_device, which is the same
as upper_dev_list, only for lower devices, and begins to use it in the same
way as the upper list.
It also changes the way the whole adjacent device lists work - now they
contain *all* of upper/lower devices, not only the first level. The first
level devices are distinguished by the bool neighbour field in
netdev_adjacent, also added by this patch.
There are cases when a device can be added several times to the adjacent
list, the simplest would be:
     /---- eth0.10 ---\
eth0-                   --- bond0
     \---- eth0.20 ---/
where both bond0 and eth0 'see' each other in the adjacent lists two times.
To avoid duplication of netdev_adjacent structures ref_nr is being kept as
the number of times the device was added to the list.
The 'full view' is achieved by adding, on link creation, all of the
upper_dev's upper_dev_list devices as upper devices to all of the
lower_dev's lower_dev_list devices (and to the lower_dev itself), and vice
versa. On unlink they are removed using the same logic.
I've tested it with thousands of vlans/bonds/bridges; everything works OK, with
no observable lag even on a huge number of interfaces.
Memory footprint for 128 devices interconnected with each other via both
upper and lower lists (which is impossible, but serves as a comparison) would be:
128*128*2*sizeof(netdev_adjacent) = 1.5MB
but in the real world we usually have at most several devices with slaves
and a lot of vlans, so the footprint will be much lower.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-28 21:25:05 +00:00
|
|
|
int ret = 0;
|
2013-01-03 22:48:49 +00:00
|
|
|
|
|
|
|
ASSERT_RTNL();
|
|
|
|
|
|
|
|
if (dev == upper_dev)
|
|
|
|
return -EBUSY;
|
|
|
|
|
|
|
|
/* To prevent loops, check if dev is not upper device to upper_dev. */
|
2016-10-18 02:15:51 +00:00
|
|
|
if (netdev_has_upper_dev(upper_dev, dev))
|
2013-01-03 22:48:49 +00:00
|
|
|
return -EBUSY;
|
|
|
|
|
2016-10-18 02:15:51 +00:00
|
|
|
if (netdev_has_upper_dev(dev, upper_dev))
|
2013-01-03 22:48:49 +00:00
|
|
|
return -EEXIST;
|
|
|
|
|
|
|
|
if (master && netdev_master_upper_dev_get(dev))
|
|
|
|
return -EBUSY;
|
|
|
|
|
2015-08-27 07:31:18 +00:00
|
|
|
changeupper_info.upper_dev = upper_dev;
|
|
|
|
changeupper_info.master = master;
|
|
|
|
changeupper_info.linking = true;
|
2015-12-03 11:12:11 +00:00
|
|
|
changeupper_info.upper_info = upper_info;
|
2015-08-27 07:31:18 +00:00
|
|
|
|
2015-10-16 12:01:22 +00:00
|
|
|
ret = call_netdevice_notifiers_info(NETDEV_PRECHANGEUPPER, dev,
|
|
|
|
&changeupper_info.info);
|
|
|
|
ret = notifier_to_errno(ret);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2015-12-03 11:12:10 +00:00
|
|
|
ret = __netdev_adjacent_dev_link_neighbour(dev, upper_dev, upper_priv,
|
2013-09-25 07:20:09 +00:00
|
|
|
master);
|
2013-08-28 21:25:05 +00:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
2013-01-03 22:48:49 +00:00
|
|
|
|
2015-12-03 11:12:03 +00:00
|
|
|
ret = call_netdevice_notifiers_info(NETDEV_CHANGEUPPER, dev,
|
|
|
|
&changeupper_info.info);
|
|
|
|
ret = notifier_to_errno(ret);
|
|
|
|
if (ret)
|
2016-10-18 02:15:51 +00:00
|
|
|
goto rollback;
|
2015-12-03 11:12:03 +00:00
|
|
|
|
2013-01-03 22:48:49 +00:00
|
|
|
return 0;
|
2013-08-28 21:25:05 +00:00
|
|
|
|
2016-10-18 02:15:51 +00:00
|
|
|
rollback:
|
net: add adj_list to save only neighbours
Currently, we distinguish neighbours (first-level linked devices) from
non-neighbours by the neighbour bool in the netdev_adjacent. This could be
quite time-consuming in case we would like to traverse *only* through
neighbours, because we'd have to traverse through all devices and check for
this flag. In a (quite common) scenario where we have lots of vlans on
top of a bridge, which is on top of a bond, the bonding would have to go
through all those vlans to get its upper neighbour linked devices.
This situation is really unpleasant, because there are already a lot of cases
where a device with slaves needs to go through them in the hot path.
To fix this, introduce a new upper/lower device lists structure -
adj_list, which contains only the neighbours. It works always in
pair with the all_adj_list structure (renamed from upper/lower_dev_list),
i.e. both of them contain the same links, only that all_adj_list contains
also non-neighbour device links. It's really a small change visible,
currently, only for __netdev_adjacent_dev_insert/remove(), and doesn't
change the main linked logic at all.
Also, add some comments and fix a name collision in
netdev_for_each_upper_dev_rcu() and rework the naming by the following
rules:
netdev_(all_)(upper|lower)_*
If "all_" is present, then we work with the whole list of upper/lower
devices, otherwise - only with direct neighbours. Uninline functions - to
get better stack traces.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-25 07:20:07 +00:00
|
|
|
__netdev_adjacent_dev_unlink_neighbour(dev, upper_dev);
|
2013-08-28 21:25:05 +00:00
|
|
|
|
|
|
|
return ret;
|
2013-01-03 22:48:49 +00:00
|
|
|
}
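The order of the sanity checks in __netdev_upper_dev_link() can be illustrated with a small userspace model. This is only a sketch: struct toy_dev, toy_has_upper() and toy_upper_link() are hypothetical names invented for the example, the fixed-size array stands in for the kernel's adjacency lists, and the notifier calls and rollback path are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_UPPER 4

/* Hypothetical stand-in for struct net_device; not the kernel type. */
struct toy_dev {
	struct toy_dev *upper[TOY_MAX_UPPER];	/* direct upper neighbours */
	size_t n_upper;
	struct toy_dev *master;			/* at most one master upper */
};

/* True if 'target' is reachable from 'dev' by following upper links. */
static bool toy_has_upper(const struct toy_dev *dev,
			  const struct toy_dev *target)
{
	for (size_t i = 0; i < dev->n_upper; i++) {
		if (dev->upper[i] == target ||
		    toy_has_upper(dev->upper[i], target))
			return true;
	}
	return false;
}

/* Mirrors the check order of the function above, without notifiers. */
static int toy_upper_link(struct toy_dev *dev, struct toy_dev *upper,
			  bool master)
{
	assert(dev && upper);

	if (dev == upper)
		return -EBUSY;		/* a device cannot be its own upper */
	if (toy_has_upper(upper, dev))
		return -EBUSY;		/* linking would create a loop */
	if (toy_has_upper(dev, upper))
		return -EEXIST;		/* already an upper of dev */
	if (master && dev->master)
		return -EBUSY;		/* only one master is allowed */
	if (dev->n_upper == TOY_MAX_UPPER)
		return -ENOMEM;		/* toy limit, not a kernel rule */

	dev->upper[dev->n_upper++] = upper;
	if (master)
		dev->master = upper;
	return 0;
}
```

With eth0 enslaved to bond0 and bond0 enslaved to br0, an attempt to link br0 under eth0 is rejected as a loop, and a second master for eth0 is refused.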
|
|
|
|
|
|
|
|
/**
|
|
|
|
* netdev_upper_dev_link - Add a link to the upper device
|
|
|
|
* @dev: device
|
|
|
|
* @upper_dev: new upper device
|
|
|
|
*
|
|
|
|
* Adds a link to device which is upper to this one. The caller must hold
|
|
|
|
* the RTNL lock. On a failure a negative errno code is returned.
|
|
|
|
* On success the reference counts are adjusted and the function
|
|
|
|
* returns zero.
|
|
|
|
*/
|
|
|
|
int netdev_upper_dev_link(struct net_device *dev,
|
|
|
|
struct net_device *upper_dev)
|
|
|
|
{
|
2015-12-03 11:12:11 +00:00
|
|
|
return __netdev_upper_dev_link(dev, upper_dev, false, NULL, NULL);
|
2013-01-03 22:48:49 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_upper_dev_link);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* netdev_master_upper_dev_link - Add a master link to the upper device
|
|
|
|
* @dev: device
|
|
|
|
* @upper_dev: new upper device
|
2015-12-03 11:12:10 +00:00
|
|
|
* @upper_priv: upper device private
|
2015-12-03 11:12:11 +00:00
|
|
|
* @upper_info: upper info to be passed down via notifier
|
2013-01-03 22:48:49 +00:00
|
|
|
*
|
|
|
|
* Adds a link to device which is upper to this one. In this case, only
|
|
|
|
* one master upper device can be linked, although other non-master devices
|
|
|
|
* might be linked as well. The caller must hold the RTNL lock.
|
|
|
|
* On a failure a negative errno code is returned. On success the reference
|
|
|
|
* counts are adjusted and the function returns zero.
|
|
|
|
*/
|
|
|
|
int netdev_master_upper_dev_link(struct net_device *dev,
|
2015-12-03 11:12:10 +00:00
|
|
|
struct net_device *upper_dev,
|
2015-12-03 11:12:11 +00:00
|
|
|
void *upper_priv, void *upper_info)
|
2013-01-03 22:48:49 +00:00
|
|
|
{
|
2015-12-03 11:12:11 +00:00
|
|
|
return __netdev_upper_dev_link(dev, upper_dev, true,
|
|
|
|
upper_priv, upper_info);
|
2013-01-03 22:48:49 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_master_upper_dev_link);
|
|
|
|
|
|
|
|
/**
|
|
|
|
* netdev_upper_dev_unlink - Removes a link to upper device
|
|
|
|
* @dev: device
|
|
|
|
* @upper_dev: upper device to unlink
|
|
|
|
*
|
|
|
|
* Removes a link to device which is upper to this one. The caller must hold
|
|
|
|
* the RTNL lock.
|
|
|
|
*/
|
|
|
|
void netdev_upper_dev_unlink(struct net_device *dev,
|
|
|
|
struct net_device *upper_dev)
|
|
|
|
{
|
2015-08-27 07:31:18 +00:00
|
|
|
struct netdev_notifier_changeupper_info changeupper_info;
|
2017-02-09 06:56:07 +00:00
|
|
|
|
2013-01-03 22:48:49 +00:00
|
|
|
ASSERT_RTNL();
|
|
|
|
|
2015-08-27 07:31:18 +00:00
|
|
|
changeupper_info.upper_dev = upper_dev;
|
|
|
|
changeupper_info.master = netdev_master_upper_dev_get(dev) == upper_dev;
|
|
|
|
changeupper_info.linking = false;
|
|
|
|
|
2015-10-16 12:01:22 +00:00
|
|
|
call_netdevice_notifiers_info(NETDEV_PRECHANGEUPPER, dev,
|
|
|
|
&changeupper_info.info);
|
|
|
|
|
2013-09-25 07:20:07 +00:00
|
|
|
__netdev_adjacent_dev_unlink_neighbour(dev, upper_dev);
|
2013-08-28 21:25:05 +00:00
|
|
|
|
2015-08-27 07:31:18 +00:00
|
|
|
call_netdevice_notifiers_info(NETDEV_CHANGEUPPER, dev,
|
|
|
|
&changeupper_info.info);
|
2013-01-03 22:48:49 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_upper_dev_unlink);
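The ref_nr bookkeeping mentioned in the lower_dev_list commit message (one entry per adjacent device, with a count of how many times it was added) could look roughly like the following userspace sketch. All names here (toy_adj, toy_adj_list, toy_adj_insert, toy_adj_remove) are hypothetical, and the array replaces the kernel's list_head; it is meant to show the counting idea only, not the real __netdev_adjacent_dev_insert/remove().

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ADJ 8

/* Hypothetical miniature of struct netdev_adjacent. */
struct toy_adj {
	const void *dev;	/* identity of the adjacent device */
	unsigned int ref_nr;	/* how many times it was added */
};

struct toy_adj_list {
	struct toy_adj entry[TOY_MAX_ADJ];
	size_t len;
};

static struct toy_adj *toy_adj_find(struct toy_adj_list *l, const void *dev)
{
	for (size_t i = 0; i < l->len; i++) {
		if (l->entry[i].dev == dev)
			return &l->entry[i];
	}
	return NULL;
}

/* Add dev, or bump ref_nr if it is already present. */
static int toy_adj_insert(struct toy_adj_list *l, const void *dev)
{
	struct toy_adj *adj = toy_adj_find(l, dev);

	if (adj) {
		adj->ref_nr++;
		return 0;
	}
	if (l->len == TOY_MAX_ADJ)
		return -1;	/* toy limit only */
	l->entry[l->len].dev = dev;
	l->entry[l->len].ref_nr = 1;
	l->len++;
	return 0;
}

/* Drop one reference; the entry disappears when ref_nr reaches zero. */
static void toy_adj_remove(struct toy_adj_list *l, const void *dev)
{
	struct toy_adj *adj = toy_adj_find(l, dev);

	if (!adj)
		return;
	if (--adj->ref_nr > 0)
		return;
	*adj = l->entry[--l->len];	/* move the last entry into the hole */
}
```

In the eth0.10/eth0.20 example, bond0 would be inserted into eth0's list twice and would need two removals before its entry goes away.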
|
|
|
|
|
2015-02-03 14:48:29 +00:00
|
|
|
/**
|
|
|
|
* netdev_bonding_info_change - Dispatch event about slave change
|
|
|
|
* @dev: device
|
2015-02-14 13:26:34 +00:00
|
|
|
* @bonding_info: info to dispatch
|
2015-02-03 14:48:29 +00:00
|
|
|
*
|
|
|
|
* Send NETDEV_BONDING_INFO to netdev notifiers with info.
|
|
|
|
* The caller must hold the RTNL lock.
|
|
|
|
*/
|
|
|
|
void netdev_bonding_info_change(struct net_device *dev,
|
|
|
|
struct netdev_bonding_info *bonding_info)
|
|
|
|
{
|
|
|
|
struct netdev_notifier_bonding_info info;
|
|
|
|
|
|
|
|
memcpy(&info.bonding_info, bonding_info,
|
|
|
|
sizeof(struct netdev_bonding_info));
|
|
|
|
call_netdevice_notifiers_info(NETDEV_BONDING_INFO, dev,
|
|
|
|
&info.info);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_bonding_info_change);
|
|
|
|
|
2015-02-04 21:37:44 +00:00
|
|
|
static void netdev_adjacent_add_links(struct net_device *dev)
|
2014-08-25 12:26:45 +00:00
|
|
|
{
|
|
|
|
struct netdev_adjacent *iter;
|
|
|
|
|
|
|
|
struct net *net = dev_net(dev);
|
|
|
|
|
|
|
|
list_for_each_entry(iter, &dev->adj_list.upper, list) {
|
2016-06-16 13:30:12 +00:00
|
|
|
if (!net_eq(net, dev_net(iter->dev)))
|
2014-08-25 12:26:45 +00:00
|
|
|
continue;
|
|
|
|
netdev_adjacent_sysfs_add(iter->dev, dev,
|
|
|
|
&iter->dev->adj_list.lower);
|
|
|
|
netdev_adjacent_sysfs_add(dev, iter->dev,
|
|
|
|
&dev->adj_list.upper);
|
|
|
|
}
|
|
|
|
|
|
|
|
list_for_each_entry(iter, &dev->adj_list.lower, list) {
|
2016-06-16 13:30:12 +00:00
|
|
|
if (!net_eq(net, dev_net(iter->dev)))
|
2014-08-25 12:26:45 +00:00
|
|
|
continue;
|
|
|
|
netdev_adjacent_sysfs_add(iter->dev, dev,
|
|
|
|
&iter->dev->adj_list.upper);
|
|
|
|
netdev_adjacent_sysfs_add(dev, iter->dev,
|
|
|
|
&dev->adj_list.lower);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-02-04 21:37:44 +00:00
|
|
|
static void netdev_adjacent_del_links(struct net_device *dev)
|
2014-08-25 12:26:45 +00:00
|
|
|
{
|
|
|
|
struct netdev_adjacent *iter;
|
|
|
|
|
|
|
|
struct net *net = dev_net(dev);
|
|
|
|
|
|
|
|
list_for_each_entry(iter, &dev->adj_list.upper, list) {
|
2016-06-16 13:30:12 +00:00
|
|
|
if (!net_eq(net, dev_net(iter->dev)))
|
2014-08-25 12:26:45 +00:00
|
|
|
continue;
|
|
|
|
netdev_adjacent_sysfs_del(iter->dev, dev->name,
|
|
|
|
&iter->dev->adj_list.lower);
|
|
|
|
netdev_adjacent_sysfs_del(dev, iter->dev->name,
|
|
|
|
&dev->adj_list.upper);
|
|
|
|
}
|
|
|
|
|
|
|
|
list_for_each_entry(iter, &dev->adj_list.lower, list) {
|
2016-06-16 13:30:12 +00:00
|
|
|
if (!net_eq(net, dev_net(iter->dev)))
|
2014-08-25 12:26:45 +00:00
|
|
|
continue;
|
|
|
|
netdev_adjacent_sysfs_del(iter->dev, dev->name,
|
|
|
|
&iter->dev->adj_list.upper);
|
|
|
|
netdev_adjacent_sysfs_del(dev, iter->dev->name,
|
|
|
|
&dev->adj_list.lower);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2014-01-14 20:58:51 +00:00
|
|
|
void netdev_adjacent_rename_links(struct net_device *dev, char *oldname)
|
2013-09-25 07:20:09 +00:00
|
|
|
{
|
2014-01-14 20:58:51 +00:00
|
|
|
struct netdev_adjacent *iter;
|
2013-09-25 07:20:09 +00:00
|
|
|
|
2014-08-25 12:26:45 +00:00
|
|
|
struct net *net = dev_net(dev);
|
|
|
|
|
2014-01-14 20:58:51 +00:00
|
|
|
list_for_each_entry(iter, &dev->adj_list.upper, list) {
|
2016-06-16 13:30:12 +00:00
|
|
|
if (!net_eq(net, dev_net(iter->dev)))
|
2014-08-25 12:26:45 +00:00
|
|
|
continue;
|
2014-01-14 20:58:51 +00:00
|
|
|
netdev_adjacent_sysfs_del(iter->dev, oldname,
|
|
|
|
&iter->dev->adj_list.lower);
|
|
|
|
netdev_adjacent_sysfs_add(iter->dev, dev,
|
|
|
|
&iter->dev->adj_list.lower);
|
|
|
|
}
|
2013-09-25 07:20:09 +00:00
|
|
|
|
2014-01-14 20:58:51 +00:00
|
|
|
list_for_each_entry(iter, &dev->adj_list.lower, list) {
|
2016-06-16 13:30:12 +00:00
|
|
|
if (!net_eq(net, dev_net(iter->dev)))
|
2014-08-25 12:26:45 +00:00
|
|
|
continue;
|
2014-01-14 20:58:51 +00:00
|
|
|
netdev_adjacent_sysfs_del(iter->dev, oldname,
|
|
|
|
&iter->dev->adj_list.upper);
|
|
|
|
netdev_adjacent_sysfs_add(iter->dev, dev,
|
|
|
|
&iter->dev->adj_list.upper);
|
|
|
|
}
|
2013-09-25 07:20:09 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void *netdev_lower_dev_get_private(struct net_device *dev,
|
|
|
|
struct net_device *lower_dev)
|
|
|
|
{
|
|
|
|
struct netdev_adjacent *lower;
|
|
|
|
|
|
|
|
if (!lower_dev)
|
|
|
|
return NULL;
|
2015-09-24 08:59:05 +00:00
|
|
|
lower = __netdev_find_adj(lower_dev, &dev->adj_list.lower);
|
2013-09-25 07:20:09 +00:00
|
|
|
if (!lower)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
return lower->private;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_lower_dev_get_private);
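The per-link private lookup above amounts to finding the adjacency entry for lower_dev and returning whatever was attached to it. A minimal userspace sketch, assuming a singly linked list in place of the kernel's list_head; toy_adjacent and toy_lower_get_private are hypothetical names used only for this example:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of struct netdev_adjacent: one entry per link. */
struct toy_adjacent {
	const void *dev;		/* identity of the lower device */
	void *private;			/* data attached to this link */
	struct toy_adjacent *next;
};

/* Return the private pointer stored for lower_dev, or NULL if not linked. */
static void *toy_lower_get_private(struct toy_adjacent *lower_list,
				   const void *lower_dev)
{
	struct toy_adjacent *adj;

	if (!lower_dev)
		return NULL;

	for (adj = lower_list; adj; adj = adj->next) {
		if (adj->dev == lower_dev)
			return adj->private;
	}
	return NULL;
}
```

A bonding-like driver could store its per-slave structure in private when linking and fetch it later from the slave's device pointer alone.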
|
|
|
|
|
2014-05-16 21:04:53 +00:00
|
|
|
|
net: remove type_check from dev_get_nest_level()
The idea for type_check in dev_get_nest_level() was to count the number
of nested devices of the same type (currently, only macvlan or vlan
devices).
This prevented the false positive lockdep warning on configurations such
as:
eth0 <--- macvlan0 <--- vlan0 <--- macvlan1
However, this doesn't prevent a warning on a configuration such as:
eth0 <--- macvlan0 <--- vlan0
eth1 <--- vlan1 <--- macvlan1
In this case, all the locks end up with a nesting subclass of 1, so
lockdep thinks that there is still a deadlock:
- in the first case we have (macvlan_netdev_addr_lock_key, 1) and then
take (vlan_netdev_xmit_lock_key, 1)
- in the second case, we have (vlan_netdev_xmit_lock_key, 1) and then
take (macvlan_netdev_addr_lock_key, 1)
By removing the linktype check in dev_get_nest_level() and always
incrementing the nesting depth, lockdep considers this configuration
valid.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-12 14:10:33 +00:00
|
|
|
int dev_get_nest_level(struct net_device *dev)
|
2014-05-16 21:04:53 +00:00
|
|
|
{
|
|
|
|
struct net_device *lower = NULL;
|
|
|
|
struct list_head *iter;
|
|
|
|
int max_nest = -1;
|
|
|
|
int nest;
|
|
|
|
|
|
|
|
ASSERT_RTNL();
|
|
|
|
|
|
|
|
netdev_for_each_lower_dev(dev, lower, iter) {
|
2016-08-12 14:10:33 +00:00
|
|
|
nest = dev_get_nest_level(lower);
|
2014-05-16 21:04:53 +00:00
|
|
|
if (max_nest < nest)
|
|
|
|
max_nest = nest;
|
|
|
|
}
|
|
|
|
|
2016-08-12 14:10:33 +00:00
|
|
|
return max_nest + 1;
|
2014-05-16 21:04:53 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(dev_get_nest_level);
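The recursion in dev_get_nest_level() returns 0 for a device with no lower devices and otherwise one more than the deepest lower device. The sketch below reproduces that logic in userspace; struct toy_netdev and toy_nest_level() are hypothetical names, and the array of lower pointers is an assumption standing in for netdev_for_each_lower_dev().

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_LOWER 4

/* Hypothetical stand-in for struct net_device with its direct lowers. */
struct toy_netdev {
	const struct toy_netdev *lower[TOY_MAX_LOWER];
	size_t n_lower;
};

/* Depth of the deepest chain of lower devices below dev. */
static int toy_nest_level(const struct toy_netdev *dev)
{
	int max_nest = -1;

	for (size_t i = 0; i < dev->n_lower; i++) {
		int nest = toy_nest_level(dev->lower[i]);

		if (max_nest < nest)
			max_nest = nest;
	}
	return max_nest + 1;
}
```

For the chain eth0 <--- macvlan0 <--- vlan0 <--- macvlan1 from the commit message, this model gives levels 0, 1, 2 and 3 respectively, regardless of device type.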
|
|
|
|
|
2015-12-03 11:12:15 +00:00
|
|
|
/**
|
|
|
|
* netdev_lower_state_changed - Dispatch event about lower device state change
|
|
|
|
* @lower_dev: device
|
|
|
|
* @lower_state_info: state to dispatch
|
|
|
|
*
|
|
|
|
* Send NETDEV_CHANGELOWERSTATE to netdev notifiers with info.
|
|
|
|
* The caller must hold the RTNL lock.
|
|
|
|
*/
|
|
|
|
void netdev_lower_state_changed(struct net_device *lower_dev,
|
|
|
|
void *lower_state_info)
|
|
|
|
{
|
|
|
|
struct netdev_notifier_changelowerstate_info changelowerstate_info;
|
|
|
|
|
|
|
|
ASSERT_RTNL();
|
|
|
|
changelowerstate_info.lower_state_info = lower_state_info;
|
|
|
|
call_netdevice_notifiers_info(NETDEV_CHANGELOWERSTATE, lower_dev,
|
|
|
|
&changelowerstate_info.info);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_lower_state_changed);
|
|
|
|
|
2008-10-07 22:26:48 +00:00
|
|
|
static void dev_change_rx_flags(struct net_device *dev, int flags)
|
|
|
|
{
|
2008-11-20 05:32:24 +00:00
|
|
|
const struct net_device_ops *ops = dev->netdev_ops;
|
|
|
|
|
2013-11-20 01:47:15 +00:00
|
|
|
if (ops->ndo_change_rx_flags)
|
2008-11-20 05:32:24 +00:00
|
|
|
ops->ndo_change_rx_flags(dev, flags);
|
2008-10-07 22:26:48 +00:00
|
|
|
}
|
|
|
|
|
2013-09-25 10:02:45 +00:00
|
|
|
static int __dev_set_promiscuity(struct net_device *dev, int inc, bool notify)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2011-11-30 21:42:26 +00:00
|
|
|
unsigned int old_flags = dev->flags;
|
2012-05-23 23:01:57 +00:00
|
|
|
kuid_t uid;
|
|
|
|
kgid_t gid;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2007-07-15 01:51:31 +00:00
|
|
|
ASSERT_RTNL();
|
|
|
|
|
2008-06-18 08:48:28 +00:00
|
|
|
dev->flags |= IFF_PROMISC;
|
|
|
|
dev->promiscuity += inc;
|
|
|
|
if (dev->promiscuity == 0) {
|
|
|
|
/*
|
|
|
|
* Avoid overflow.
|
|
|
|
* If inc causes overflow, restore promiscuity and return an error.
|
|
|
|
*/
|
|
|
|
if (inc < 0)
|
|
|
|
dev->flags &= ~IFF_PROMISC;
|
|
|
|
else {
|
|
|
|
dev->promiscuity -= inc;
|
2012-02-01 10:54:43 +00:00
|
|
|
pr_warn("%s: promiscuity touches roof, set promiscuity failed. promiscuity feature of device might be broken.\n",
|
|
|
|
dev->name);
|
2008-06-18 08:48:28 +00:00
|
|
|
return -EOVERFLOW;
|
|
|
|
}
|
|
|
|
}
|
2005-07-05 22:11:06 +00:00
|
|
|
if (dev->flags != old_flags) {
|
2012-02-01 10:54:43 +00:00
|
|
|
pr_info("device %s %s promiscuous mode\n",
|
|
|
|
dev->name,
|
|
|
|
dev->flags & IFF_PROMISC ? "entered" : "left");
|
2008-11-13 23:39:10 +00:00
|
|
|
if (audit_enabled) {
|
|
|
|
current_uid_gid(&uid, &gid);
|
2008-01-24 03:57:45 +00:00
|
|
|
audit_log(current->audit_context, GFP_ATOMIC,
|
|
|
|
AUDIT_ANOM_PROMISCUOUS,
|
|
|
|
"dev=%s prom=%d old_prom=%d auid=%u uid=%u gid=%u ses=%u",
|
|
|
|
dev->name, (dev->flags & IFF_PROMISC),
|
|
|
|
(old_flags & IFF_PROMISC),
|
2012-09-11 05:39:43 +00:00
|
|
|
from_kuid(&init_user_ns, audit_get_loginuid(current)),
|
2012-05-23 23:01:57 +00:00
|
|
|
from_kuid(&init_user_ns, uid),
|
|
|
|
from_kgid(&init_user_ns, gid),
|
2008-01-24 03:57:45 +00:00
|
|
|
audit_get_sessionid(current));
|
2008-11-13 23:39:10 +00:00
|
|
|
}
|
2007-07-15 01:51:31 +00:00
|
|
|
|
2008-10-07 22:26:48 +00:00
|
|
|
dev_change_rx_flags(dev, IFF_PROMISC);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2013-09-25 10:02:45 +00:00
|
|
|
if (notify)
|
|
|
|
__dev_notify_flags(dev, old_flags, IFF_PROMISC);
|
2008-06-18 08:48:28 +00:00
|
|
|
return 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2007-06-27 08:28:10 +00:00
|
|
|
/**
|
|
|
|
* dev_set_promiscuity - update promiscuity count on a device
|
|
|
|
* @dev: device
|
|
|
|
* @inc: modifier
|
|
|
|
*
|
|
|
|
* Add or remove promiscuity from a device. While the count in the device
|
|
|
|
* remains above zero the interface remains promiscuous. Once it hits zero
|
|
|
|
* the device reverts back to normal filtering operation. A negative inc
|
|
|
|
* value is used to drop promiscuity on the device.
|
2008-06-18 08:48:28 +00:00
|
|
|
* Return 0 if successful or a negative errno code on error.
|
2007-06-27 08:28:10 +00:00
|
|
|
*/
|
2008-06-18 08:48:28 +00:00
|
|
|
int dev_set_promiscuity(struct net_device *dev, int inc)
|
2007-06-27 08:28:10 +00:00
|
|
|
{
|
2011-11-30 21:42:26 +00:00
|
|
|
unsigned int old_flags = dev->flags;
|
2008-06-18 08:48:28 +00:00
|
|
|
int err;
|
2007-06-27 08:28:10 +00:00
|
|
|
|
2013-09-25 10:02:45 +00:00
|
|
|
err = __dev_set_promiscuity(dev, inc, true);
|
2008-07-06 22:49:08 +00:00
|
|
|
if (err < 0)
|
2008-06-18 08:48:28 +00:00
|
|
|
return err;
|
2007-06-27 08:28:10 +00:00
|
|
|
if (dev->flags != old_flags)
|
|
|
|
dev_set_rx_mode(dev);
|
2008-06-18 08:48:28 +00:00
|
|
|
return err;
|
2007-06-27 08:28:10 +00:00
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(dev_set_promiscuity);
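The promiscuity counter above behaves like a reference count with a wrap-around guard. The following is a minimal user-space sketch of only that counting logic, assuming a hypothetical `struct promisc_model_dev` that stands in for the two `struct net_device` fields involved. Locking, audit logging, notifiers and the driver callback are deliberately left out, and `IFF_PROMISC` is redefined locally so the sketch compiles on its own.

```c
#include <errno.h>
#include <limits.h>

#define IFF_PROMISC 0x100 /* same value as in <linux/if.h>, redefined for a standalone build */

/* Hypothetical stand-in for the relevant struct net_device fields. */
struct promisc_model_dev {
	unsigned int flags;
	unsigned int promiscuity;
};

/* Models the counter handling of __dev_set_promiscuity(). */
int model_set_promiscuity(struct promisc_model_dev *dev, int inc)
{
	dev->flags |= IFF_PROMISC;
	dev->promiscuity += inc; /* unsigned arithmetic, wraps on overflow */
	if (dev->promiscuity == 0) {
		if (inc < 0) {
			/* The last user dropped its reference: leave promiscuous mode. */
			dev->flags &= ~IFF_PROMISC;
		} else {
			/* The counter wrapped to zero: undo the increment and fail. */
			dev->promiscuity -= inc;
			return -EOVERFLOW;
		}
	}
	return 0;
}
```

In this model a device stays promiscuous for as long as at least one caller holds a reference, which is why callers are expected to pair every `+1` with a later `-1`.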
|
2007-06-27 08:28:10 +00:00
|
|
|
|
2013-09-25 10:02:45 +00:00
|
|
|
static int __dev_set_allmulti(struct net_device *dev, int inc, bool notify)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2013-09-25 10:02:45 +00:00
|
|
|
unsigned int old_flags = dev->flags, old_gflags = dev->gflags;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2007-07-15 01:51:31 +00:00
|
|
|
ASSERT_RTNL();
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
dev->flags |= IFF_ALLMULTI;
|
2008-06-18 08:48:28 +00:00
|
|
|
dev->allmulti += inc;
|
|
|
|
if (dev->allmulti == 0) {
|
|
|
|
/*
|
|
|
|
* Avoid overflow.
|
|
|
|
* If inc causes overflow, restore allmulti and return an error.
|
|
|
|
*/
|
|
|
|
if (inc < 0)
|
|
|
|
dev->flags &= ~IFF_ALLMULTI;
|
|
|
|
else {
|
|
|
|
dev->allmulti -= inc;
|
2012-02-01 10:54:43 +00:00
|
|
|
pr_warn("%s: allmulti touches roof, set allmulti failed. allmulti feature of device might be broken.\n",
|
|
|
|
dev->name);
|
2008-06-18 08:48:28 +00:00
|
|
|
return -EOVERFLOW;
|
|
|
|
}
|
|
|
|
}
|
2007-07-15 01:51:31 +00:00
|
|
|
if (dev->flags ^ old_flags) {
|
2008-10-07 22:26:48 +00:00
|
|
|
dev_change_rx_flags(dev, IFF_ALLMULTI);
|
2007-06-27 08:28:10 +00:00
|
|
|
dev_set_rx_mode(dev);
|
2013-09-25 10:02:45 +00:00
|
|
|
if (notify)
|
|
|
|
__dev_notify_flags(dev, old_flags,
|
|
|
|
dev->gflags ^ old_gflags);
|
2007-07-15 01:51:31 +00:00
|
|
|
}
|
2008-06-18 08:48:28 +00:00
|
|
|
return 0;
|
2007-06-27 08:28:10 +00:00
|
|
|
}
|
2013-09-25 10:02:45 +00:00
|
|
|
|
|
|
|
/**
|
|
|
|
* dev_set_allmulti - update allmulti count on a device
|
|
|
|
* @dev: device
|
|
|
|
* @inc: modifier
|
|
|
|
*
|
|
|
|
* Add or remove reception of all multicast frames to a device. While the
|
|
|
|
* count in the device remains above zero the interface remains listening
|
|
|
|
* to all multicast frames. Once it hits zero the device reverts back to normal
|
|
|
|
* filtering operation. A negative @inc value is used to drop the counter
|
|
|
|
* when releasing a resource needing all multicasts.
|
|
|
|
* Return 0 if successful or a negative errno code on error.
|
|
|
|
*/
|
|
|
|
|
|
|
|
int dev_set_allmulti(struct net_device *dev, int inc)
|
|
|
|
{
|
|
|
|
return __dev_set_allmulti(dev, inc, true);
|
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(dev_set_allmulti);
|
2007-06-27 08:28:10 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Upload unicast and multicast address lists to device and
|
|
|
|
* configure RX filtering. When the device doesn't support unicast
|
2007-12-20 22:02:06 +00:00
|
|
|
* filtering it is put in promiscuous mode while unicast addresses
|
2007-06-27 08:28:10 +00:00
|
|
|
* are present.
|
|
|
|
*/
|
|
|
|
void __dev_set_rx_mode(struct net_device *dev)
|
|
|
|
{
|
2008-11-20 05:32:24 +00:00
|
|
|
const struct net_device_ops *ops = dev->netdev_ops;
|
|
|
|
|
2007-06-27 08:28:10 +00:00
|
|
|
/* dev_open will call this function so the list will stay sane. */
|
|
|
|
if (!(dev->flags&IFF_UP))
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (!netif_device_present(dev))
|
2007-07-19 01:43:23 +00:00
|
|
|
return;
|
2007-06-27 08:28:10 +00:00
|
|
|
|
2011-08-16 06:29:00 +00:00
|
|
|
if (!(dev->priv_flags & IFF_UNICAST_FLT)) {
|
2007-06-27 08:28:10 +00:00
|
|
|
/* Unicast address changes may only happen under the rtnl,
|
|
|
|
* therefore calling __dev_set_promiscuity here is safe.
|
|
|
|
*/
|
2010-01-25 21:36:10 +00:00
|
|
|
if (!netdev_uc_empty(dev) && !dev->uc_promisc) {
|
2013-09-25 10:02:45 +00:00
|
|
|
__dev_set_promiscuity(dev, 1, false);
|
2011-07-25 23:17:35 +00:00
|
|
|
dev->uc_promisc = true;
|
2010-01-25 21:36:10 +00:00
|
|
|
} else if (netdev_uc_empty(dev) && dev->uc_promisc) {
|
2013-09-25 10:02:45 +00:00
|
|
|
__dev_set_promiscuity(dev, -1, false);
|
2011-07-25 23:17:35 +00:00
|
|
|
dev->uc_promisc = false;
|
2007-06-27 08:28:10 +00:00
|
|
|
}
|
|
|
|
}
|
2011-08-16 06:29:00 +00:00
|
|
|
|
|
|
|
if (ops->ndo_set_rx_mode)
|
|
|
|
ops->ndo_set_rx_mode(dev);
|
2007-06-27 08:28:10 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void dev_set_rx_mode(struct net_device *dev)
|
|
|
|
{
|
2008-07-15 07:15:08 +00:00
|
|
|
netif_addr_lock_bh(dev);
|
2007-06-27 08:28:10 +00:00
|
|
|
__dev_set_rx_mode(dev);
|
2008-07-15 07:15:08 +00:00
|
|
|
netif_addr_unlock_bh(dev);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
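The unicast fallback in __dev_set_rx_mode() can be read as a small state machine: a device without unicast filtering takes one promiscuity reference while secondary unicast addresses exist and drops it once the list is empty. The following user-space sketch models just that toggle. The struct and its field names are hypothetical stand-ins (for `IFF_UNICAST_FLT`, `netdev_uc_empty()` and `dev->uc_promisc`), and the real `__dev_set_promiscuity()` call is reduced to a plain counter update.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the fields __dev_set_rx_mode() consults. */
struct rx_model_dev {
	bool unicast_filtering;   /* models IFF_UNICAST_FLT in priv_flags */
	int uc_count;             /* number of secondary unicast addresses */
	bool uc_promisc;          /* true while we hold the fallback reference */
	unsigned int promiscuity; /* models dev->promiscuity */
};

/* Take or release exactly one promiscuity reference as the list changes. */
void model_sync_uc_promisc(struct rx_model_dev *dev)
{
	if (dev->unicast_filtering)
		return; /* hardware filters unicast itself, no fallback needed */

	if (dev->uc_count > 0 && !dev->uc_promisc) {
		dev->promiscuity++;
		dev->uc_promisc = true;
	} else if (dev->uc_count == 0 && dev->uc_promisc) {
		dev->promiscuity--;
		dev->uc_promisc = false;
	}
}
```

The `uc_promisc` flag is what makes the function idempotent: calling it repeatedly with an unchanged address list never takes a second reference.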
|
|
|
|
|
2008-09-30 09:23:58 +00:00
|
|
|
/**
|
|
|
|
* dev_get_flags - get flags reported to userspace
|
|
|
|
* @dev: device
|
|
|
|
*
|
|
|
|
* Get the combination of flag bits exported through APIs to userspace.
|
|
|
|
*/
|
2012-04-15 05:58:06 +00:00
|
|
|
unsigned int dev_get_flags(const struct net_device *dev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2012-04-15 05:58:06 +00:00
|
|
|
unsigned int flags;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
flags = (dev->flags & ~(IFF_PROMISC |
|
|
|
|
IFF_ALLMULTI |
|
2006-03-21 01:09:11 +00:00
|
|
|
IFF_RUNNING |
|
|
|
|
IFF_LOWER_UP |
|
|
|
|
IFF_DORMANT)) |
|
2005-04-16 22:20:36 +00:00
|
|
|
(dev->gflags & (IFF_PROMISC |
|
|
|
|
IFF_ALLMULTI));
|
|
|
|
|
2006-03-21 01:09:11 +00:00
|
|
|
if (netif_running(dev)) {
|
|
|
|
if (netif_oper_up(dev))
|
|
|
|
flags |= IFF_RUNNING;
|
|
|
|
if (netif_carrier_ok(dev))
|
|
|
|
flags |= IFF_LOWER_UP;
|
|
|
|
if (netif_dormant(dev))
|
|
|
|
flags |= IFF_DORMANT;
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
return flags;
|
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(dev_get_flags);
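dev_get_flags() reports `IFF_PROMISC` and `IFF_ALLMULTI` from `gflags` (what userspace requested) rather than from the internal `flags` word, and derives the three link-state bits from the current operational state. A minimal user-space sketch of that composition follows. The struct is a hypothetical stand-in, the four integer fields replace `netif_running()`, `netif_oper_up()`, `netif_carrier_ok()` and `netif_dormant()`, and the flag values are copied from `<linux/if.h>` so the sketch is standalone.

```c
/* Flag values as in <linux/if.h>, redefined for a standalone build. */
enum {
	IFF_UP       = 0x1,
	IFF_RUNNING  = 0x40,
	IFF_PROMISC  = 0x100,
	IFF_ALLMULTI = 0x200,
	IFF_LOWER_UP = 0x10000,
	IFF_DORMANT  = 0x20000
};

/* Hypothetical stand-in for the state dev_get_flags() reads. */
struct flags_model_dev {
	unsigned int flags;  /* kernel-internal flag word */
	unsigned int gflags; /* promisc/allmulti as requested by userspace */
	int running;         /* models netif_running() */
	int oper_up;         /* models netif_oper_up() */
	int carrier_ok;      /* models netif_carrier_ok() */
	int dormant;         /* models netif_dormant() */
};

unsigned int model_get_flags(const struct flags_model_dev *dev)
{
	unsigned int flags;

	/* Drop the bits that are recomputed, then take promisc/allmulti from gflags. */
	flags = (dev->flags & ~(IFF_PROMISC | IFF_ALLMULTI | IFF_RUNNING |
				IFF_LOWER_UP | IFF_DORMANT)) |
		(dev->gflags & (IFF_PROMISC | IFF_ALLMULTI));

	if (dev->running) {
		if (dev->oper_up)
			flags |= IFF_RUNNING;
		if (dev->carrier_ok)
			flags |= IFF_LOWER_UP;
		if (dev->dormant)
			flags |= IFF_DORMANT;
	}
	return flags;
}
```

One consequence visible in the model: a promiscuity reference taken inside the kernel (for example by a packet capture socket) sets `IFF_PROMISC` in `flags` but not in `gflags`, so it does not show up in the value returned to userspace.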
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2010-02-26 06:34:53 +00:00
|
|
|
int __dev_change_flags(struct net_device *dev, unsigned int flags)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2011-11-30 21:42:26 +00:00
|
|
|
unsigned int old_flags = dev->flags;
|
2010-02-26 06:34:53 +00:00
|
|
|
int ret;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2007-07-15 01:51:31 +00:00
|
|
|
ASSERT_RTNL();
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* Set the flags on our device.
|
|
|
|
*/
|
|
|
|
|
|
|
|
dev->flags = (flags & (IFF_DEBUG | IFF_NOTRAILERS | IFF_NOARP |
|
|
|
|
IFF_DYNAMIC | IFF_MULTICAST | IFF_PORTSEL |
|
|
|
|
IFF_AUTOMEDIA)) |
|
|
|
|
(dev->flags & (IFF_UP | IFF_VOLATILE | IFF_PROMISC |
|
|
|
|
IFF_ALLMULTI));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Load in the correct multicast list now the flags have changed.
|
|
|
|
*/
|
|
|
|
|
2008-10-07 22:26:48 +00:00
|
|
|
if ((old_flags ^ flags) & IFF_MULTICAST)
|
|
|
|
dev_change_rx_flags(dev, IFF_MULTICAST);
|
2007-07-15 01:51:31 +00:00
|
|
|
|
2007-06-27 08:28:10 +00:00
|
|
|
dev_set_rx_mode(dev);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Have we downed the interface? We handle IFF_UP ourselves
|
|
|
|
* according to user attempts to set it, rather than blindly
|
|
|
|
* setting it.
|
|
|
|
*/
|
|
|
|
|
|
|
|
ret = 0;
|
2017-07-18 22:59:27 +00:00
|
|
|
if ((old_flags ^ flags) & IFF_UP) {
|
|
|
|
if (old_flags & IFF_UP)
|
|
|
|
__dev_close(dev);
|
|
|
|
else
|
|
|
|
ret = __dev_open(dev);
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if ((flags ^ dev->gflags) & IFF_PROMISC) {
|
2009-09-03 08:29:39 +00:00
|
|
|
int inc = (flags & IFF_PROMISC) ? 1 : -1;
|
2013-09-25 10:02:45 +00:00
|
|
|
unsigned int old_flags = dev->flags;
|
2009-09-03 08:29:39 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
dev->gflags ^= IFF_PROMISC;
|
2013-09-25 10:02:45 +00:00
|
|
|
|
|
|
|
if (__dev_set_promiscuity(dev, inc, false) >= 0)
|
|
|
|
if (dev->flags != old_flags)
|
|
|
|
dev_set_rx_mode(dev);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/* NOTE: order of synchronization of IFF_PROMISC and IFF_ALLMULTI
|
2017-02-09 06:56:06 +00:00
|
|
|
* is important. Some (broken) drivers set IFF_PROMISC when
|
|
|
|
* IFF_ALLMULTI is requested, without asking us and without reporting it.
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
|
|
|
if ((flags ^ dev->gflags) & IFF_ALLMULTI) {
|
2009-09-03 08:29:39 +00:00
|
|
|
int inc = (flags & IFF_ALLMULTI) ? 1 : -1;
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
dev->gflags ^= IFF_ALLMULTI;
|
2013-09-25 10:02:45 +00:00
|
|
|
__dev_set_allmulti(dev, inc, false);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2010-02-26 06:34:53 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-09-25 10:02:44 +00:00
|
|
|
void __dev_notify_flags(struct net_device *dev, unsigned int old_flags,
|
|
|
|
unsigned int gchanges)
|
2010-02-26 06:34:53 +00:00
|
|
|
{
|
|
|
|
unsigned int changes = dev->flags ^ old_flags;
|
|
|
|
|
2013-09-25 10:02:44 +00:00
|
|
|
if (gchanges)
|
2013-10-23 23:02:42 +00:00
|
|
|
rtmsg_ifinfo(RTM_NEWLINK, dev, gchanges, GFP_ATOMIC);
|
2013-09-25 10:02:44 +00:00
|
|
|
|
2010-02-26 06:34:53 +00:00
|
|
|
if (changes & IFF_UP) {
|
|
|
|
if (dev->flags & IFF_UP)
|
|
|
|
call_netdevice_notifiers(NETDEV_UP, dev);
|
|
|
|
else
|
|
|
|
call_netdevice_notifiers(NETDEV_DOWN, dev);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (dev->flags & IFF_UP &&
|
2013-05-28 01:30:22 +00:00
|
|
|
(changes & ~(IFF_UP | IFF_PROMISC | IFF_ALLMULTI | IFF_VOLATILE))) {
|
|
|
|
struct netdev_notifier_change_info change_info;
|
|
|
|
|
|
|
|
change_info.flags_changed = changes;
|
|
|
|
call_netdevice_notifiers_info(NETDEV_CHANGE, dev,
|
|
|
|
&change_info.info);
|
|
|
|
}
|
2010-02-26 06:34:53 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* dev_change_flags - change device settings
|
|
|
|
* @dev: device
|
|
|
|
* @flags: device state flags
|
|
|
|
*
|
|
|
|
* Change settings on the device based on state flags. The flags are
|
|
|
|
* in the userspace exported format.
|
|
|
|
*/
|
2011-11-30 21:42:26 +00:00
|
|
|
int dev_change_flags(struct net_device *dev, unsigned int flags)
|
2010-02-26 06:34:53 +00:00
|
|
|
{
|
2011-11-30 21:42:26 +00:00
|
|
|
int ret;
|
2013-09-25 10:02:45 +00:00
|
|
|
unsigned int changes, old_flags = dev->flags, old_gflags = dev->gflags;
|
2010-02-26 06:34:53 +00:00
|
|
|
|
|
|
|
ret = __dev_change_flags(dev, flags);
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
2013-09-25 10:02:45 +00:00
|
|
|
changes = (old_flags ^ dev->flags) | (old_gflags ^ dev->gflags);
|
2013-09-25 10:02:44 +00:00
|
|
|
__dev_notify_flags(dev, old_flags, changes);
|
2005-04-16 22:20:36 +00:00
|
|
|
return ret;
|
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(dev_change_flags);
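The first assignment in __dev_change_flags() splits the flag word into bits the caller may set directly and bits the kernel keeps under its own control. The sketch below models only that merge step in user space. The function name `model_merge_flags` is hypothetical, and the flag values and the `IFF_VOLATILE` mask are copied from `<linux/if.h>` so the block compiles standalone. `IFF_UP`, `IFF_PROMISC` and `IFF_ALLMULTI` are preserved here because the real function handles them afterwards through open/close and the reference counters.

```c
/* Flag values as in <linux/if.h>, redefined for a standalone build. */
enum {
	IFF_UP          = 0x1,
	IFF_BROADCAST   = 0x2,
	IFF_DEBUG       = 0x4,
	IFF_LOOPBACK    = 0x8,
	IFF_POINTOPOINT = 0x10,
	IFF_NOTRAILERS  = 0x20,
	IFF_RUNNING     = 0x40,
	IFF_NOARP       = 0x80,
	IFF_PROMISC     = 0x100,
	IFF_ALLMULTI    = 0x200,
	IFF_MASTER      = 0x400,
	IFF_SLAVE       = 0x800,
	IFF_MULTICAST   = 0x1000,
	IFF_PORTSEL     = 0x2000,
	IFF_AUTOMEDIA   = 0x4000,
	IFF_DYNAMIC     = 0x8000,
	IFF_LOWER_UP    = 0x10000,
	IFF_DORMANT     = 0x20000,
	IFF_ECHO        = 0x40000
};

#define IFF_VOLATILE (IFF_LOOPBACK | IFF_POINTOPOINT | IFF_BROADCAST | \
		      IFF_ECHO | IFF_MASTER | IFF_SLAVE | IFF_RUNNING | \
		      IFF_LOWER_UP | IFF_DORMANT)

/* Bits copied straight from the request. */
#define MODEL_USER_SETTABLE (IFF_DEBUG | IFF_NOTRAILERS | IFF_NOARP | \
			     IFF_DYNAMIC | IFF_MULTICAST | IFF_PORTSEL | \
			     IFF_AUTOMEDIA)

/* Bits kept from the current device state at this step. */
#define MODEL_KERNEL_OWNED (IFF_UP | IFF_VOLATILE | IFF_PROMISC | IFF_ALLMULTI)

unsigned int model_merge_flags(unsigned int current_flags, unsigned int requested)
{
	return (requested & MODEL_USER_SETTABLE) |
	       (current_flags & MODEL_KERNEL_OWNED);
}
```

For example, a request that omits `IFF_MULTICAST` clears it immediately, while a request that sets `IFF_PROMISC` has no effect on this step and is only acted on later through the promiscuity counter.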
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2017-07-06 22:01:57 +00:00
|
|
|
int __dev_set_mtu(struct net_device *dev, int new_mtu)
|
2014-01-10 15:56:25 +00:00
|
|
|
{
|
|
|
|
const struct net_device_ops *ops = dev->netdev_ops;
|
|
|
|
|
|
|
|
if (ops->ndo_change_mtu)
|
|
|
|
return ops->ndo_change_mtu(dev, new_mtu);
|
|
|
|
|
|
|
|
dev->mtu = new_mtu;
|
|
|
|
return 0;
|
|
|
|
}
|
2017-07-06 22:01:57 +00:00
|
|
|
EXPORT_SYMBOL(__dev_set_mtu);
|
2014-01-10 15:56:25 +00:00
|
|
|
|
2008-09-30 09:23:58 +00:00
|
|
|
/**
|
|
|
|
* dev_set_mtu - Change maximum transfer unit
|
|
|
|
* @dev: device
|
|
|
|
* @new_mtu: new transfer unit
|
|
|
|
*
|
|
|
|
* Change the maximum transfer size of the network device.
|
|
|
|
*/
|
2005-04-16 22:20:36 +00:00
|
|
|
int dev_set_mtu(struct net_device *dev, int new_mtu)
|
|
|
|
{
|
2014-01-10 15:56:25 +00:00
|
|
|
int err, orig_mtu;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if (new_mtu == dev->mtu)
|
|
|
|
return 0;
|
|
|
|
|
net: centralize net_device min/max MTU checking
While looking into an MTU issue with sfc, I started noticing that almost
every NIC driver with an ndo_change_mtu function implemented almost
exactly the same range checks, and in many cases, that was the only
practical thing their ndo_change_mtu function was doing. Quite a few
drivers have either 68, 64, 60 or 46 as their minimum MTU value checked,
and then various sizes from 1500 to 65535 for their maximum MTU value. We
can remove a whole lot of redundant code here if we simply store min_mtu
and max_mtu in net_device, and check against those in net/core/dev.c's
dev_set_mtu().
In theory, there should be zero functional change with this patch, it just
puts the infrastructure in place. Subsequent patches will attempt to start
using said infrastructure, with theoretically zero change in
functionality.
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-08 02:04:33 +00:00
|
|
|
/* MTU must be positive, and in range */
|
|
|
|
if (new_mtu < 0 || new_mtu < dev->min_mtu) {
|
|
|
|
net_err_ratelimited("%s: Invalid MTU %d requested, hw min %d\n",
|
|
|
|
dev->name, new_mtu, dev->min_mtu);
|
2005-04-16 22:20:36 +00:00
|
|
|
return -EINVAL;
|
2016-10-08 02:04:33 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (dev->max_mtu > 0 && new_mtu > dev->max_mtu) {
|
|
|
|
net_err_ratelimited("%s: Invalid MTU %d requested, hw max %d\n",
|
2016-10-17 17:02:22 +00:00
|
|
|
dev->name, new_mtu, dev->max_mtu);
|
2016-10-08 02:04:33 +00:00
|
|
|
return -EINVAL;
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
if (!netif_device_present(dev))
|
|
|
|
return -ENODEV;
|
|
|
|
|
2014-01-15 23:02:18 +00:00
|
|
|
err = call_netdevice_notifiers(NETDEV_PRECHANGEMTU, dev);
|
|
|
|
err = notifier_to_errno(err);
|
|
|
|
if (err)
|
|
|
|
return err;
|
2008-11-20 05:32:24 +00:00
|
|
|
|
2014-01-10 15:56:25 +00:00
|
|
|
orig_mtu = dev->mtu;
|
|
|
|
err = __dev_set_mtu(dev, new_mtu);
|
2008-11-20 05:32:24 +00:00
|
|
|
|
2014-01-10 15:56:25 +00:00
|
|
|
if (!err) {
|
|
|
|
err = call_netdevice_notifiers(NETDEV_CHANGEMTU, dev);
|
|
|
|
err = notifier_to_errno(err);
|
|
|
|
if (err) {
|
|
|
|
/* setting mtu back and notifying everyone again,
|
|
|
|
* so that they have a chance to revert changes.
|
|
|
|
*/
|
|
|
|
__dev_set_mtu(dev, orig_mtu);
|
|
|
|
call_netdevice_notifiers(NETDEV_CHANGEMTU, dev);
|
|
|
|
}
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
return err;
|
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(dev_set_mtu);
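The control flow of dev_set_mtu() is: reject out-of-range values, apply the new MTU, let listeners veto it, and on a veto restore the old value and notify again. The following user-space sketch models that sequence under simplifying assumptions: the struct and the `on_change` hook are hypothetical (the hook stands in for the NETDEV_CHANGEMTU notifier chain), and the NETDEV_PRECHANGEMTU round, the device-present check and the `ndo_change_mtu` driver callback are omitted.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the MTU-related state of a net_device. */
struct mtu_model_dev {
	int mtu;
	int min_mtu;
	int max_mtu; /* 0 means no upper bound is enforced */
	/* Stand-in for the NETDEV_CHANGEMTU notifier chain; may be NULL. */
	int (*on_change)(const struct mtu_model_dev *dev);
};

int model_set_mtu(struct mtu_model_dev *dev, int new_mtu)
{
	int err, orig_mtu;

	if (new_mtu == dev->mtu)
		return 0;

	/* MTU must be non-negative and within the device limits. */
	if (new_mtu < 0 || new_mtu < dev->min_mtu)
		return -EINVAL;
	if (dev->max_mtu > 0 && new_mtu > dev->max_mtu)
		return -EINVAL;

	orig_mtu = dev->mtu;
	dev->mtu = new_mtu;

	err = dev->on_change ? dev->on_change(dev) : 0;
	if (err) {
		/* A listener refused: restore the old MTU and notify again
		 * so that listeners can revert their own state.
		 */
		dev->mtu = orig_mtu;
		dev->on_change(dev);
	}
	return err;
}

/* Example listener for the model: refuses anything above 1500 bytes. */
int model_reject_jumbo(const struct mtu_model_dev *dev)
{
	return dev->mtu > 1500 ? -EBUSY : 0;
}
```

Note the asymmetry the model preserves: a `max_mtu` of zero disables the upper check, whereas `min_mtu` is always enforced.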
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2011-01-13 23:38:30 +00:00
|
|
|
/**
|
|
|
|
* dev_set_group - Change group this device belongs to
|
|
|
|
* @dev: device
|
|
|
|
* @new_group: group this device should belong to
|
|
|
|
*/
|
|
|
|
void dev_set_group(struct net_device *dev, int new_group)
|
|
|
|
{
|
|
|
|
dev->group = new_group;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(dev_set_group);
|
|
|
|
|
2008-09-30 09:23:58 +00:00
|
|
|
/**
|
|
|
|
* dev_set_mac_address - Change Media Access Control Address
|
|
|
|
* @dev: device
|
|
|
|
* @sa: new address
|
|
|
|
*
|
|
|
|
* Change the hardware (MAC) address of the device
|
|
|
|
*/
|
2005-04-16 22:20:36 +00:00
|
|
|
int dev_set_mac_address(struct net_device *dev, struct sockaddr *sa)
|
|
|
|
{
|
2008-11-20 05:32:24 +00:00
|
|
|
const struct net_device_ops *ops = dev->netdev_ops;
|
2005-04-16 22:20:36 +00:00
|
|
|
int err;
|
|
|
|
|
2008-11-20 05:32:24 +00:00
|
|
|
if (!ops->ndo_set_mac_address)
|
2005-04-16 22:20:36 +00:00
|
|
|
return -EOPNOTSUPP;
|
|
|
|
if (sa->sa_family != dev->type)
|
|
|
|
return -EINVAL;
|
|
|
|
if (!netif_device_present(dev))
|
|
|
|
return -ENODEV;
|
2008-11-20 05:32:24 +00:00
|
|
|
err = ops->ndo_set_mac_address(dev, sa);
|
2013-01-01 03:30:14 +00:00
|
|
|
if (err)
|
|
|
|
return err;
|
2013-01-01 03:30:16 +00:00
|
|
|
dev->addr_assign_type = NET_ADDR_SET;
|
2013-01-01 03:30:14 +00:00
|
|
|
call_netdevice_notifiers(NETDEV_CHANGEADDR, dev);
|
2012-07-05 01:23:25 +00:00
|
|
|
add_device_randomness(dev->dev_addr, dev->addr_len);
|
2013-01-01 03:30:14 +00:00
|
|
|
return 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(dev_set_mac_address);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2012-12-27 23:49:37 +00:00
|
|
|
/**
|
|
|
|
* dev_change_carrier - Change device carrier
|
|
|
|
* @dev: device
|
2013-03-04 12:32:43 +00:00
|
|
|
* @new_carrier: new value
|
2012-12-27 23:49:37 +00:00
|
|
|
*
|
|
|
|
* Change device carrier
|
|
|
|
*/
|
|
|
|
int dev_change_carrier(struct net_device *dev, bool new_carrier)
|
|
|
|
{
|
|
|
|
const struct net_device_ops *ops = dev->netdev_ops;
|
|
|
|
|
|
|
|
if (!ops->ndo_change_carrier)
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
if (!netif_device_present(dev))
|
|
|
|
return -ENODEV;
|
|
|
|
return ops->ndo_change_carrier(dev, new_carrier);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(dev_change_carrier);
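dev_change_carrier(), like the other thin wrappers around it in this file, follows one pattern: return -EOPNOTSUPP when the driver does not provide the callback, return -ENODEV when the device is not present, and otherwise delegate. A minimal user-space sketch of that pattern follows. All type and function names in it are hypothetical, and the `present` field stands in for `netif_device_present()`.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct carrier_model_dev;

/* Hypothetical ops table with a single optional callback. */
struct carrier_model_ops {
	int (*change_carrier)(struct carrier_model_dev *dev, bool new_carrier);
};

struct carrier_model_dev {
	const struct carrier_model_ops *ops;
	bool present; /* models netif_device_present() */
	bool carrier;
};

int model_change_carrier(struct carrier_model_dev *dev, bool new_carrier)
{
	if (!dev->ops->change_carrier)
		return -EOPNOTSUPP; /* driver does not implement the operation */
	if (!dev->present)
		return -ENODEV;     /* device is detached or suspended */
	return dev->ops->change_carrier(dev, new_carrier);
}

/* Example driver callback for the model: just records the new state. */
int model_soft_change_carrier(struct carrier_model_dev *dev, bool new_carrier)
{
	dev->carrier = new_carrier;
	return 0;
}
```

Checking for the callback before checking presence means an unsupported operation is reported as -EOPNOTSUPP even on a device that is currently absent.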
|
|
|
|
|
2013-07-29 16:16:49 +00:00
|
|
|
/**
|
|
|
|
* dev_get_phys_port_id - Get device physical port ID
|
|
|
|
* @dev: device
|
|
|
|
* @ppid: port ID
|
|
|
|
*
|
|
|
|
* Get device physical port ID
|
|
|
|
*/
|
|
|
|
int dev_get_phys_port_id(struct net_device *dev,
|
2014-11-28 13:34:16 +00:00
|
|
|
struct netdev_phys_item_id *ppid)
|
2013-07-29 16:16:49 +00:00
|
|
|
{
|
|
|
|
const struct net_device_ops *ops = dev->netdev_ops;
|
|
|
|
|
|
|
|
if (!ops->ndo_get_phys_port_id)
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
return ops->ndo_get_phys_port_id(dev, ppid);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(dev_get_phys_port_id);
|
|
|
|
|
2015-03-18 02:23:15 +00:00
|
|
|
/**
|
|
|
|
* dev_get_phys_port_name - Get device physical port name
|
|
|
|
* @dev: device
|
|
|
|
* @name: port name
|
2016-03-21 16:31:14 +00:00
|
|
|
* @len: limit of bytes to copy to name
|
2015-03-18 02:23:15 +00:00
|
|
|
*
|
|
|
|
* Get device physical port name
|
|
|
|
*/
|
|
|
|
int dev_get_phys_port_name(struct net_device *dev,
|
|
|
|
char *name, size_t len)
|
|
|
|
{
|
|
|
|
const struct net_device_ops *ops = dev->netdev_ops;
|
|
|
|
|
|
|
|
if (!ops->ndo_get_phys_port_name)
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
return ops->ndo_get_phys_port_name(dev, name, len);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(dev_get_phys_port_name);
|
|
|
|
|
2015-07-14 20:43:19 +00:00
|
|
|
/**
|
|
|
|
* dev_change_proto_down - update protocol port state information
|
|
|
|
* @dev: device
|
|
|
|
* @proto_down: new value
|
|
|
|
*
|
|
|
|
* This info can be used by switch drivers to set the phys state of the
|
|
|
|
* port.
|
|
|
|
*/
|
|
|
|
int dev_change_proto_down(struct net_device *dev, bool proto_down)
|
|
|
|
{
|
|
|
|
const struct net_device_ops *ops = dev->netdev_ops;
|
|
|
|
|
|
|
|
if (!ops->ndo_change_proto_down)
|
|
|
|
return -EOPNOTSUPP;
|
|
|
|
if (!netif_device_present(dev))
|
|
|
|
return -ENODEV;
|
|
|
|
return ops->ndo_change_proto_down(dev, proto_down);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(dev_change_proto_down);
|
|
|
|
|
2017-06-22 01:25:09 +00:00
|
|
|
u8 __dev_xdp_attached(struct net_device *dev, xdp_op_t xdp_op, u32 *prog_id)
|
xdp: refine xdp api with regards to generic xdp
While working on the iproute2 generic XDP frontend, I noticed that
as of right now it's possible to have native *and* generic XDP
programs loaded both at the same time for the case when a driver
supports native XDP.
The intended model for generic XDP from b5cdae3291f7 ("net: Generic
XDP") is, however, that only one out of the two can be present at
once which is also indicated as such in the XDP netlink dump part.
The main rationale for generic XDP is to ease accessibility (in
case a driver does not yet have XDP support) and to generically
provide a semantical model as an example for driver developers
wanting to add XDP support. The generic XDP option for an XDP
aware driver can still be useful for comparing and testing both
implementations.
However, it is not intended to have a second XDP processing stage
or layer with exactly the same functionality of the first native
stage. Only reason could be to have a partial fallback for future
XDP features that are not supported yet in the native implementation
and we probably also shouldn't strive for such fallback and instead
encourage native feature support in the first place. Given there's
currently no such fallback issue or use case, lets not go there yet
if we don't need to.
Therefore, change semantics for loading XDP and bail out if the
user tries to load a generic XDP program when a native one is
present and vice versa. Another alternative to bailing out would
be to handle the transition from one flavor to another gracefully,
but that would require to bring the device down, exchange both
types of programs, and bring it up again in order to avoid a tiny
window where a packet could hit both hooks. Given this complicates
the logic for just a debugging feature in the native case, I went
with the simpler variant.
For the dump, remove IFLA_XDP_FLAGS that was added with b5cdae3291f7
and reuse IFLA_XDP_ATTACHED for indicating the mode. Dumping all
or just a subset of flags that were used for loading the XDP prog
is suboptimal in the long run since not all flags are useful for
dumping and if we start to reuse the same flag definitions for
load and dump, then we'll waste bit space. What we really just
want is to dump the mode for now.
Current IFLA_XDP_ATTACHED semantics are: nothing was installed (0),
a program is running at the native driver layer (1). Thus, add a
mode that says that a program is running at generic XDP layer (2).
Applications will handle this fine in that older binaries will
just indicate that something is attached at XDP layer, effectively
this is similar to IFLA_XDP_FLAGS attr that we would have had
modulo the redundancy.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-11 23:04:46 +00:00
|
|
|
{
|
|
|
|
struct netdev_xdp xdp;
|
|
|
|
|
|
|
|
memset(&xdp, 0, sizeof(xdp));
|
|
|
|
xdp.command = XDP_QUERY_PROG;
|
|
|
|
|
|
|
|
/* Query must always succeed. */
|
|
|
|
WARN_ON(xdp_op(dev, &xdp) < 0);
|
2017-06-16 00:29:09 +00:00
|
|
|
if (prog_id)
|
|
|
|
*prog_id = xdp.prog_id;
|
|
|
|
|
2017-05-11 23:04:46 +00:00
|
|
|
return xdp.prog_attached;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int dev_xdp_install(struct net_device *dev, xdp_op_t xdp_op,
|
2017-06-22 01:25:03 +00:00
|
|
|
struct netlink_ext_ack *extack, u32 flags,
|
2017-05-11 23:04:46 +00:00
|
|
|
struct bpf_prog *prog)
|
|
|
|
{
|
|
|
|
struct netdev_xdp xdp;
|
|
|
|
|
|
|
|
memset(&xdp, 0, sizeof(xdp));
|
2017-06-22 01:25:04 +00:00
|
|
|
if (flags & XDP_FLAGS_HW_MODE)
|
|
|
|
xdp.command = XDP_SETUP_PROG_HW;
|
|
|
|
else
|
|
|
|
xdp.command = XDP_SETUP_PROG;
|
2017-05-11 23:04:46 +00:00
|
|
|
xdp.extack = extack;
|
2017-06-22 01:25:03 +00:00
|
|
|
xdp.flags = flags;
|
2017-05-11 23:04:46 +00:00
|
|
|
xdp.prog = prog;
|
|
|
|
|
|
|
|
return xdp_op(dev, &xdp);
|
|
|
|
}
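The commit message above describes three IFLA_XDP_ATTACHED dump values: 0 (nothing installed), 1 (native driver layer), and 2 (generic XDP layer). A minimal user-space sketch of how a tool might interpret them follows. The enumerator and function names here are illustrative assumptions, not the kernel's or iproute2's actual identifiers.

```c
/* Local mirror of the three dump values named in the commit message.
 * The enumerator names are hypothetical, chosen for this sketch only. */
enum model_xdp_attached {
	MODEL_ATTACHED_NONE = 0,	/* nothing was installed */
	MODEL_ATTACHED_DRV  = 1,	/* program runs at the native driver layer */
	MODEL_ATTACHED_SKB  = 2,	/* program runs at the generic XDP layer */
};

/* What an updated tool could report for each value. */
const char *model_xdp_mode_name(unsigned char attached)
{
	switch (attached) {
	case MODEL_ATTACHED_NONE:
		return "none";
	case MODEL_ATTACHED_DRV:
		return "native";
	case MODEL_ATTACHED_SKB:
		return "generic";
	default:
		return "unknown";
	}
}

/* What an older binary effectively does: it only distinguishes zero
 * from non-zero, so value 2 still reads as "something is attached". */
int model_xdp_is_attached_legacy(unsigned char attached)
{
	return attached != 0;
}
```

This is why the commit message can say older binaries "handle this fine": they lose the mode distinction but still report an attachment.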
|
|
|
|
|
2016-07-19 19:16:48 +00:00
|
|
|
/**
|
|
|
|
* dev_change_xdp_fd - set or clear a bpf program for a device rx path
|
|
|
|
* @dev: device
|
2017-05-01 22:53:43 +00:00
|
|
|
* @extack: netlink extended ack
|
2016-07-19 19:16:48 +00:00
|
|
|
* @fd: new program fd or negative value to clear
|
2016-11-28 22:16:54 +00:00
|
|
|
* @flags: xdp-related flags
|
2016-07-19 19:16:48 +00:00
|
|
|
*
|
|
|
|
* Set or clear a bpf program for a device
|
|
|
|
*/
|
2017-05-01 04:46:46 +00:00
|
|
|
int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
|
|
|
|
int fd, u32 flags)
|
2016-07-19 19:16:48 +00:00
|
|
|
{
|
|
|
|
const struct net_device_ops *ops = dev->netdev_ops;
|
|
|
|
struct bpf_prog *prog = NULL;
|
2017-05-11 23:04:46 +00:00
|
|
|
xdp_op_t xdp_op, xdp_chk;
|
2016-07-19 19:16:48 +00:00
|
|
|
int err;
|
|
|
|
|
2016-11-28 22:16:54 +00:00
|
|
|
ASSERT_RTNL();
|
|
|
|
|
2017-05-11 23:04:46 +00:00
|
|
|
xdp_op = xdp_chk = ops->ndo_xdp;
|
2017-06-22 01:25:04 +00:00
|
|
|
if (!xdp_op && (flags & (XDP_FLAGS_DRV_MODE | XDP_FLAGS_HW_MODE)))
|
2017-05-11 23:04:45 +00:00
|
|
|
return -EOPNOTSUPP;
|
2017-04-18 19:36:58 +00:00
|
|
|
if (!xdp_op || (flags & XDP_FLAGS_SKB_MODE))
|
|
|
|
xdp_op = generic_xdp_install;
|
2017-05-11 23:04:46 +00:00
|
|
|
if (xdp_op == xdp_chk)
|
|
|
|
xdp_chk = generic_xdp_install;
|
2017-04-18 19:36:58 +00:00
|
|
|
|
2016-07-19 19:16:48 +00:00
|
|
|
if (fd >= 0) {
|
2017-06-16 00:29:09 +00:00
|
|
|
if (xdp_chk && __dev_xdp_attached(dev, xdp_chk, NULL))
|
2017-05-11 23:04:46 +00:00
|
|
|
return -EEXIST;
|
|
|
|
if ((flags & XDP_FLAGS_UPDATE_IF_NOEXIST) &&
|
2017-06-16 00:29:09 +00:00
|
|
|
__dev_xdp_attached(dev, xdp_op, NULL))
|
2017-05-11 23:04:46 +00:00
|
|
|
return -EBUSY;
|
2016-11-28 22:16:54 +00:00
|
|
|
|
2016-07-19 19:16:48 +00:00
|
|
|
prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_XDP);
|
|
|
|
if (IS_ERR(prog))
|
|
|
|
return PTR_ERR(prog);
|
|
|
|
}
|
|
|
|
|
2017-06-22 01:25:03 +00:00
|
|
|
err = dev_xdp_install(dev, xdp_op, extack, flags, prog);
|
2016-07-19 19:16:48 +00:00
|
|
|
if (err < 0 && prog)
|
|
|
|
bpf_prog_put(prog);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
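The hook selection and conflict checks in dev_change_xdp_fd() can be modelled in user space to see which combinations are rejected. The following is a sketch under stated assumptions: `struct model_dev`, `enum model_hook`, and the `MODEL_F_*` flag bits are hypothetical stand-ins (the bit positions are assumed to mirror the uapi XDP_FLAGS_* values), and the program lookup and install steps are omitted.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag bits, assumed to mirror XDP_FLAGS_* from uapi. */
#define MODEL_F_UPDATE_IF_NOEXIST	(1U << 0)
#define MODEL_F_SKB_MODE		(1U << 1)
#define MODEL_F_DRV_MODE		(1U << 2)
#define MODEL_F_HW_MODE			(1U << 3)

/* Stand-in for the xdp_op_t function pointers compared in the original. */
enum model_hook { HOOK_NONE, HOOK_NATIVE, HOOK_GENERIC };

struct model_dev {
	bool has_native;	/* driver provides ndo_xdp */
	bool native_attached;	/* a program sits on the native hook */
	bool generic_attached;	/* a program sits on the generic hook */
};

static bool model_attached(const struct model_dev *dev, enum model_hook h)
{
	if (h == HOOK_NATIVE)
		return dev->native_attached;
	if (h == HOOK_GENERIC)
		return dev->generic_attached;
	return false;
}

/* Returns 0 if the request would proceed to install, else -errno. */
int model_xdp_check(const struct model_dev *dev, int fd, unsigned int flags)
{
	enum model_hook op, chk;

	op = chk = dev->has_native ? HOOK_NATIVE : HOOK_NONE;
	/* Explicit driver or hardware mode needs a native hook. */
	if (op == HOOK_NONE && (flags & (MODEL_F_DRV_MODE | MODEL_F_HW_MODE)))
		return -EOPNOTSUPP;
	/* Fall back to, or explicitly select, the generic hook. */
	if (op == HOOK_NONE || (flags & MODEL_F_SKB_MODE))
		op = HOOK_GENERIC;
	/* chk becomes "the other flavor" when the native hook was chosen. */
	if (op == chk)
		chk = HOOK_GENERIC;

	if (fd >= 0) {
		/* Refuse to load while the other flavor is attached. */
		if (chk != HOOK_NONE && model_attached(dev, chk))
			return -EEXIST;
		/* Honor "only if nothing exists" on the chosen hook. */
		if ((flags & MODEL_F_UPDATE_IF_NOEXIST) &&
		    model_attached(dev, op))
			return -EBUSY;
	}
	return 0;
}
```

For example, a device with a native program attached returns `-EEXIST` when asked to load in SKB mode. The same device returns 0 for a plain reload, which replaces the native program.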
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/**
|
|
|
|
* dev_new_index - allocate an ifindex
|
2007-10-13 04:17:49 +00:00
|
|
|
* @net: the applicable net namespace
|
2005-04-16 22:20:36 +00:00
|
|
|
*
|
|
|
|
* Returns a suitable unique value for a new device interface
|
|
|
|
* number. The caller must hold the rtnl semaphore or the
|
|
|
|
* dev_base_lock to be sure it remains unique.
|
|
|
|
*/
|
2007-09-17 18:56:21 +00:00
|
|
|
static int dev_new_index(struct net *net)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2012-08-08 21:53:19 +00:00
|
|
|
int ifindex = net->ifindex;
|
2017-02-09 06:56:07 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
for (;;) {
|
|
|
|
if (++ifindex <= 0)
|
|
|
|
ifindex = 1;
|
2007-09-17 18:56:21 +00:00
|
|
|
if (!__dev_get_by_index(net, ifindex))
|
2012-08-08 21:53:19 +00:00
|
|
|
return net->ifindex = ifindex;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
}
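The wrap-around search in dev_new_index() can be illustrated outside the kernel. This is a sketch, not the kernel code: `next_ifindex`, `struct used_set`, and `used_set_contains` are hypothetical names, the callback stands in for __dev_get_by_index(), and the wrap is written without signed overflow because portable C does not define `++ifindex` at INT_MAX.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical lookup callback standing in for __dev_get_by_index(). */
typedef bool (*index_in_use_fn)(int ifindex, void *ctx);

/* Small fixed set of taken indexes, used only to demonstrate the search. */
struct used_set {
	const int *vals;
	int n;
};

bool used_set_contains(int ifindex, void *ctx)
{
	const struct used_set *set = ctx;
	int i;

	for (i = 0; i < set->n; i++)
		if (set->vals[i] == ifindex)
			return true;
	return false;
}

/*
 * Advance from *last to the next free positive index, wrapping to 1
 * after INT_MAX, and remember the result in *last. Like the original,
 * this loops forever if every positive index is in use.
 */
int next_ifindex(int *last, index_in_use_fn in_use, void *ctx)
{
	int ifindex = *last;

	for (;;) {
		if (ifindex < 0 || ifindex == INT_MAX)
			ifindex = 1;	/* wrap without overflowing */
		else
			ifindex++;
		if (!in_use(ifindex, ctx)) {
			*last = ifindex;
			return ifindex;
		}
	}
}
```

Storing the result back through `last` mirrors `return net->ifindex = ifindex;`, so the next search starts after the most recently handed-out value instead of rescanning from 1.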
|
|
|
|
|
|
|
|
/* Delayed registration/unregistration */
|
2007-12-07 08:49:17 +00:00
|
|
|
static LIST_HEAD(net_todo_list);
|
2014-05-12 22:11:20 +00:00
|
|
|
DECLARE_WAIT_QUEUE_HEAD(netdev_unregistering_wq);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2007-03-09 04:46:03 +00:00
|
|
|
static void net_set_todo(struct net_device *dev)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
list_add_tail(&dev->todo_list, &net_todo_list);
|
2013-09-24 04:19:49 +00:00
|
|
|
dev_net(dev)->dev_unreg_count++;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
2009-10-27 07:04:19 +00:00
|
|
|
static void rollback_registered_many(struct list_head *head)
|
2007-10-30 22:38:18 +00:00
|
|
|
{
|
net: Handle NETREG_UNINITIALIZED devices correctly
Fix two problems:
1. If unregister_netdevice_many() is called with both registered
and unregistered devices, rollback_registered_many() bails out
when it reaches the first unregistered device. The processing
of the prior registered devices is unfinished, and the
remaining devices are skipped, and possible registered netdev's
are leaked/unregistered.
2. System hangs or panics depending on how the devices are passed,
since when netdev_run_todo() runs, some devices were not fully
processed.
Tested by passing intermingled unregistered and registered vlan
devices to unregister_netdevice_many() as follows:
1. dev, fake_dev1, fake_dev2: hangs in run_todo
("unregister_netdevice: waiting for eth1.100 to become
free. Usage count = 1")
2. fake_dev1, dev, fake_dev2: failure during de-registration
and next registration, followed by a vlan driver Oops
during subsequent registration.
Confirmed that the patch fixes both cases.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-12-08 22:26:02 +00:00
|
|
|
struct net_device *dev, *tmp;
|
2013-10-06 02:26:05 +00:00
|
|
|
LIST_HEAD(close_head);
|
2009-10-27 07:04:19 +00:00
|
|
|
|
2007-10-30 22:38:18 +00:00
|
|
|
BUG_ON(dev_boot_phase);
|
|
|
|
ASSERT_RTNL();
|
|
|
|
|
2009-12-08 22:26:02 +00:00
|
|
|
list_for_each_entry_safe(dev, tmp, head, unreg_list) {
|
2009-10-27 07:04:19 +00:00
|
|
|
/* Some devices call without registering
|
2009-12-08 22:26:02 +00:00
|
|
|
* for initialization unwind. Remove those
|
|
|
|
* devices and proceed with the remaining.
|
2009-10-27 07:04:19 +00:00
|
|
|
*/
|
|
|
|
if (dev->reg_state == NETREG_UNINITIALIZED) {
|
2012-02-01 10:54:43 +00:00
|
|
|
pr_debug("unregister_netdevice: device %s/%p never was registered\n",
|
|
|
|
dev->name, dev);
|
2007-10-30 22:38:18 +00:00
|
|
|
|
2009-10-27 07:04:19 +00:00
|
|
|
WARN_ON(1);
|
2009-12-08 22:26:02 +00:00
|
|
|
list_del(&dev->unreg_list);
|
|
|
|
continue;
|
2009-10-27 07:04:19 +00:00
|
|
|
}
|
2011-05-19 12:24:16 +00:00
|
|
|
dev->dismantle = true;
|
2009-10-27 07:04:19 +00:00
|
|
|
BUG_ON(dev->reg_state != NETREG_REGISTERED);
|
2010-12-13 12:44:07 +00:00
|
|
|
}
|
2007-10-30 22:38:18 +00:00
|
|
|
|
2010-12-13 12:44:07 +00:00
|
|
|
/* If device is running, close it first. */
|
2013-10-06 02:26:05 +00:00
|
|
|
list_for_each_entry(dev, head, unreg_list)
|
|
|
|
list_add_tail(&dev->close_list, &close_head);
|
2015-03-19 02:52:33 +00:00
|
|
|
dev_close_many(&close_head, true);
|
2007-10-30 22:38:18 +00:00
|
|
|
|
2010-12-13 12:44:07 +00:00
|
|
|
list_for_each_entry(dev, head, unreg_list) {
|
2009-10-27 07:04:19 +00:00
|
|
|
/* And unlink it from device chain. */
|
|
|
|
unlist_netdevice(dev);
|
2007-10-30 22:38:18 +00:00
|
|
|
|
2009-10-27 07:04:19 +00:00
|
|
|
dev->reg_state = NETREG_UNREGISTERING;
|
|
|
|
}
|
2016-08-26 19:50:39 +00:00
|
|
|
flush_all_backlogs();
|
2007-10-30 22:38:18 +00:00
|
|
|
|
|
|
|
synchronize_net();
|
|
|
|
|
2009-10-27 07:04:19 +00:00
|
|
|
list_for_each_entry(dev, head, unreg_list) {
|
2014-12-03 21:46:24 +00:00
|
|
|
struct sk_buff *skb = NULL;
|
|
|
|
|
2009-10-27 07:04:19 +00:00
|
|
|
/* Shutdown queueing discipline. */
|
|
|
|
dev_shutdown(dev);
|
2007-10-30 22:38:18 +00:00
|
|
|
|
|
|
|
|
2009-10-27 07:04:19 +00:00
|
|
|
/* Notify protocols that we are about to destroy
|
2017-02-09 06:56:06 +00:00
|
|
|
* this device. They should clean all the things.
|
|
|
|
*/
|
2009-10-27 07:04:19 +00:00
|
|
|
call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
|
2007-10-30 22:38:18 +00:00
|
|
|
|
2014-12-03 21:46:24 +00:00
|
|
|
if (!dev->rtnl_link_ops ||
|
|
|
|
dev->rtnl_link_state == RTNL_LINK_INITIALIZED)
|
2017-05-27 14:14:34 +00:00
|
|
|
skb = rtmsg_ifinfo_build_skb(RTM_DELLINK, dev, ~0U, 0,
|
2014-12-03 21:46:24 +00:00
|
|
|
GFP_KERNEL);
|
|
|
|
|
2009-10-27 07:04:19 +00:00
|
|
|
/*
|
|
|
|
* Flush the unicast and multicast chains
|
|
|
|
*/
|
2010-04-01 21:22:09 +00:00
|
|
|
dev_uc_flush(dev);
|
2010-04-01 21:22:57 +00:00
|
|
|
dev_mc_flush(dev);
|
2007-10-30 22:38:18 +00:00
|
|
|
|
2009-10-27 07:04:19 +00:00
|
|
|
if (dev->netdev_ops->ndo_uninit)
|
|
|
|
dev->netdev_ops->ndo_uninit(dev);
|
2007-10-30 22:38:18 +00:00
|
|
|
|
2014-12-03 21:46:24 +00:00
|
|
|
if (skb)
|
|
|
|
rtmsg_ifinfo_send(skb, dev, GFP_KERNEL);
|
2014-05-01 18:40:30 +00:00
|
|
|
|
2013-01-03 22:48:49 +00:00
|
|
|
/* Notifier chain MUST detach us from all upper devices. */
|
|
|
|
WARN_ON(netdev_has_any_upper_dev(dev));
|
2016-10-18 02:15:52 +00:00
|
|
|
WARN_ON(netdev_has_any_lower_dev(dev));
|
2007-10-30 22:38:18 +00:00
|
|
|
|
2009-10-27 07:04:19 +00:00
|
|
|
/* Remove entries from kobject tree */
|
|
|
|
netdev_unregister_kobject(dev);
|
2013-01-10 08:57:46 +00:00
|
|
|
#ifdef CONFIG_XPS
|
|
|
|
/* Remove XPS queueing entries */
|
|
|
|
netif_reset_xps_queues_gt(dev, 0);
|
|
|
|
#endif
|
2009-10-27 07:04:19 +00:00
|
|
|
}
|
2007-10-30 22:38:18 +00:00
|
|
|
|
2011-10-13 22:25:23 +00:00
|
|
|
synchronize_net();
|
2009-11-16 13:49:35 +00:00
|
|
|
|
2009-11-29 15:45:58 +00:00
|
|
|
list_for_each_entry(dev, head, unreg_list)
|
2009-10-27 07:04:19 +00:00
|
|
|
dev_put(dev);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void rollback_registered(struct net_device *dev)
|
|
|
|
{
|
|
|
|
LIST_HEAD(single);
|
|
|
|
|
|
|
|
list_add(&dev->unreg_list, &single);
|
|
|
|
rollback_registered_many(&single);
|
2011-02-17 22:59:19 +00:00
|
|
|
list_del(&single);
|
2007-10-30 22:38:18 +00:00
|
|
|
}
|
|
|
|
|
net/core: generic support for disabling netdev features down stack
There are some netdev features, which when disabled on an upper device,
such as a bonding master or a bridge, must be disabled and cannot be
re-enabled on underlying devices.
This is a rework of an earlier more heavy-handed approach, which simply
disables and prevents re-enabling of netdev features listed in a new
define in include/net/netdev_features.h, NETIF_F_UPPER_DISABLES. When any upper
device disables a flag in that feature mask, the disabling will
propagate down the stack, and any lower device that has any upper device
with one of those flags disabled should not be able to enable said flag.
Initially, only LRO is included for proof of concept, and because this
code effectively does the same thing as dev_disable_lro(), though it will
also activate from the ethtool path, which was one of the goals here.
[root@dell-per730-01 ~]# ethtool -k bond0 |grep large
large-receive-offload: on
[root@dell-per730-01 ~]# ethtool -k p5p1 |grep large
large-receive-offload: on
[root@dell-per730-01 ~]# ethtool -K bond0 lro off
[root@dell-per730-01 ~]# ethtool -k bond0 |grep large
large-receive-offload: off
[root@dell-per730-01 ~]# ethtool -k p5p1 |grep large
large-receive-offload: off
dmesg dump:
[ 1033.277986] bond0: Disabling feature 0x0000000000008000 on lower dev p5p2.
[ 1034.067949] bnx2x 0000:06:00.1 p5p2: using MSI-X IRQs: sp 74 fp[0] 76 ... fp[7] 83
[ 1034.753612] bond0: Disabling feature 0x0000000000008000 on lower dev p5p1.
[ 1035.591019] bnx2x 0000:06:00.0 p5p1: using MSI-X IRQs: sp 62 fp[0] 64 ... fp[7] 71
This has been successfully tested with bnx2x, qlcnic and netxen network
cards as slaves in a bond interface. Turning LRO on or off on the master
also turns it on or off on each of the slaves, new slaves are added with
LRO in the same state as the master, and LRO can't be toggled on the
slaves.
Also, this should largely remove the need for dev_disable_lro(), and most,
if not all, of its call sites can be replaced by simply making sure
NETIF_F_LRO isn't included in the relevant device's feature flags.
Note that this patch is driven by bug reports from users saying it was
confusing that bonds and slaves had different settings for the same
features, and while it won't be 100% in sync if a lower device doesn't
support a feature like LRO, I think this is a good step in the right
direction.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Nikolay Aleksandrov <razor@blackwall.org>
CC: Michal Kubecek <mkubecek@suse.cz>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-03 02:55:59 +00:00
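The propagation rule described in the commit message above can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the kernel implementation: `struct model_dev`, `model_update_features()`, `F_LRO`, `F_SG` and `F_UPPER_DISABLES` are hypothetical names invented here, each device is assumed to have at most one upper device, and the LRO bit is placed at bit 15 only so that it matches the 0x0000000000008000 mask shown in the dmesg excerpt.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
typedef uint64_t features_t;

#define F_SG             ((features_t)1 << 0)
#define F_LRO            ((features_t)1 << 15) /* 0x8000, as in the dmesg lines */
#define F_UPPER_DISABLES F_LRO                 /* features that propagate down */

struct model_dev {
	const char *name;
	features_t hw_features;     /* what the device is able to do */
	features_t wanted_features; /* what the user asked for */
	features_t features;        /* what is currently active */
	struct model_dev *upper;    /* assumption: at most one upper device */
	struct model_dev *lowers[4];
	size_t n_lowers;
};

/* Recompute the active features of dev, then sync its lower devices. */
void model_update_features(struct model_dev *dev)
{
	features_t features = dev->wanted_features & dev->hw_features;
	size_t i;

	/* A propagating feature that the upper device has off is vetoed here. */
	if (dev->upper)
		features &= ~(F_UPPER_DISABLES & ~dev->upper->wanted_features);

	dev->features = features;

	/* Every propagating feature that is now off is pushed down the stack. */
	for (i = 0; i < dev->n_lowers; i++) {
		struct model_dev *lower = dev->lowers[i];
		features_t drop = F_UPPER_DISABLES & ~features & lower->features;

		if (drop) {
			lower->wanted_features &= ~drop;
			model_update_features(lower);
		}
	}
}
```

In this model, clearing `F_LRO` in the wanted features of a bond-like device and calling `model_update_features()` on it clears the bit on every lower device, and a later attempt to set the bit again on a lower device is masked out by the upper-device check. That mirrors the two halves of the behaviour the commit message describes (disable down the stack, prevent re-enabling below), without modelling notifiers, RTNL locking or driver callbacks.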
|
|
|
static netdev_features_t netdev_sync_upper_features(struct net_device *lower,
|
|
|
|
struct net_device *upper, netdev_features_t features)
|
|
|
|
{
|
|
|
|
netdev_features_t upper_disables = NETIF_F_UPPER_DISABLES;
|
|
|
|
netdev_features_t feature;
|
2015-11-03 15:15:59 +00:00
|
|
|
int feature_bit;
|
2015-11-03 02:55:59 +00:00
|
|
|
|
2015-11-03 15:15:59 +00:00
|
|
|
for_each_netdev_feature(&upper_disables, feature_bit) {
|
|
|
|
feature = __NETIF_F_BIT(feature_bit);
|
2015-11-03 02:55:59 +00:00
|
|
|
if (!(upper->wanted_features & feature)
|
|
|
|
&& (features & feature)) {
|
|
|
|
netdev_dbg(lower, "Dropping feature %pNF, upper dev %s has it off.\n",
|
|
|
|
&feature, upper->name);
|
|
|
|
features &= ~feature;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return features;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void netdev_sync_lower_features(struct net_device *upper,
|
|
|
|
struct net_device *lower, netdev_features_t features)
|
|
|
|
{
|
|
|
|
netdev_features_t upper_disables = NETIF_F_UPPER_DISABLES;
|
|
|
|
netdev_features_t feature;
|
2015-11-03 15:15:59 +00:00
|
|
|
int feature_bit;
|
2015-11-03 02:55:59 +00:00
|
|
|
|
2015-11-03 15:15:59 +00:00
|
|
|
for_each_netdev_feature(&upper_disables, feature_bit) {
|
|
|
|
feature = __NETIF_F_BIT(feature_bit);
|
2015-11-03 02:55:59 +00:00
|
|
|
if (!(features & feature) && (lower->features & feature)) {
|
|
|
|
netdev_dbg(upper, "Disabling feature %pNF on lower dev %s.\n",
|
|
|
|
&feature, lower->name);
|
|
|
|
lower->wanted_features &= ~feature;
|
|
|
|
netdev_update_features(lower);
|
|
|
|
|
|
|
|
if (unlikely(lower->features & feature))
|
|
|
|
netdev_WARN(upper, "failed to disable %pNF on %s!\n",
|
|
|
|
&feature, lower->name);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2011-11-15 15:29:55 +00:00
|
|
|
static netdev_features_t netdev_fix_features(struct net_device *dev,
|
|
|
|
netdev_features_t features)
|
2008-10-23 08:11:29 +00:00
|
|
|
{
|
2011-01-22 12:14:12 +00:00
|
|
|
/* Fix illegal checksum combinations */
|
|
|
|
if ((features & NETIF_F_HW_CSUM) &&
|
|
|
|
(features & (NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM))) {
|
2011-05-16 19:14:21 +00:00
|
|
|
netdev_warn(dev, "mixed HW and IP checksum settings.\n");
|
2011-01-22 12:14:12 +00:00
|
|
|
features &= ~(NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM);
|
|
|
|
}
|
|
|
|
|
2008-10-23 08:11:29 +00:00
|
|
|
/* TSO requires that SG is present as well. */
|
2011-04-12 14:38:37 +00:00
|
|
|
if ((features & NETIF_F_ALL_TSO) && !(features & NETIF_F_SG)) {
|
2011-05-16 19:14:21 +00:00
|
|
|
netdev_dbg(dev, "Dropping TSO features since no SG feature.\n");
|
2011-04-12 14:38:37 +00:00
|
|
|
features &= ~NETIF_F_ALL_TSO;
|
2008-10-23 08:11:29 +00:00
|
|
|
}
|
|
|
|
|
2013-03-07 09:28:01 +00:00
|
|
|
if ((features & NETIF_F_TSO) && !(features & NETIF_F_HW_CSUM) &&
|
|
|
|
!(features & NETIF_F_IP_CSUM)) {
|
|
|
|
netdev_dbg(dev, "Dropping TSO features since no CSUM feature.\n");
|
|
|
|
features &= ~NETIF_F_TSO;
|
|
|
|
features &= ~NETIF_F_TSO_ECN;
|
|
|
|
}
|
|
|
|
|
|
|
|
if ((features & NETIF_F_TSO6) && !(features & NETIF_F_HW_CSUM) &&
|
|
|
|
!(features & NETIF_F_IPV6_CSUM)) {
|
|
|
|
netdev_dbg(dev, "Dropping TSO6 features since no CSUM feature.\n");
|
|
|
|
features &= ~NETIF_F_TSO6;
|
|
|
|
}
|
|
|
|
|
2016-05-02 16:38:24 +00:00
|
|
|
/* TSO with IPv4 ID mangling requires IPv4 TSO be enabled */
|
|
|
|
if ((features & NETIF_F_TSO_MANGLEID) && !(features & NETIF_F_TSO))
|
|
|
|
features &= ~NETIF_F_TSO_MANGLEID;
|
|
|
|
|
2011-04-12 14:47:15 +00:00
|
|
|
/* TSO ECN requires that TSO is present as well. */
|
|
|
|
if ((features & NETIF_F_ALL_TSO) == NETIF_F_TSO_ECN)
|
|
|
|
features &= ~NETIF_F_TSO_ECN;
|
|
|
|
|
2011-02-15 16:59:16 +00:00
|
|
|
/* Software GSO depends on SG. */
|
|
|
|
if ((features & NETIF_F_GSO) && !(features & NETIF_F_SG)) {
|
2011-05-16 19:14:21 +00:00
|
|
|
netdev_dbg(dev, "Dropping NETIF_F_GSO since no SG feature.\n");
|
2011-02-15 16:59:16 +00:00
|
|
|
features &= ~NETIF_F_GSO;
|
|
|
|
}
|
|
|
|
|
2016-04-11 01:45:03 +00:00
|
|
|
/* GSO partial features require GSO partial be set */
|
|
|
|
if ((features & dev->gso_partial_features) &&
|
|
|
|
!(features & NETIF_F_GSO_PARTIAL)) {
|
|
|
|
netdev_dbg(dev,
|
|
|
|
"Dropping partially supported GSO features since no GSO partial.\n");
|
|
|
|
features &= ~dev->gso_partial_features;
|
|
|
|
}
|
|
|
|
|
2008-10-23 08:11:29 +00:00
|
|
|
return features;
|
|
|
|
}
|
|
|
|
|
2011-04-03 05:48:47 +00:00
|
|
|
int __netdev_update_features(struct net_device *dev)
|
2011-02-15 16:59:17 +00:00
|
|
|
{
|
2015-11-03 02:55:59 +00:00
|
|
|
struct net_device *upper, *lower;
|
2011-11-15 15:29:55 +00:00
|
|
|
netdev_features_t features;
|
2015-11-03 02:55:59 +00:00
|
|
|
struct list_head *iter;
|
2015-11-04 04:09:32 +00:00
|
|
|
int err = -1;
|
2011-02-15 16:59:17 +00:00
|
|
|
|
2011-04-12 09:56:38 +00:00
|
|
|
ASSERT_RTNL();
|
|
|
|
|
2011-02-15 16:59:17 +00:00
|
|
|
features = netdev_get_wanted_features(dev);
|
|
|
|
|
|
|
|
if (dev->netdev_ops->ndo_fix_features)
|
|
|
|
features = dev->netdev_ops->ndo_fix_features(dev, features);
|
|
|
|
|
|
|
|
/* driver might be less strict about feature dependencies */
|
|
|
|
features = netdev_fix_features(dev, features);
|
|
|
|
|
2015-11-03 02:55:59 +00:00
|
|
|
/* some features can't be enabled if they're off on an upper device */
|
|
|
|
netdev_for_each_upper_dev_rcu(dev, upper, iter)
|
|
|
|
features = netdev_sync_upper_features(dev, upper, features);
|
|
|
|
|
2011-02-15 16:59:17 +00:00
|
|
|
if (dev->features == features)
|
2015-11-04 04:09:32 +00:00
|
|
|
goto sync_lower;
|
2011-02-15 16:59:17 +00:00
|
|
|
|
2011-11-15 15:29:55 +00:00
|
|
|
netdev_dbg(dev, "Features changed: %pNF -> %pNF\n",
|
|
|
|
&dev->features, &features);
|
2011-02-15 16:59:17 +00:00
|
|
|
|
|
|
|
if (dev->netdev_ops->ndo_set_features)
|
|
|
|
err = dev->netdev_ops->ndo_set_features(dev, features);
|
2015-11-13 13:54:01 +00:00
|
|
|
else
|
|
|
|
err = 0;
|
2011-02-15 16:59:17 +00:00
|
|
|
|
2011-04-03 05:48:47 +00:00
|
|
|
if (unlikely(err < 0)) {
|
2011-02-15 16:59:17 +00:00
|
|
|
netdev_err(dev,
|
2011-11-15 15:29:55 +00:00
|
|
|
"set_features() failed (%d); wanted %pNF, left %pNF\n",
|
|
|
|
err, &features, &dev->features);
|
2015-11-17 14:49:06 +00:00
|
|
|
/* return non-0 since some features might have changed and
|
|
|
|
* it's better to fire a spurious notification than miss it
|
|
|
|
*/
|
|
|
|
return -1;
|
2011-04-03 05:48:47 +00:00
|
|
|
}
|
|
|
|
|
2015-11-04 04:09:32 +00:00
|
|
|
sync_lower:
|
2015-11-03 02:55:59 +00:00
|
|
|
/* some features must be disabled on lower devices when disabled
|
|
|
|
* on an upper device (think: bonding master or bridge)
|
|
|
|
*/
|
|
|
|
netdev_for_each_lower_dev(dev, lower, iter)
|
|
|
|
netdev_sync_lower_features(dev, lower, features);
|
|
|
|
|
2017-07-21 10:49:31 +00:00
|
|
|
if (!err) {
|
|
|
|
netdev_features_t diff = features ^ dev->features;
|
|
|
|
|
|
|
|
if (diff & NETIF_F_RX_UDP_TUNNEL_PORT) {
|
|
|
|
/* udp_tunnel_{get,drop}_rx_info both need
|
|
|
|
* NETIF_F_RX_UDP_TUNNEL_PORT enabled on the
|
|
|
|
* device, or they won't do anything.
|
|
|
|
* Thus we need to update dev->features
|
|
|
|
* *before* calling udp_tunnel_get_rx_info,
|
|
|
|
* but *after* calling udp_tunnel_drop_rx_info.
|
|
|
|
*/
|
|
|
|
if (features & NETIF_F_RX_UDP_TUNNEL_PORT) {
|
|
|
|
dev->features = features;
|
|
|
|
udp_tunnel_get_rx_info(dev);
|
|
|
|
} else {
|
|
|
|
udp_tunnel_drop_rx_info(dev);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2011-04-03 05:48:47 +00:00
|
|
|
dev->features = features;
|
2017-07-21 10:49:31 +00:00
|
|
|
}
|
2011-04-03 05:48:47 +00:00
|
|
|
|
2015-11-04 04:09:32 +00:00
|
|
|
return err < 0 ? 0 : 1;
|
2011-04-03 05:48:47 +00:00
|
|
|
}
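The down-stack propagation done at the sync_lower label can be modelled in user space. This is a minimal sketch, not the kernel implementation: `struct toy_dev`, `toy_sync_lower_features()` and the `TOY_F_*` bits are hypothetical names, `TOY_F_UPPER_DISABLES` stands in for NETIF_F_UPPER_DISABLES, and the recursion is a simplification of the kernel re-running the feature update on each lower device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_features_t;

/* Hypothetical bit values. 0x8000 matches the LRO bit printed in the
 * dmesg dump of the commit message above. */
#define TOY_F_SG             (1ULL << 0)
#define TOY_F_LRO            (1ULL << 15)
#define TOY_F_UPPER_DISABLES TOY_F_LRO

struct toy_dev {
	toy_features_t features;
	struct toy_dev **lower;	/* devices stacked below this one */
	size_t n_lower;
};

/* Clear, on every lower device, each "upper disables" bit that is off in
 * upper_features, then descend one level. Returns how many devices changed. */
static int toy_sync_lower_features(struct toy_dev *upper,
				   toy_features_t upper_features)
{
	toy_features_t must_be_off = TOY_F_UPPER_DISABLES & ~upper_features;
	int changed = 0;
	size_t i;

	assert(upper != NULL);
	for (i = 0; i < upper->n_lower; i++) {
		struct toy_dev *lower = upper->lower[i];

		if (lower->features & must_be_off) {
			lower->features &= ~must_be_off;
			changed++;
		}
		changed += toy_sync_lower_features(lower, lower->features);
	}
	return changed;
}
```

With a bond-like upper device that has LRO off and two lower devices, only the lower device that still had LRO set is modified; features outside the mask are left alone.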
|
|
|
|
|
2011-05-07 03:22:17 +00:00
|
|
|
/**
|
|
|
|
* netdev_update_features - recalculate device features
|
|
|
|
* @dev: the device to check
|
|
|
|
*
|
|
|
|
* Recalculate dev->features set and send notifications if it
|
|
|
|
* has changed. Should be called after driver or hardware dependent
|
|
|
|
* conditions might have changed that influence the features.
|
|
|
|
*/
|
2011-04-03 05:48:47 +00:00
|
|
|
void netdev_update_features(struct net_device *dev)
|
|
|
|
{
|
|
|
|
if (__netdev_update_features(dev))
|
|
|
|
netdev_features_change(dev);
|
2011-02-15 16:59:17 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_update_features);
|
|
|
|
|
2011-05-07 03:22:17 +00:00
|
|
|
/**
|
|
|
|
* netdev_change_features - recalculate device features
|
|
|
|
* @dev: the device to check
|
|
|
|
*
|
|
|
|
* Recalculate dev->features set and send notifications even
|
|
|
|
* if they have not changed. Should be called instead of
|
|
|
|
* netdev_update_features() if dev->vlan_features might also
|
|
|
|
* have changed to allow the changes to be propagated to stacked
|
|
|
|
* VLAN devices.
|
|
|
|
*/
|
|
|
|
void netdev_change_features(struct net_device *dev)
|
|
|
|
{
|
|
|
|
__netdev_update_features(dev);
|
|
|
|
netdev_features_change(dev);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_change_features);
|
|
|
|
|
2009-12-03 23:59:22 +00:00
|
|
|
/**
|
|
|
|
* netif_stacked_transfer_operstate - transfer operstate
|
|
|
|
* @rootdev: the root or lower level device to transfer state from
|
|
|
|
* @dev: the device to transfer operstate to
|
|
|
|
*
|
|
|
|
* Transfer operational state from root to device. This is normally
|
|
|
|
* called when a stacking relationship exists between the root
|
|
|
|
* device and the device (a leaf device).
|
|
|
|
*/
|
|
|
|
void netif_stacked_transfer_operstate(const struct net_device *rootdev,
|
|
|
|
struct net_device *dev)
|
|
|
|
{
|
|
|
|
if (rootdev->operstate == IF_OPER_DORMANT)
|
|
|
|
netif_dormant_on(dev);
|
|
|
|
else
|
|
|
|
netif_dormant_off(dev);
|
|
|
|
|
2017-04-26 09:49:38 +00:00
|
|
|
if (netif_carrier_ok(rootdev))
|
|
|
|
netif_carrier_on(dev);
|
|
|
|
else
|
|
|
|
netif_carrier_off(dev);
|
2009-12-03 23:59:22 +00:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netif_stacked_transfer_operstate);
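The transfer above copies two independent facts, dormancy and carrier, from the root device to the stacked one. The sketch below models that in user space. `struct toy_dev`, the three-value `enum toy_operstate` and the synchronous recomputation in `toy_default_operstate()` are assumptions made for illustration; in the kernel the operstate is recomputed by linkwatch rather than inline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_operstate { TOY_OPER_DOWN, TOY_OPER_DORMANT, TOY_OPER_UP };

struct toy_dev {
	enum toy_operstate operstate;
	bool carrier;
	bool dormant;
};

/* Simplified policy: no carrier means down, otherwise dormant or up. */
static enum toy_operstate toy_default_operstate(const struct toy_dev *dev)
{
	if (!dev->carrier)
		return TOY_OPER_DOWN;
	return dev->dormant ? TOY_OPER_DORMANT : TOY_OPER_UP;
}

/* Mirror the root device's dormant and carrier state onto dev. */
static void toy_stacked_transfer_operstate(const struct toy_dev *rootdev,
					   struct toy_dev *dev)
{
	assert(rootdev != NULL && dev != NULL);
	dev->dormant = (rootdev->operstate == TOY_OPER_DORMANT);
	dev->carrier = rootdev->carrier;
	dev->operstate = toy_default_operstate(dev);
}
```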
|
|
|
|
|
2014-01-17 06:23:28 +00:00
|
|
|
#ifdef CONFIG_SYSFS
|
2010-09-23 17:26:35 +00:00
|
|
|
static int netif_alloc_rx_queues(struct net_device *dev)
|
|
|
|
{
|
|
|
|
unsigned int i, count = dev->num_rx_queues;
|
2010-10-18 18:00:16 +00:00
|
|
|
struct netdev_rx_queue *rx;
|
2015-01-12 06:11:28 +00:00
|
|
|
size_t sz = count * sizeof(*rx);
|
2010-09-23 17:26:35 +00:00
|
|
|
|
2010-10-18 18:00:16 +00:00
|
|
|
BUG_ON(count < 1);
|
2010-09-23 17:26:35 +00:00
|
|
|
|
2017-07-12 21:36:45 +00:00
|
|
|
rx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
|
2017-05-08 22:57:31 +00:00
|
|
|
if (!rx)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2010-10-18 18:00:16 +00:00
|
|
|
dev->_rx = rx;
|
|
|
|
|
|
|
|
for (i = 0; i < count; i++)
|
2010-11-09 10:47:38 +00:00
|
|
|
rx[i].dev = dev;
|
2010-09-23 17:26:35 +00:00
|
|
|
return 0;
|
|
|
|
}
|
2010-11-26 08:36:09 +00:00
|
|
|
#endif
|
2010-09-23 17:26:35 +00:00
|
|
|
|
2010-12-04 02:31:41 +00:00
|
|
|
static void netdev_init_one_queue(struct net_device *dev,
|
|
|
|
struct netdev_queue *queue, void *_unused)
|
|
|
|
{
|
|
|
|
/* Initialize queue lock */
|
|
|
|
spin_lock_init(&queue->_xmit_lock);
|
|
|
|
netdev_set_xmit_lockdep_class(&queue->_xmit_lock, dev->type);
|
|
|
|
queue->xmit_lock_owner = -1;
|
2010-12-14 03:09:15 +00:00
|
|
|
netdev_queue_numa_node_write(queue, NUMA_NO_NODE);
|
2010-12-04 02:31:41 +00:00
|
|
|
queue->dev = dev;
|
2011-11-28 16:33:09 +00:00
|
|
|
#ifdef CONFIG_BQL
|
|
|
|
dql_init(&queue->dql, HZ);
|
|
|
|
#endif
|
2010-12-04 02:31:41 +00:00
|
|
|
}
|
|
|
|
|
2013-06-20 08:15:51 +00:00
|
|
|
static void netif_free_tx_queues(struct net_device *dev)
|
|
|
|
{
|
2014-06-02 22:55:22 +00:00
|
|
|
kvfree(dev->_tx);
|
2013-06-20 08:15:51 +00:00
|
|
|
}
|
|
|
|
|
2010-10-18 18:04:39 +00:00
|
|
|
static int netif_alloc_netdev_queues(struct net_device *dev)
|
|
|
|
{
|
|
|
|
unsigned int count = dev->num_tx_queues;
|
|
|
|
struct netdev_queue *tx;
|
2013-06-20 08:15:51 +00:00
|
|
|
size_t sz = count * sizeof(*tx);
|
2010-10-18 18:04:39 +00:00
|
|
|
|
2015-07-06 15:13:26 +00:00
|
|
|
if (count < 1 || count > 0xffff)
|
|
|
|
return -EINVAL;
|
2013-02-04 16:48:16 +00:00
|
|
|
|
2017-07-12 21:36:45 +00:00
|
|
|
tx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
|
2017-05-08 22:57:31 +00:00
|
|
|
if (!tx)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2010-10-18 18:04:39 +00:00
|
|
|
dev->_tx = tx;
|
2010-11-21 13:17:27 +00:00
|
|
|
|
2010-10-18 18:04:39 +00:00
|
|
|
netdev_for_each_tx_queue(dev, netdev_init_one_queue, NULL);
|
|
|
|
spin_lock_init(&dev->tx_global_lock);
|
2010-12-04 02:31:41 +00:00
|
|
|
|
|
|
|
return 0;
|
2010-10-18 18:04:39 +00:00
|
|
|
}
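The allocation pattern above (validate the count, allocate a zeroed array, then initialise each queue) can be sketched in user space. `struct toy_queue`, `struct toy_netdev` and the two helpers are hypothetical names, and `calloc()`/`free()` stand in for kvzalloc()/kvfree().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct toy_queue {
	void *dev;		/* back-pointer to the owning device */
	int xmit_lock_owner;	/* -1 means no CPU holds the lock */
};

struct toy_netdev {
	unsigned int num_tx_queues;
	struct toy_queue *tx;
};

static int toy_alloc_tx_queues(struct toy_netdev *dev)
{
	unsigned int i, count;
	struct toy_queue *tx;

	assert(dev != NULL);
	count = dev->num_tx_queues;

	/* Same bounds as the original: at least 1, at most 0xffff. */
	if (count < 1 || count > 0xffff)
		return -EINVAL;

	tx = calloc(count, sizeof(*tx));	/* zeroed, like kvzalloc() */
	if (!tx)
		return -ENOMEM;

	dev->tx = tx;
	for (i = 0; i < count; i++) {
		tx[i].dev = dev;
		tx[i].xmit_lock_owner = -1;
	}
	return 0;
}

static void toy_free_tx_queues(struct toy_netdev *dev)
{
	free(dev->tx);
	dev->tx = NULL;
}
```

Rejecting an out-of-range count before allocating keeps the error paths distinct: -EINVAL for a bad request, -ENOMEM only for a genuine allocation failure.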
|
|
|
|
|
2015-05-11 19:17:53 +00:00
|
|
|
void netif_tx_stop_all_queues(struct net_device *dev)
|
|
|
|
{
|
|
|
|
unsigned int i;
|
|
|
|
|
|
|
|
for (i = 0; i < dev->num_tx_queues; i++) {
|
|
|
|
struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
|
2017-02-09 06:56:07 +00:00
|
|
|
|
2015-05-11 19:17:53 +00:00
|
|
|
netif_tx_stop_queue(txq);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netif_tx_stop_all_queues);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/**
|
|
|
|
* register_netdevice - register a network device
|
|
|
|
* @dev: device to register
|
|
|
|
*
|
|
|
|
* Take a completed network device structure and add it to the kernel
|
|
|
|
* interfaces. A %NETDEV_REGISTER message is sent to the netdev notifier
|
|
|
|
* chain. 0 is returned on success. A negative errno code is returned
|
|
|
|
* on a failure to set up the device, or if the name is a duplicate.
|
|
|
|
*
|
|
|
|
* Callers must hold the rtnl semaphore. You may want
|
|
|
|
* register_netdev() instead of this.
|
|
|
|
*
|
|
|
|
* BUGS:
|
|
|
|
* The locking appears insufficient to guarantee two parallel registers
|
|
|
|
* will not get the same name.
|
|
|
|
*/
|
|
|
|
|
|
|
|
int register_netdevice(struct net_device *dev)
|
|
|
|
{
|
|
|
|
int ret;
|
2008-11-20 05:32:24 +00:00
|
|
|
struct net *net = dev_net(dev);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
BUG_ON(dev_boot_phase);
|
|
|
|
ASSERT_RTNL();
|
|
|
|
|
2006-05-10 20:21:17 +00:00
|
|
|
might_sleep();
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/* When net_device's are persistent, this will be fatal. */
|
|
|
|
BUG_ON(dev->reg_state != NETREG_UNINITIALIZED);
|
2008-11-20 05:32:24 +00:00
|
|
|
BUG_ON(!net);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2008-07-15 07:08:33 +00:00
|
|
|
spin_lock_init(&dev->addr_list_lock);
|
2008-07-22 21:16:42 +00:00
|
|
|
netdev_set_addr_lockdep_class(dev);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2012-09-13 20:58:27 +00:00
|
|
|
ret = dev_get_valid_name(net, dev, dev->name);
|
2011-05-12 15:46:56 +00:00
|
|
|
if (ret < 0)
|
|
|
|
goto out;
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/* Init, if this function is available */
|
2008-11-20 05:32:24 +00:00
|
|
|
if (dev->netdev_ops->ndo_init) {
|
|
|
|
ret = dev->netdev_ops->ndo_init(dev);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (ret) {
|
|
|
|
if (ret > 0)
|
|
|
|
ret = -EIO;
|
2006-11-14 00:02:22 +00:00
|
|
|
goto out;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
}
|
2007-02-09 14:24:36 +00:00
|
|
|
|
2013-04-19 02:04:27 +00:00
|
|
|
if (((dev->hw_features | dev->features) &
|
|
|
|
NETIF_F_HW_VLAN_CTAG_FILTER) &&
|
2013-01-29 15:14:16 +00:00
|
|
|
(!dev->netdev_ops->ndo_vlan_rx_add_vid ||
|
|
|
|
!dev->netdev_ops->ndo_vlan_rx_kill_vid)) {
|
|
|
|
netdev_WARN(dev, "Buggy VLAN acceleration in driver!\n");
|
|
|
|
ret = -EINVAL;
|
|
|
|
goto err_uninit;
|
|
|
|
}
|
|
|
|
|
2012-08-08 21:52:46 +00:00
|
|
|
ret = -EBUSY;
|
|
|
|
if (!dev->ifindex)
|
|
|
|
dev->ifindex = dev_new_index(net);
|
|
|
|
else if (__dev_get_by_index(net, dev->ifindex))
|
|
|
|
goto err_uninit;
|
|
|
|
|
2011-02-15 16:59:17 +00:00
|
|
|
/* Transfer changeable features to wanted_features and enable
|
|
|
|
* software offloads (GSO and GRO).
|
|
|
|
*/
|
|
|
|
dev->hw_features |= NETIF_F_SOFT_FEATURES;
|
2011-02-22 16:52:28 +00:00
|
|
|
dev->features |= NETIF_F_SOFT_FEATURES;
|
2017-07-21 10:49:28 +00:00
|
|
|
|
|
|
|
if (dev->netdev_ops->ndo_udp_tunnel_add) {
|
|
|
|
dev->features |= NETIF_F_RX_UDP_TUNNEL_PORT;
|
|
|
|
dev->hw_features |= NETIF_F_RX_UDP_TUNNEL_PORT;
|
|
|
|
}
|
|
|
|
|
2011-02-22 16:52:28 +00:00
|
|
|
dev->wanted_features = dev->features & dev->hw_features;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2016-04-11 01:44:51 +00:00
|
|
|
if (!(dev->flags & IFF_LOOPBACK))
|
2011-11-15 15:29:55 +00:00
|
|
|
dev->hw_features |= NETIF_F_NOCACHE_COPY;
|
2016-04-11 01:44:51 +00:00
|
|
|
|
2016-04-20 20:51:00 +00:00
|
|
|
/* If IPv4 TCP segmentation offload is supported we should also
|
|
|
|
* allow the device to enable segmenting the frame with the option
|
|
|
|
* of ignoring a static IP ID value. This doesn't enable the
|
|
|
|
* feature itself but allows the user to enable it later.
|
|
|
|
*/
|
2016-04-11 01:44:51 +00:00
|
|
|
if (dev->hw_features & NETIF_F_TSO)
|
|
|
|
dev->hw_features |= NETIF_F_TSO_MANGLEID;
|
2016-04-20 20:51:00 +00:00
|
|
|
if (dev->vlan_features & NETIF_F_TSO)
|
|
|
|
dev->vlan_features |= NETIF_F_TSO_MANGLEID;
|
|
|
|
if (dev->mpls_features & NETIF_F_TSO)
|
|
|
|
dev->mpls_features |= NETIF_F_TSO_MANGLEID;
|
|
|
|
if (dev->hw_enc_features & NETIF_F_TSO)
|
|
|
|
dev->hw_enc_features |= NETIF_F_TSO_MANGLEID;
|
2011-04-05 05:30:30 +00:00
|
|
|
|
2011-07-14 21:41:11 +00:00
|
|
|
/* Make NETIF_F_HIGHDMA inheritable to VLAN devices.
|
2010-09-15 09:24:24 +00:00
|
|
|
*/
|
2011-07-14 21:41:11 +00:00
|
|
|
dev->vlan_features |= NETIF_F_HIGHDMA;
|
2010-09-15 09:24:24 +00:00
|
|
|
|
2013-03-07 09:28:08 +00:00
|
|
|
/* Make NETIF_F_SG inheritable to tunnel devices.
|
|
|
|
*/
|
2016-04-11 01:45:03 +00:00
|
|
|
dev->hw_enc_features |= NETIF_F_SG | NETIF_F_GSO_PARTIAL;
|
2013-03-07 09:28:08 +00:00
|
|
|
|
2013-05-23 21:02:52 +00:00
|
|
|
/* Make NETIF_F_SG inheritable to MPLS.
|
|
|
|
*/
|
|
|
|
dev->mpls_features |= NETIF_F_SG;
|
|
|
|
|
2009-10-02 05:15:27 +00:00
|
|
|
ret = call_netdevice_notifiers(NETDEV_POST_INIT, dev);
|
|
|
|
ret = notifier_to_errno(ret);
|
|
|
|
if (ret)
|
|
|
|
goto err_uninit;
|
|
|
|
|
2007-09-27 05:02:53 +00:00
|
|
|
ret = netdev_register_kobject(dev);
|
2006-05-10 20:21:17 +00:00
|
|
|
if (ret)
|
2007-07-30 23:29:40 +00:00
|
|
|
goto err_uninit;
|
2006-05-10 20:21:17 +00:00
|
|
|
dev->reg_state = NETREG_REGISTERED;
|
|
|
|
|
2011-04-03 05:48:47 +00:00
|
|
|
__netdev_update_features(dev);
|
2011-02-22 16:52:28 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* Default initial state at registry is that the
|
|
|
|
* device is present.
|
|
|
|
*/
|
|
|
|
|
|
|
|
set_bit(__LINK_STATE_PRESENT, &dev->state);
|
|
|
|
|
net: Set device operstate at registration time
The operstate of a device is initially IF_OPER_UNKNOWN and is updated
asynchronously by linkwatch after each change of carrier state
reported by the driver. The default carrier state of a net device is
on, and this will never be changed on drivers that do not support
carrier detection, thus the operstate remains IF_OPER_UNKNOWN.
For devices that do support carrier detection, the driver must set the
carrier state to off initially, then poll the hardware state when the
device is opened. However, we must not activate linkwatch for a
unregistered device, and commit b473001 ('net: Do not fire linkwatch
events until the device is registered.') ensured that we don't. But
this means that the operstate for many devices that support carrier
detection remains IF_OPER_UNKNOWN when it should be IF_OPER_DOWN.
The same issue exists with the dormant state.
The proper initialisation sequence, avoiding a race with opening of
the device, is:
rtnl_lock();
rc = register_netdevice(dev);
if (rc)
goto out_unlock;
netif_carrier_off(dev); /* or netif_dormant_on(dev) */
rtnl_unlock();
but it seems silly that this should have to be repeated in so many
drivers. Further, the operstate seen immediately after opening the
device may still be IF_OPER_UNKNOWN due to the asynchronous nature of
linkwatch.
Commit 22604c8 ('net: Fix for initial link state in 2.6.28') attempted
to fix this by setting the operstate synchronously, but it was
reverted as it could lead to deadlock.
This initialises the operstate synchronously at registration time
only.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-20 21:16:51 +00:00
|
|
|
linkwatch_init_dev(dev);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
dev_init_scheduler(dev);
|
|
|
|
dev_hold(dev);
|
2007-09-12 11:53:49 +00:00
|
|
|
list_netdevice(dev);
|
2012-07-05 01:23:25 +00:00
|
|
|
add_device_randomness(dev->dev_addr, dev->addr_len);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2013-01-08 01:38:25 +00:00
|
|
|
/* If the device has permanent device address, driver should
|
|
|
|
* set dev_addr and also addr_assign_type should be set to
|
|
|
|
* NET_ADDR_PERM (default value).
|
|
|
|
*/
|
|
|
|
if (dev->addr_assign_type == NET_ADDR_PERM)
|
|
|
|
memcpy(dev->perm_addr, dev->dev_addr, dev->addr_len);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/* Notify protocols that a new device appeared. */
|
2007-09-16 22:42:43 +00:00
|
|
|
ret = call_netdevice_notifiers(NETDEV_REGISTER, dev);
|
2007-07-31 00:03:38 +00:00
|
|
|
ret = notifier_to_errno(ret);
|
2007-10-30 22:38:18 +00:00
|
|
|
if (ret) {
|
|
|
|
rollback_registered(dev);
|
|
|
|
dev->reg_state = NETREG_UNREGISTERED;
|
|
|
|
}
|
2009-12-12 22:11:15 +00:00
|
|
|
/*
|
|
|
|
* Prevent userspace races by waiting until the network
|
|
|
|
* device is fully set up before sending notifications.
|
|
|
|
*/
|
2010-02-26 06:34:51 +00:00
|
|
|
if (!dev->rtnl_link_ops ||
|
|
|
|
dev->rtnl_link_state == RTNL_LINK_INITIALIZED)
|
2013-10-23 23:02:42 +00:00
|
|
|
rtmsg_ifinfo(RTM_NEWLINK, dev, ~0U, GFP_KERNEL);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
out:
|
|
|
|
return ret;
|
2007-07-30 23:29:40 +00:00
|
|
|
|
|
|
|
err_uninit:
|
2008-11-20 05:32:24 +00:00
|
|
|
if (dev->netdev_ops->ndo_uninit)
|
|
|
|
dev->netdev_ops->ndo_uninit(dev);
|
net: Fix inconsistent teardown and release of private netdev state.
Network devices can allocate resources and private memory using
netdev_ops->ndo_init(). However, the release of these resources
can occur in one of two different places.
Either netdev_ops->ndo_uninit() or netdev->destructor().
The decision of which operation frees the resources depends upon
whether it is necessary for all netdev refs to be released before it
is safe to perform the freeing.
netdev_ops->ndo_uninit() presumably can occur right after the
NETDEV_UNREGISTER notifier completes and the unicast and multicast
address lists are flushed.
netdev->destructor(), on the other hand, does not run until the
netdev references all go away.
Further complicating the situation is that netdev->destructor()
almost universally also does a free_netdev().
This creates a problem for the logic in register_netdevice().
Because all callers of register_netdevice() manage the freeing
of the netdev, and invoke free_netdev(dev) if register_netdevice()
fails.
If netdev_ops->ndo_init() succeeds, but something else fails inside
of register_netdevice(), it does call netdev_ops->ndo_uninit(). But
it is not able to invoke netdev->destructor().
This is because netdev->destructor() will do a free_netdev() and
then the caller of register_netdevice() will do the same.
However, this means that the resources that would normally be released
by netdev->destructor() will not be.
Over the years drivers have added local hacks to deal with this, by
invoking their destructor parts by hand when register_netdevice()
fails.
Many drivers do not try to deal with this, and instead we have leaks.
Let's close this hole by formalizing the distinction between what
private things need to be freed up by netdev->destructor() and whether
the driver needs unregister_netdevice() to perform the free_netdev().
netdev->priv_destructor() performs all actions to free up the private
resources that used to be freed by netdev->destructor(), except for
free_netdev().
netdev->needs_free_netdev is a boolean that indicates whether
free_netdev() should be done at the end of unregister_netdevice().
Now, register_netdevice() can sanely release all resources after
netdev_ops->ndo_init() succeeds, by invoking both netdev_ops->ndo_uninit()
and netdev->priv_destructor().
And at the end of unregister_netdevice(), we invoke
netdev->priv_destructor() and optionally call free_netdev().
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-08 16:52:56 +00:00
|
|
|
if (dev->priv_destructor)
|
|
|
|
dev->priv_destructor(dev);
|
2007-07-30 23:29:40 +00:00
|
|
|
goto out;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(register_netdevice);
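The error unwinding in register_netdevice() follows a common C pattern: a failure of the init hook itself returns directly, while any later failure jumps to a label that undoes the init hook and releases private state, leaving the final free to the caller. The sketch below is a user-space model only. `struct toy_netdev`, `struct toy_ops`, the `fail_after_init` knob and the counting callbacks are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_netdev;

struct toy_ops {
	int  (*init)(struct toy_netdev *dev);
	void (*uninit)(struct toy_netdev *dev);
};

struct toy_netdev {
	const struct toy_ops *ops;
	void (*priv_destructor)(struct toy_netdev *dev);
	int fail_after_init;	/* hypothetical knob: simulate a later failure */
	int registered;
	int uninit_calls;
	int destructor_calls;
};

static int toy_register(struct toy_netdev *dev)
{
	int ret = 0;

	assert(dev != NULL && dev->ops != NULL);
	if (dev->ops->init) {
		ret = dev->ops->init(dev);
		if (ret) {
			if (ret > 0)
				ret = -EIO;	/* normalise positive errors */
			goto out;		/* init failed: nothing to undo */
		}
	}
	if (dev->fail_after_init) {
		ret = -EBUSY;
		goto err_uninit;
	}
	dev->registered = 1;
out:
	return ret;

err_uninit:
	if (dev->ops->uninit)
		dev->ops->uninit(dev);
	if (dev->priv_destructor)
		dev->priv_destructor(dev);
	goto out;
}

/* Example callbacks that only count how often they run. */
static int toy_init_ok(struct toy_netdev *dev) { (void)dev; return 0; }
static int toy_init_positive_error(struct toy_netdev *dev) { (void)dev; return 5; }
static void toy_count_uninit(struct toy_netdev *dev) { dev->uninit_calls++; }
static void toy_count_destructor(struct toy_netdev *dev) { dev->destructor_calls++; }
```

The point of the split is that the caller can always free the device after a failed registration without risking a double free, because the unwinding path never frees the device structure itself.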
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2009-01-15 05:05:05 +00:00
|
|
|
/**
|
|
|
|
* init_dummy_netdev - init a dummy network device for NAPI
|
|
|
|
* @dev: device to init
|
|
|
|
*
|
|
|
|
* This takes a network device structure and initializes the minimum
|
|
|
|
* amount of fields so it can be used to schedule NAPI polls without
|
|
|
|
* registering a full blown interface. This is to be used by drivers
|
|
|
|
* that need to tie several hardware interfaces to a single NAPI
|
|
|
|
* poll scheduler due to HW limitations.
|
|
|
|
*/
|
|
|
|
int init_dummy_netdev(struct net_device *dev)
|
|
|
|
{
|
|
|
|
/* Clear everything. Note we don't initialize spinlocks
|
|
|
|
* as they aren't supposed to be taken by any of the
|
|
|
|
* NAPI code and this dummy netdev is supposed to be
|
|
|
|
* only ever used for NAPI polls
|
|
|
|
*/
|
|
|
|
memset(dev, 0, sizeof(struct net_device));
|
|
|
|
|
|
|
|
/* make sure we BUG if trying to hit standard
|
|
|
|
* register/unregister code path
|
|
|
|
*/
|
|
|
|
dev->reg_state = NETREG_DUMMY;
|
|
|
|
|
|
|
|
/* NAPI wants this */
|
|
|
|
INIT_LIST_HEAD(&dev->napi_list);
|
|
|
|
|
|
|
|
/* a dummy interface is started by default */
|
|
|
|
set_bit(__LINK_STATE_PRESENT, &dev->state);
|
|
|
|
set_bit(__LINK_STATE_START, &dev->state);
|
|
|
|
|
net: percpu net_device refcount
We tried very hard to remove all possible dev_hold()/dev_put() pairs in
network stack, using RCU conversions.
There is still an unavoidable device refcount change for every dst we
create/destroy, and this can slow down some workloads (routers or some
app servers, mmap af_packet).
We can switch to a percpu refcount implementation, now that the dynamic
per_cpu infrastructure is mature. On a 64-CPU machine, this consumes
256 bytes per device.
On x86, dev_hold(dev) code :
before
lock incl 0x280(%ebx)
after:
movl 0x260(%ebx),%eax
incl fs:(%eax)
Stress bench :
(Sending 160.000.000 UDP frames,
IP route cache disabled, dual E5540 @2.53GHz,
32bit kernel, FIB_TRIE)
Before:
real 1m1.662s
user 0m14.373s
sys 12m55.960s
After:
real 0m51.179s
user 0m15.329s
sys 10m15.942s
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-11 10:22:12 +00:00
|
|
|
/* Note: We don't allocate pcpu_refcnt for dummy devices,
|
|
|
|
* because users of this 'device' don't need to change
|
|
|
|
* its refcount.
|
|
|
|
*/
|
|
|
|
|
2009-01-15 05:05:05 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(init_dummy_netdev);
|
|
|
|
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/**
|
|
|
|
* register_netdev - register a network device
|
|
|
|
* @dev: device to register
|
|
|
|
*
|
|
|
|
* Take a completed network device structure and add it to the kernel
|
|
|
|
* interfaces. A %NETDEV_REGISTER message is sent to the netdev notifier
|
|
|
|
* chain. 0 is returned on success. A negative errno code is returned
|
|
|
|
* on a failure to set up the device, or if the name is a duplicate.
|
|
|
|
*
|
2007-04-21 05:14:10 +00:00
|
|
|
* This is a wrapper around register_netdevice that takes the rtnl semaphore
|
2005-04-16 22:20:36 +00:00
|
|
|
* and expands the device name if you passed a format string to
|
|
|
|
* alloc_netdev.
|
|
|
|
*/
|
|
|
|
int register_netdev(struct net_device *dev)
|
|
|
|
{
|
|
|
|
int err;
|
|
|
|
|
|
|
|
rtnl_lock();
|
|
|
|
err = register_netdevice(dev);
|
|
|
|
rtnl_unlock();
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(register_netdev);
|
|
|
|
|
2010-10-11 10:22:12 +00:00
|
|
|
int netdev_refcnt_read(const struct net_device *dev)
|
|
|
|
{
|
|
|
|
int i, refcnt = 0;
|
|
|
|
|
|
|
|
for_each_possible_cpu(i)
|
|
|
|
refcnt += *per_cpu_ptr(dev->pcpu_refcnt, i);
|
|
|
|
return refcnt;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_refcnt_read);
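The per-CPU refcount read above sums one counter per CPU. A minimal user-space sketch of the idea, assuming a hypothetical fixed `TOY_NR_CPUS` and an explicit `cpu` argument instead of real per-CPU storage:

```c
#include <assert.h>

#define TOY_NR_CPUS 4	/* hypothetical fixed CPU count for the sketch */

struct toy_netdev {
	int pcpu_refcnt[TOY_NR_CPUS];	/* one slot per CPU, no shared counter */
};

/* Take a reference on behalf of the given CPU: touches only its own slot. */
static void toy_dev_hold(struct toy_netdev *dev, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	dev->pcpu_refcnt[cpu]++;
}

/* Drop a reference; this may happen on a different CPU than the hold. */
static void toy_dev_put(struct toy_netdev *dev, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	dev->pcpu_refcnt[cpu]--;
}

/* The true reference count is the sum over all slots. */
static int toy_refcnt_read(const struct toy_netdev *dev)
{
	int cpu, refcnt = 0;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		refcnt += dev->pcpu_refcnt[cpu];
	return refcnt;
}
```

An individual slot can go negative when a reference taken on one CPU is dropped on another. Only the sum is meaningful, which is why reading the count has to walk every slot while hold and put stay cheap.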
|
|
|
|
|
2012-07-10 10:55:09 +00:00
|
|
|
/**
|
2005-04-16 22:20:36 +00:00
|
|
|
* netdev_wait_allrefs - wait until all references are gone.
|
2012-08-18 14:36:44 +00:00
|
|
|
* @dev: target net_device
|
2005-04-16 22:20:36 +00:00
|
|
|
*
|
|
|
|
* This is called when unregistering network devices.
|
|
|
|
*
|
|
|
|
* Any protocol or device that holds a reference should register
|
|
|
|
* for netdevice notification, and clean up and put back the
|
|
|
|
* reference if they receive an UNREGISTER event.
|
|
|
|
* We can get stuck here if buggy protocols don't correctly
|
2007-02-09 14:24:36 +00:00
|
|
|
* call dev_put.
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
|
|
|
static void netdev_wait_allrefs(struct net_device *dev)
|
|
|
|
{
|
|
|
|
unsigned long rebroadcast_time, warning_time;
|
2010-10-11 10:22:12 +00:00
|
|
|
int refcnt;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
linkwatch: linkwatch_forget_dev() to speedup device dismantle
Herbert Xu wrote:
> On Tue, Nov 17, 2009 at 04:26:04AM -0800, David Miller wrote:
>> Really, the link watch stuff is just due for a redesign. I don't
>> think a simple hack is going to cut it this time, sorry Eric :-)
>
> I have no objections against any redesigns, but since the only
> caller of linkwatch_forget_dev runs in process context with the
> RTNL, it could also legally emit those events.
Thanks guys, here is an updated version then, before linkwatch surgery?
In this version, I force the event to be sent synchronously.
[PATCH net-next-2.6] linkwatch: linkwatch_forget_dev() to speedup device dismantle
time ip link del eth3.103 ; time ip link del eth3.104 ; time ip link del eth3.105
real 0m0.266s
user 0m0.000s
sys 0m0.001s
real 0m0.770s
user 0m0.000s
sys 0m0.000s
real 0m1.022s
user 0m0.000s
sys 0m0.000s
One problem with the current scheme in the vlan dismantle phase is that
the device is held through the following chain:
vlan_dev_stop() ->
netif_carrier_off(dev) ->
linkwatch_fire_event(dev) ->
dev_hold() ...
And __linkwatch_run_queue() runs up to one second later...
A generic fix to this problem is to add a linkwatch_forget_dev() method
to unlink the device from the list of watched devices.
dev->link_watch_next becomes dev->link_watch_list (and uses a bit more memory),
so that a device can be unlinked in O(1).
After patch :
time ip link del eth3.103 ; time ip link del eth3.104 ; time ip link del eth3.105
real 0m0.024s
user 0m0.000s
sys 0m0.000s
real 0m0.032s
user 0m0.000s
sys 0m0.001s
real 0m0.033s
user 0m0.000s
sys 0m0.000s
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-11-17 05:59:21 +00:00
|
|
|
linkwatch_forget_dev(dev);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
rebroadcast_time = warning_time = jiffies;
|
2010-10-11 10:22:12 +00:00
|
|
|
refcnt = netdev_refcnt_read(dev);
|
|
|
|
|
|
|
|
while (refcnt != 0) {
|
2005-04-16 22:20:36 +00:00
|
|
|
if (time_after(jiffies, rebroadcast_time + 1 * HZ)) {
|
2006-03-21 06:23:58 +00:00
|
|
|
rtnl_lock();
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/* Rebroadcast unregister notification */
|
2007-09-16 22:42:43 +00:00
|
|
|
call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2012-08-22 21:50:59 +00:00
|
|
|
__rtnl_unlock();
|
2012-08-22 17:19:46 +00:00
|
|
|
rcu_barrier();
|
2012-08-22 21:50:59 +00:00
|
|
|
rtnl_lock();
|
|
|
|
|
2012-08-22 17:19:46 +00:00
|
|
|
call_netdevice_notifiers(NETDEV_UNREGISTER_FINAL, dev);
|
2005-04-16 22:20:36 +00:00
|
|
|
if (test_bit(__LINK_STATE_LINKWATCH_PENDING,
|
|
|
|
&dev->state)) {
|
|
|
|
/* We must not have linkwatch events
|
|
|
|
* pending on unregister. If this
|
|
|
|
* happens, we simply run the queue
|
|
|
|
* unscheduled, resulting in a noop
|
|
|
|
* for this device.
|
|
|
|
*/
|
|
|
|
linkwatch_run_queue();
|
|
|
|
}
|
|
|
|
|
2006-03-21 06:23:58 +00:00
|
|
|
__rtnl_unlock();
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
rebroadcast_time = jiffies;
|
|
|
|
}
|
|
|
|
|
|
|
|
msleep(250);
|
|
|
|
|
2010-10-11 10:22:12 +00:00
|
|
|
refcnt = netdev_refcnt_read(dev);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
if (time_after(jiffies, warning_time + 10 * HZ)) {
|
2012-02-01 10:54:43 +00:00
|
|
|
pr_emerg("unregister_netdevice: waiting for %s to become free. Usage count = %d\n",
|
|
|
|
dev->name, refcnt);
|
2005-04-16 22:20:36 +00:00
|
|
|
warning_time = jiffies;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
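The wait loop above polls every 250 ms, rebroadcasts the unregister notification once more than a second has passed since the last broadcast, and warns once more than ten seconds have passed since the last warning. The following sketch replays that timing against a simulated clock. `toy_wait_allrefs()` and its `refs_released_at_ms` parameter are hypothetical, and the loop only counts events rather than calling any notifier.

```c
#include <assert.h>

struct toy_wait_stats {
	int polls;		/* number of 250 ms sleeps */
	int rebroadcasts;	/* unregister notifications re-sent */
	int warnings;		/* "waiting for ... to become free" messages */
};

/* refs_released_at_ms: simulated time at which the last reference is
 * dropped. The refcount is treated as non-zero before that time. */
static struct toy_wait_stats toy_wait_allrefs(long refs_released_at_ms)
{
	struct toy_wait_stats st = { 0, 0, 0 };
	long now = 0, rebroadcast_time = 0, warning_time = 0;

	assert(refs_released_at_ms >= 0);
	while (now < refs_released_at_ms) {	/* refcnt != 0 */
		if (now - rebroadcast_time > 1000) {
			st.rebroadcasts++;
			rebroadcast_time = now;
		}

		now += 250;			/* msleep(250) */
		st.polls++;

		if (now - warning_time > 10000) {
			st.warnings++;
			warning_time = now;
		}
	}
	return st;
}
```

Because the comparison is strict and the clock advances in 250 ms steps, rebroadcasts in this model land every 1250 ms rather than every 1000 ms.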
|
|
|
|
|
|
|
|
/* The sequence is:
|
|
|
|
*
|
|
|
|
* rtnl_lock();
|
|
|
|
* ...
|
|
|
|
* register_netdevice(x1);
|
|
|
|
* register_netdevice(x2);
|
|
|
|
* ...
|
|
|
|
* unregister_netdevice(y1);
|
|
|
|
* unregister_netdevice(y2);
|
|
|
|
* ...
|
|
|
|
* rtnl_unlock();
|
|
|
|
* free_netdev(y1);
|
|
|
|
* free_netdev(y2);
|
|
|
|
*
|
2008-10-07 22:50:03 +00:00
|
|
|
* We are invoked by rtnl_unlock().
|
2005-04-16 22:20:36 +00:00
|
|
|
* This allows us to deal with problems:
|
2006-05-10 20:21:17 +00:00
|
|
|
* 1) We can delete sysfs objects which invoke hotplug
|
2005-04-16 22:20:36 +00:00
|
|
|
* without deadlocking with linkwatch via keventd.
|
|
|
|
* 2) Since we run with the RTNL semaphore not held, we can sleep
|
|
|
|
* safely in order to wait for the netdev refcnt to drop to zero.
|
2008-10-07 22:50:03 +00:00
|
|
|
*
|
|
|
|
* We must not return until all unregister events added during
|
|
|
|
* the interval the lock was held have been completed.
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
|
|
|
void netdev_run_todo(void)
|
|
|
|
{
|
2006-06-23 09:05:55 +00:00
|
|
|
struct list_head list;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/* Snapshot list, allow later requests */
|
2006-06-23 09:05:55 +00:00
|
|
|
list_replace_init(&net_todo_list, &list);
|
2008-10-07 22:50:03 +00:00
|
|
|
|
|
|
|
__rtnl_unlock();
|
2006-06-23 09:05:55 +00:00
|
|
|
|
2012-08-22 17:19:46 +00:00
|
|
|
|
|
|
|
/* Wait for rcu callbacks to finish before next phase */
|
2011-10-13 22:25:23 +00:00
|
|
|
if (!list_empty(&list))
|
|
|
|
rcu_barrier();
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
while (!list_empty(&list)) {
|
|
|
|
struct net_device *dev
|
2010-02-24 14:01:38 +00:00
|
|
|
= list_first_entry(&list, struct net_device, todo_list);
|
2005-04-16 22:20:36 +00:00
|
|
|
list_del(&dev->todo_list);
|
|
|
|
|
2012-08-22 21:50:59 +00:00
|
|
|
rtnl_lock();
|
2012-08-22 17:19:46 +00:00
|
|
|
call_netdevice_notifiers(NETDEV_UNREGISTER_FINAL, dev);
|
2012-08-22 21:50:59 +00:00
|
|
|
__rtnl_unlock();
|
2012-08-22 17:19:46 +00:00
|
|
|
|
2006-05-10 20:21:17 +00:00
|
|
|
if (unlikely(dev->reg_state != NETREG_UNREGISTERING)) {
|
2012-02-01 10:54:43 +00:00
|
|
|
pr_err("network todo '%s' but state %d\n",
|
2006-05-10 20:21:17 +00:00
|
|
|
dev->name, dev->reg_state);
|
|
|
|
dump_stack();
|
|
|
|
continue;
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2006-05-10 20:21:17 +00:00
|
|
|
dev->reg_state = NETREG_UNREGISTERED;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2006-05-10 20:21:17 +00:00
|
|
|
netdev_wait_allrefs(dev);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2006-05-10 20:21:17 +00:00
|
|
|
/* paranoia */
|
2010-10-11 10:22:12 +00:00
|
|
|
BUG_ON(netdev_refcnt_read(dev));
|
2015-01-27 19:35:48 +00:00
|
|
|
BUG_ON(!list_empty(&dev->ptype_all));
|
|
|
|
BUG_ON(!list_empty(&dev->ptype_specific));
|
2011-08-11 19:30:52 +00:00
|
|
|
WARN_ON(rcu_access_pointer(dev->ip_ptr));
|
|
|
|
WARN_ON(rcu_access_pointer(dev->ip6_ptr));
|
2008-07-26 04:43:18 +00:00
|
|
|
WARN_ON(dev->dn_ptr);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
net: Fix inconsistent teardown and release of private netdev state.
Network devices can allocate resources and private memory using
netdev_ops->ndo_init(). However, the release of these resources
can occur in one of two different places.
Either netdev_ops->ndo_uninit() or netdev->destructor().
The decision of which operation frees the resources depends upon
whether it is necessary for all netdev refs to be released before it
is safe to perform the freeing.
netdev_ops->ndo_uninit() presumably can occur right after the
NETDEV_UNREGISTER notifier completes and the unicast and multicast
address lists are flushed.
netdev->destructor(), on the other hand, does not run until the
netdev references all go away.
Further complicating the situation is that netdev->destructor()
almost universally does also a free_netdev().
This creates a problem for the logic in register_netdevice().
Because all callers of register_netdevice() manage the freeing
of the netdev, and invoke free_netdev(dev) if register_netdevice()
fails.
If netdev_ops->ndo_init() succeeds, but something else fails inside
of register_netdevice(), it does call ndo_ops->ndo_uninit(). But
it is not able to invoke netdev->destructor().
This is because netdev->destructor() will do a free_netdev() and
then the caller of register_netdevice() will do the same.
However, this means that the resources that would normally be released
by netdev->destructor() will not be.
Over the years drivers have added local hacks to deal with this, by
invoking their destructor parts by hand when register_netdevice()
fails.
Many drivers do not try to deal with this, and instead we have leaks.
Let's close this hole by formalizing the distinction between what
private things need to be freed up by netdev->destructor() and whether
the driver needs unregister_netdevice() to perform the free_netdev().
netdev->priv_destructor() performs all actions to free up the private
resources that used to be freed by netdev->destructor(), except for
free_netdev().
netdev->needs_free_netdev is a boolean that indicates whether
free_netdev() should be done at the end of unregister_netdevice().
Now, register_netdevice() can sanely release all resources after
ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
and netdev->priv_destructor().
And at the end of unregister_netdevice(), we invoke
netdev->priv_destructor() and optionally call free_netdev().
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-08 16:52:56 +00:00
|
|
|
if (dev->priv_destructor)
|
|
|
|
dev->priv_destructor(dev);
|
|
|
|
if (dev->needs_free_netdev)
|
|
|
|
free_netdev(dev);
|
2007-05-19 22:39:25 +00:00
|
|
|
|
2013-09-24 04:19:49 +00:00
|
|
|
/* Report a network device has been unregistered */
|
|
|
|
rtnl_lock();
|
|
|
|
dev_net(dev)->dev_unreg_count--;
|
|
|
|
__rtnl_unlock();
|
|
|
|
wake_up(&netdev_unregistering_wq);
|
|
|
|
|
2007-05-19 22:39:25 +00:00
|
|
|
/* Free network device */
|
|
|
|
kobject_put(&dev->dev.kobj);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-02-01 23:51:04 +00:00
|
|
|
/* Convert net_device_stats to rtnl_link_stats64. rtnl_link_stats64 has
|
|
|
|
* all the same fields in the same order as net_device_stats, with only
|
|
|
|
* the type differing, but rtnl_link_stats64 may have additional fields
|
|
|
|
* at the end for newer counters.
|
2010-07-09 09:11:52 +00:00
|
|
|
*/
|
2012-03-05 04:50:09 +00:00
|
|
|
void netdev_stats_to_stats64(struct rtnl_link_stats64 *stats64,
|
|
|
|
const struct net_device_stats *netdev_stats)
|
2010-07-09 09:11:52 +00:00
|
|
|
{
|
|
|
|
#if BITS_PER_LONG == 64
|
2016-02-01 23:51:04 +00:00
|
|
|
BUILD_BUG_ON(sizeof(*stats64) < sizeof(*netdev_stats));
|
2017-07-03 01:20:13 +00:00
|
|
|
memcpy(stats64, netdev_stats, sizeof(*netdev_stats));
|
2016-02-01 23:51:04 +00:00
|
|
|
/* zero out counters that only exist in rtnl_link_stats64 */
|
|
|
|
memset((char *)stats64 + sizeof(*netdev_stats), 0,
|
|
|
|
sizeof(*stats64) - sizeof(*netdev_stats));
|
2010-07-09 09:11:52 +00:00
|
|
|
#else
|
2016-02-01 23:51:04 +00:00
|
|
|
size_t i, n = sizeof(*netdev_stats) / sizeof(unsigned long);
|
2010-07-09 09:11:52 +00:00
|
|
|
const unsigned long *src = (const unsigned long *)netdev_stats;
|
|
|
|
u64 *dst = (u64 *)stats64;
|
|
|
|
|
2016-02-01 23:51:04 +00:00
|
|
|
BUILD_BUG_ON(n > sizeof(*stats64) / sizeof(u64));
|
2010-07-09 09:11:52 +00:00
|
|
|
for (i = 0; i < n; i++)
|
|
|
|
dst[i] = src[i];
|
2016-02-01 23:51:04 +00:00
|
|
|
/* zero out counters that only exist in rtnl_link_stats64 */
|
|
|
|
memset((char *)stats64 + n * sizeof(u64), 0,
|
|
|
|
sizeof(*stats64) - n * sizeof(u64));
|
2010-07-09 09:11:52 +00:00
|
|
|
#endif
|
|
|
|
}
|
2012-03-05 04:50:09 +00:00
|
|
|
EXPORT_SYMBOL(netdev_stats_to_stats64);
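The non-64-bit branch of netdev_stats_to_stats64() above can be tried outside the kernel. The following is a minimal userspace sketch, not the kernel code: `demo_stats`, `demo_stats64` and `demo_stats_to_stats64` are hypothetical stand-ins for `net_device_stats`, `rtnl_link_stats64` and the real function, and the compile-time size check (BUILD_BUG_ON) is left out. It assumes, as the original comment states, that both structs list the same counters in the same order, with the wider struct carrying extra counters at the end.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for net_device_stats: native-width counters. */
struct demo_stats {
	unsigned long rx_packets;
	unsigned long tx_packets;
	unsigned long rx_bytes;
};

/* Hypothetical stand-in for rtnl_link_stats64: same fields, same order,
 * 64-bit wide, plus one newer counter at the end.
 */
struct demo_stats64 {
	uint64_t rx_packets;
	uint64_t tx_packets;
	uint64_t rx_bytes;
	uint64_t rx_nohandler;
};

static void demo_stats_to_stats64(struct demo_stats64 *stats64,
				  const struct demo_stats *stats)
{
	size_t i, n = sizeof(*stats) / sizeof(unsigned long);
	const unsigned long *src = (const unsigned long *)stats;
	uint64_t *dst = (uint64_t *)stats64;

	/* Widen each counter individually; works for any width of long. */
	for (i = 0; i < n; i++)
		dst[i] = src[i];

	/* Zero the counters that exist only in the 64-bit layout. */
	memset((char *)stats64 + n * sizeof(uint64_t), 0,
	       sizeof(*stats64) - n * sizeof(uint64_t));
}
```

The element-wise loop is the portable form; the memcpy() shortcut in the `BITS_PER_LONG == 64` branch is only valid because `unsigned long` and `u64` have the same size there.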
|
2010-07-09 09:11:52 +00:00
|
|
|
|
2008-11-20 05:40:23 +00:00
|
|
|
/**
|
|
|
|
* dev_get_stats - get network device statistics
|
|
|
|
* @dev: device to get statistics from
|
2010-07-07 21:58:56 +00:00
|
|
|
* @storage: place to store stats
|
2008-11-20 05:40:23 +00:00
|
|
|
*
|
2010-07-09 09:12:41 +00:00
|
|
|
* Get network statistics from device. Return @storage.
|
|
|
|
* The device driver may provide its own method by setting
|
|
|
|
* dev->netdev_ops->get_stats64 or dev->netdev_ops->get_stats;
|
|
|
|
* otherwise the internal statistics structure is used.
|
2008-11-20 05:40:23 +00:00
|
|
|
*/
|
2010-07-09 09:12:41 +00:00
|
|
|
struct rtnl_link_stats64 *dev_get_stats(struct net_device *dev,
|
|
|
|
struct rtnl_link_stats64 *storage)
|
2009-05-18 00:34:33 +00:00
|
|
|
{
|
2008-11-20 05:40:23 +00:00
|
|
|
const struct net_device_ops *ops = dev->netdev_ops;
|
|
|
|
|
2010-07-07 21:58:56 +00:00
|
|
|
if (ops->ndo_get_stats64) {
|
|
|
|
memset(storage, 0, sizeof(*storage));
|
2010-09-30 21:06:55 +00:00
|
|
|
ops->ndo_get_stats64(dev, storage);
|
|
|
|
} else if (ops->ndo_get_stats) {
|
2010-07-09 09:11:52 +00:00
|
|
|
netdev_stats_to_stats64(storage, ops->ndo_get_stats(dev));
|
2010-09-30 21:06:55 +00:00
|
|
|
} else {
|
|
|
|
netdev_stats_to_stats64(storage, &dev->stats);
|
2010-07-07 21:58:56 +00:00
|
|
|
}
|
2017-06-27 14:02:20 +00:00
|
|
|
storage->rx_dropped += (unsigned long)atomic_long_read(&dev->rx_dropped);
|
|
|
|
storage->tx_dropped += (unsigned long)atomic_long_read(&dev->tx_dropped);
|
|
|
|
storage->rx_nohandler += (unsigned long)atomic_long_read(&dev->rx_nohandler);
|
2010-07-07 21:58:56 +00:00
|
|
|
return storage;
|
2007-03-28 21:29:08 +00:00
|
|
|
}
|
2008-11-20 05:40:23 +00:00
|
|
|
EXPORT_SYMBOL(dev_get_stats);
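The fallback order in dev_get_stats() above (driver callback first, built-in counters otherwise, then core-maintained drop counters added on top) can be sketched in userspace as follows. This is a simplified illustration under stated assumptions: all `demo_*` names are hypothetical, the middle tier (`ndo_get_stats`) is omitted, and a plain integer stands in for the atomic `rx_dropped` counter.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct demo_stats64 {
	uint64_t rx_packets;
	uint64_t rx_dropped;
};

struct demo_dev;

/* Hypothetical ops table; a NULL callback means "not provided". */
struct demo_ops {
	void (*get_stats64)(struct demo_dev *dev, struct demo_stats64 *out);
};

struct demo_dev {
	const struct demo_ops *ops;
	struct demo_stats64 stats;	/* built-in counters, used as fallback */
	uint64_t core_rx_dropped;	/* drops counted by the core, not the driver */
};

/* Example driver callback that reports its own counters. */
static void demo_driver_stats64(struct demo_dev *dev, struct demo_stats64 *out)
{
	(void)dev;
	out->rx_packets = 7;
	out->rx_dropped = 1;
}

static struct demo_stats64 *demo_get_stats(struct demo_dev *dev,
					   struct demo_stats64 *storage)
{
	if (dev->ops->get_stats64) {
		/* Clear first so fields the driver does not fill read as 0. */
		memset(storage, 0, sizeof(*storage));
		dev->ops->get_stats64(dev, storage);
	} else {
		*storage = dev->stats;
	}

	/* Core-side drops are added regardless of where the rest came from. */
	storage->rx_dropped += dev->core_rx_dropped;
	return storage;
}
```

Returning `storage` mirrors the original signature and lets callers use the result directly in an expression.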
|
2007-03-28 21:29:08 +00:00
|
|
|
|
2010-10-02 06:11:55 +00:00
|
|
|
struct netdev_queue *dev_ingress_queue_create(struct net_device *dev)
|
2008-07-09 00:18:23 +00:00
|
|
|
{
|
2010-10-02 06:11:55 +00:00
|
|
|
struct netdev_queue *queue = dev_ingress_queue(dev);
|
2008-07-09 00:18:23 +00:00
|
|
|
|
2010-10-02 06:11:55 +00:00
|
|
|
#ifdef CONFIG_NET_CLS_ACT
|
|
|
|
if (queue)
|
|
|
|
return queue;
|
|
|
|
queue = kzalloc(sizeof(*queue), GFP_KERNEL);
|
|
|
|
if (!queue)
|
|
|
|
return NULL;
|
|
|
|
netdev_init_one_queue(dev, queue, NULL);
|
2015-02-04 21:37:44 +00:00
|
|
|
RCU_INIT_POINTER(queue->qdisc, &noop_qdisc);
|
2010-10-02 06:11:55 +00:00
|
|
|
queue->qdisc_sleeping = &noop_qdisc;
|
|
|
|
rcu_assign_pointer(dev->ingress_queue, queue);
|
|
|
|
#endif
|
|
|
|
return queue;
|
2008-07-08 23:55:56 +00:00
|
|
|
}
|
|
|
|
|
2012-09-16 09:17:26 +00:00
|
|
|
static const struct ethtool_ops default_ethtool_ops;
|
|
|
|
|
2013-01-10 23:19:10 +00:00
|
|
|
void netdev_set_default_ethtool_ops(struct net_device *dev,
|
|
|
|
const struct ethtool_ops *ops)
|
|
|
|
{
|
|
|
|
if (dev->ethtool_ops == &default_ethtool_ops)
|
|
|
|
dev->ethtool_ops = ops;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(netdev_set_default_ethtool_ops);
|
|
|
|
|
2013-10-30 20:10:44 +00:00
|
|
|
void netdev_freemem(struct net_device *dev)
|
|
|
|
{
|
|
|
|
char *addr = (char *)dev - dev->padded;
|
|
|
|
|
2014-06-02 22:55:22 +00:00
|
|
|
kvfree(addr);
|
2013-10-30 20:10:44 +00:00
|
|
|
}
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/**
|
2017-02-09 06:56:04 +00:00
|
|
|
* alloc_netdev_mqs - allocate network device
|
|
|
|
* @sizeof_priv: size of private data to allocate space for
|
|
|
|
* @name: device name format string
|
|
|
|
* @name_assign_type: origin of device name
|
|
|
|
* @setup: callback to initialize device
|
|
|
|
* @txqs: the number of TX subqueues to allocate
|
|
|
|
* @rxqs: the number of RX subqueues to allocate
|
|
|
|
*
|
|
|
|
* Allocates a struct net_device with private data area for driver use
|
|
|
|
* and performs basic initialization. Also allocates subqueue structs
|
|
|
|
* for each queue on the device.
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
2011-01-09 19:36:31 +00:00
|
|
|
struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
|
net: set name_assign_type in alloc_netdev()
Extend alloc_netdev{,_mq{,s}}() to take name_assign_type as argument, and convert
all users to pass NET_NAME_UNKNOWN.
Coccinelle patch:
@@
expression sizeof_priv, name, setup, txqs, rxqs, count;
@@
(
-alloc_netdev_mqs(sizeof_priv, name, setup, txqs, rxqs)
+alloc_netdev_mqs(sizeof_priv, name, NET_NAME_UNKNOWN, setup, txqs, rxqs)
|
-alloc_netdev_mq(sizeof_priv, name, setup, count)
+alloc_netdev_mq(sizeof_priv, name, NET_NAME_UNKNOWN, setup, count)
|
-alloc_netdev(sizeof_priv, name, setup)
+alloc_netdev(sizeof_priv, name, NET_NAME_UNKNOWN, setup)
)
v9: move comments here from the wrong commit
Signed-off-by: Tom Gundersen <teg@jklm.no>
Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-14 14:37:24 +00:00
|
|
|
unsigned char name_assign_type,
|
2011-01-09 19:36:31 +00:00
|
|
|
void (*setup)(struct net_device *),
|
|
|
|
unsigned int txqs, unsigned int rxqs)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
struct net_device *dev;
|
2017-09-21 20:33:29 +00:00
|
|
|
unsigned int alloc_size;
|
2009-05-27 04:42:37 +00:00
|
|
|
struct net_device *p;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2006-08-30 00:06:13 +00:00
|
|
|
BUG_ON(strlen(name) >= sizeof(dev->name));
|
|
|
|
|
2011-01-09 19:36:31 +00:00
|
|
|
if (txqs < 1) {
|
2012-02-01 10:54:43 +00:00
|
|
|
pr_err("alloc_netdev: Unable to allocate device with zero queues\n");
|
2010-10-18 17:55:58 +00:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2014-01-17 06:23:28 +00:00
|
|
|
#ifdef CONFIG_SYSFS
|
2011-01-09 19:36:31 +00:00
|
|
|
if (rxqs < 1) {
|
2012-02-01 10:54:43 +00:00
|
|
|
pr_err("alloc_netdev: Unable to allocate device with zero RX queues\n");
|
2011-01-09 19:36:31 +00:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2008-07-17 08:56:23 +00:00
|
|
|
alloc_size = sizeof(struct net_device);
|
2008-04-18 22:43:32 +00:00
|
|
|
if (sizeof_priv) {
|
|
|
|
/* ensure 32-byte alignment of private area */
|
2009-05-27 04:42:37 +00:00
|
|
|
alloc_size = ALIGN(alloc_size, NETDEV_ALIGN);
|
2008-04-18 22:43:32 +00:00
|
|
|
alloc_size += sizeof_priv;
|
|
|
|
}
|
|
|
|
/* ensure 32-byte alignment of whole construct */
|
2009-05-27 04:42:37 +00:00
|
|
|
alloc_size += NETDEV_ALIGN - 1;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2017-07-12 21:36:45 +00:00
|
|
|
p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
|
2013-02-04 16:48:16 +00:00
|
|
|
if (!p)
|
2005-04-16 22:20:36 +00:00
|
|
|
return NULL;
|
|
|
|
|
2009-05-27 04:42:37 +00:00
|
|
|
dev = PTR_ALIGN(p, NETDEV_ALIGN);
|
2005-04-16 22:20:36 +00:00
|
|
|
dev->padded = (char *)dev - (char *)p;
|
2009-05-08 13:30:17 +00:00
|
|
|
|
2010-10-11 10:22:12 +00:00
|
|
|
dev->pcpu_refcnt = alloc_percpu(int);
|
|
|
|
if (!dev->pcpu_refcnt)
|
2013-10-30 20:10:44 +00:00
|
|
|
goto free_dev;
|
2009-05-08 13:30:17 +00:00
|
|
|
|
|
|
|
if (dev_addr_init(dev))
|
2010-10-11 10:22:12 +00:00
|
|
|
goto free_pcpu;
|
2009-05-08 13:30:17 +00:00
|
|
|
|
2010-04-01 21:22:57 +00:00
|
|
|
dev_mc_init(dev);
|
2010-04-01 21:22:09 +00:00
|
|
|
dev_uc_init(dev);
|
2009-05-22 23:22:17 +00:00
|
|
|
|
2008-03-25 12:47:49 +00:00
|
|
|
dev_net_set(dev, &init_net);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2011-02-08 23:02:50 +00:00
|
|
|
dev->gso_max_size = GSO_MAX_SIZE;
|
2012-07-30 15:57:00 +00:00
|
|
|
dev->gso_max_segs = GSO_MAX_SEGS;
|
2011-02-08 23:02:50 +00:00
|
|
|
|
|
|
|
INIT_LIST_HEAD(&dev->napi_list);
|
|
|
|
INIT_LIST_HEAD(&dev->unreg_list);
|
2013-10-06 02:26:05 +00:00
|
|
|
INIT_LIST_HEAD(&dev->close_list);
|
2011-02-08 23:02:50 +00:00
|
|
|
INIT_LIST_HEAD(&dev->link_watch_list);
|
net: add adj_list to save only neighbours
Currently, we distinguish neighbours (first-level linked devices) from
non-neighbours by the neighbour bool in the netdev_adjacent. This could be
quite time-consuming in case we would like to traverse *only* through
neighbours - cause we'd have to traverse through all devices and check for
this flag, and in a (quite common) scenario where we have lots of vlans on
top of bridge, which is on top of a bond - the bonding would have to go
through all those vlans to get its upper neighbour linked devices.
This situation is really unpleasant, cause there are already a lot of cases
when a device with slaves needs to go through them in hot path.
To fix this, introduce a new upper/lower device lists structure -
adj_list, which contains only the neighbours. It works always in
pair with the all_adj_list structure (renamed from upper/lower_dev_list),
i.e. both of them contain the same links, only that all_adj_list contains
also non-neighbour device links. It's really a small change visible,
currently, only for __netdev_adjacent_dev_insert/remove(), and doesn't
change the main linked logic at all.
Also, add some comments and fix a name collision in
netdev_for_each_upper_dev_rcu() and rework the naming by the following
rules:
netdev_(all_)(upper|lower)_*
If "all_" is present, then we work with the whole list of upper/lower
devices, otherwise - only with direct neighbours. Uninline functions - to
get better stack traces.
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Alexander Duyck <alexander.h.duyck@intel.com>
CC: Cong Wang <amwang@redhat.com>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-25 07:20:07 +00:00
|
|
|
INIT_LIST_HEAD(&dev->adj_list.upper);
|
|
|
|
INIT_LIST_HEAD(&dev->adj_list.lower);
|
2015-01-27 19:35:48 +00:00
|
|
|
INIT_LIST_HEAD(&dev->ptype_all);
|
|
|
|
INIT_LIST_HEAD(&dev->ptype_specific);
|
2016-08-10 09:05:15 +00:00
|
|
|
#ifdef CONFIG_NET_SCHED
|
|
|
|
hash_init(dev->qdisc_hash);
|
|
|
|
#endif
|
2014-10-06 01:38:35 +00:00
|
|
|
dev->priv_flags = IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM;
|
2011-02-08 23:02:50 +00:00
|
|
|
setup(dev);
|
|
|
|
|
2016-02-17 14:37:43 +00:00
|
|
|
if (!dev->tx_queue_len) {
|
2015-08-27 19:21:36 +00:00
|
|
|
dev->priv_flags |= IFF_NO_QUEUE;
|
net/qdisc: IFF_NO_QUEUE drivers should use consistent TX queue len
The flag IFF_NO_QUEUE marks virtual device drivers that doesn't need a
default qdisc attached, given they will be backed by physical device,
that already have a qdisc attached for pushback.
It is still supported to attach a qdisc to a IFF_NO_QUEUE device, as
this can be useful for difference policy reasons (e.g. bandwidth
limiting containers). For this to work, the tx_queue_len need to have
a sane value, because some qdiscs inherit/copy the tx_queue_len
(namely, pfifo, bfifo, gred, htb, plug and sfb).
Commit a813104d9233 ("IFF_NO_QUEUE: Fix for drivers not calling
ether_setup()") caught situations where some drivers didn't initialize
tx_queue_len. The problem with the commit was choosing 1 as the
fallback value.
A qdisc queue length of 1 causes more harm than good, because it
creates hard to debug situations for userspace. It gives userspace a
false sense of a working config after attaching a qdisc. As low
volume traffic (that doesn't activate the qdisc policy) works,
like ping, while traffic that e.g. needs shaping cannot reach the
configured policy levels, given the queue length is too small.
This patch change the value to DEFAULT_TX_QUEUE_LEN, given other
IFF_NO_QUEUE devices (that call ether_setup()) also use this value.
Fixes: a813104d9233 ("IFF_NO_QUEUE: Fix for drivers not calling ether_setup()")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 13:56:06 +00:00
|
|
|
dev->tx_queue_len = DEFAULT_TX_QUEUE_LEN;
|
2016-02-17 14:37:43 +00:00
|
|
|
}
|
2015-08-18 08:30:48 +00:00
|
|
|
|
2011-01-09 19:36:31 +00:00
|
|
|
dev->num_tx_queues = txqs;
|
|
|
|
dev->real_num_tx_queues = txqs;
|
2010-11-09 10:47:30 +00:00
|
|
|
if (netif_alloc_netdev_queues(dev))
|
2011-02-08 23:02:50 +00:00
|
|
|
goto free_all;
|
2008-07-17 07:34:19 +00:00
|
|
|
|
2014-01-17 06:23:28 +00:00
|
|
|
#ifdef CONFIG_SYSFS
|
2011-01-09 19:36:31 +00:00
|
|
|
dev->num_rx_queues = rxqs;
|
|
|
|
dev->real_num_rx_queues = rxqs;
|
2010-11-09 10:47:38 +00:00
|
|
|
if (netif_alloc_rx_queues(dev))
|
2011-02-08 23:02:50 +00:00
|
|
|
goto free_all;
|
2010-03-24 19:13:54 +00:00
|
|
|
#endif
|
2010-03-16 08:03:29 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
strcpy(dev->name, name);
|
2014-07-14 14:37:24 +00:00
|
|
|
dev->name_assign_type = name_assign_type;
|
2011-01-13 23:38:30 +00:00
|
|
|
dev->group = INIT_NETDEV_GROUP;
|
2012-09-16 09:17:26 +00:00
|
|
|
if (!dev->ethtool_ops)
|
|
|
|
dev->ethtool_ops = &default_ethtool_ops;
|
netfilter: add netfilter ingress hook after handle_ing() under unique static key
This patch adds the Netfilter ingress hook just after the existing tc ingress
hook, that seems to be the consensus solution for this.
Note that the Netfilter hook resides under the global static key that enables
ingress filtering. Nonetheless, Netfilter still also has its own static key for
minimal impact on the existing handle_ing().
* Without this patch:
Result: OK: 6216490(c6216338+d152) usec, 100000000 (60byte,0frags)
16086246pps 7721Mb/sec (7721398080bps) errors: 100000000
42.46% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core
25.92% kpktgend_0 [kernel.kallsyms] [k] kfree_skb
7.81% kpktgend_0 [pktgen] [k] pktgen_thread_worker
5.62% kpktgend_0 [kernel.kallsyms] [k] ip_rcv
2.70% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal
2.34% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk
1.44% kpktgend_0 [kernel.kallsyms] [k] __build_skb
* With this patch:
Result: OK: 6214833(c6214731+d101) usec, 100000000 (60byte,0frags)
16090536pps 7723Mb/sec (7723457280bps) errors: 100000000
41.23% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core
26.57% kpktgend_0 [kernel.kallsyms] [k] kfree_skb
7.72% kpktgend_0 [pktgen] [k] pktgen_thread_worker
5.55% kpktgend_0 [kernel.kallsyms] [k] ip_rcv
2.78% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal
2.06% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk
1.43% kpktgend_0 [kernel.kallsyms] [k] __build_skb
* Without this patch + tc ingress:
tc filter add dev eth4 parent ffff: protocol ip prio 1 \
u32 match ip dst 4.3.2.1/32
Result: OK: 9269001(c9268821+d179) usec, 100000000 (60byte,0frags)
10788648pps 5178Mb/sec (5178551040bps) errors: 100000000
40.99% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core
17.50% kpktgend_0 [kernel.kallsyms] [k] kfree_skb
11.77% kpktgend_0 [cls_u32] [k] u32_classify
5.62% kpktgend_0 [kernel.kallsyms] [k] tc_classify_compat
5.18% kpktgend_0 [pktgen] [k] pktgen_thread_worker
3.23% kpktgend_0 [kernel.kallsyms] [k] tc_classify
2.97% kpktgend_0 [kernel.kallsyms] [k] ip_rcv
1.83% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal
1.50% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk
0.99% kpktgend_0 [kernel.kallsyms] [k] __build_skb
* With this patch + tc ingress:
tc filter add dev eth4 parent ffff: protocol ip prio 1 \
u32 match ip dst 4.3.2.1/32
Result: OK: 9308218(c9308091+d126) usec, 100000000 (60byte,0frags)
10743194pps 5156Mb/sec (5156733120bps) errors: 100000000
42.01% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core
17.78% kpktgend_0 [kernel.kallsyms] [k] kfree_skb
11.70% kpktgend_0 [cls_u32] [k] u32_classify
5.46% kpktgend_0 [kernel.kallsyms] [k] tc_classify_compat
5.16% kpktgend_0 [pktgen] [k] pktgen_thread_worker
2.98% kpktgend_0 [kernel.kallsyms] [k] ip_rcv
2.84% kpktgend_0 [kernel.kallsyms] [k] tc_classify
1.96% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal
1.57% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk
Note that the results are very similar before and after.
I can see gcc gets the code under the ingress static key out of the hot path.
Then, on that cold branch, it generates the code to accommodate the netfilter
ingress static key. My explanation for this is that this reduces the pressure
on the instruction cache for non-users as the new code is out of the hot path,
and it comes with minimal impact for tc ingress users.
Using gcc version 4.8.4 on:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
[...]
L1d cache: 16K
L1i cache: 64K
L2 cache: 2048K
L3 cache: 8192K
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 16:19:38 +00:00
|
|
|
|
|
|
|
nf_hook_ingress_init(dev);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
return dev;
|
2009-05-08 13:30:17 +00:00
|
|
|
|
2011-02-08 23:02:50 +00:00
|
|
|
free_all:
|
|
|
|
free_netdev(dev);
|
|
|
|
return NULL;
|
|
|
|
|
2010-10-11 10:22:12 +00:00
|
|
|
free_pcpu:
|
|
|
|
free_percpu(dev->pcpu_refcnt);
|
2013-10-30 20:10:44 +00:00
|
|
|
free_dev:
|
|
|
|
netdev_freemem(dev);
|
2009-05-08 13:30:17 +00:00
|
|
|
return NULL;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2011-01-09 19:36:31 +00:00
|
|
|
EXPORT_SYMBOL(alloc_netdev_mqs);
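The size and alignment arithmetic in alloc_netdev_mqs() above, together with the pointer recovery in netdev_freemem(), can be sketched in userspace like this. It is an illustration only: `DEMO_ALIGN` is assumed to be 32 to match the "32-byte alignment" comments, calloc()/free() stand in for kvzalloc()/kvfree(), and every `demo_*` name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_ALIGN 32	/* assumed alignment; must be a power of two */

struct demo_dev {
	char name[16];
	unsigned short padded;	/* bytes between the raw allocation and dev */
};

/* Round x up to the next multiple of DEMO_ALIGN. */
static uintptr_t demo_align_up(uintptr_t x)
{
	return (x + DEMO_ALIGN - 1) & ~(uintptr_t)(DEMO_ALIGN - 1);
}

static struct demo_dev *demo_alloc(size_t sizeof_priv)
{
	size_t alloc_size = sizeof(struct demo_dev);
	struct demo_dev *dev;
	char *p;

	if (sizeof_priv) {
		/* private area starts on an aligned offset after the struct */
		alloc_size = demo_align_up(alloc_size);
		alloc_size += sizeof_priv;
	}
	/* slack so the whole object can be shifted to an aligned address */
	alloc_size += DEMO_ALIGN - 1;

	p = calloc(1, alloc_size);
	if (!p)
		return NULL;

	dev = (struct demo_dev *)demo_align_up((uintptr_t)p);
	dev->padded = (unsigned short)((char *)dev - p);
	return dev;
}

/* Private area: first aligned offset past the device struct. */
static void *demo_priv(struct demo_dev *dev)
{
	return (char *)dev + demo_align_up(sizeof(*dev));
}

/* Undo the shift recorded in ->padded to recover the pointer to free. */
static void demo_free(struct demo_dev *dev)
{
	free((char *)dev - dev->padded);
}
```

Recording the shift in `padded` is what allows the free path to hand the allocator its original pointer even though callers only ever see the aligned one.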
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/**
|
2017-02-09 06:56:04 +00:00
|
|
|
* free_netdev - free network device
|
|
|
|
* @dev: device
|
2005-04-16 22:20:36 +00:00
|
|
|
*
|
2017-02-09 06:56:04 +00:00
|
|
|
* This function does the last stage of destroying an allocated device
|
|
|
|
* interface. The reference to the device object is released. If this
|
|
|
|
* is the last reference then it will be freed. Must be called in process
|
|
|
|
* context.
|
2005-04-16 22:20:36 +00:00
|
|
|
*/
|
|
|
|
void free_netdev(struct net_device *dev)
|
|
|
|
{
|
net: Add Generic Receive Offload infrastructure
This patch adds the top-level GRO (Generic Receive Offload) infrastructure.
This is pretty similar to LRO except that this is protocol-independent.
Instead of holding packets in an lro_mgr structure, they're now held in
napi_struct.
For drivers that intend to use this, they can set the NETIF_F_GRO bit and
call napi_gro_receive instead of netif_receive_skb or just call netif_rx.
The latter will call napi_receive_skb automatically. When napi_gro_receive
is used, the driver must either call napi_complete/napi_rx_complete, or
call napi_gro_flush in softirq context if the driver uses the primitives
__napi_complete/__napi_rx_complete.
Protocols will set the gro_receive and gro_complete function pointers in
order to participate in this scheme.
In addition to the packet, gro_receive will get a list of currently held
packets. Each packet in the list has a same_flow field which is non-zero
if it is a potential match for the new packet. For each packet that may
match, they also have a flush field which is non-zero if the held packet
must not be merged with the new packet.
Once gro_receive has determined that the new skb matches a held packet,
the held packet may be processed immediately if the new skb cannot be
merged with it. In this case gro_receive should return the pointer to
the existing skb in gro_list. Otherwise the new skb should be merged into
the existing packet and NULL should be returned, unless the new skb makes
it impossible for any further merges to be made (e.g., FIN packet) where
the merged skb should be returned.
Whenever the skb is merged into an existing entry, the gro_receive
function should set NAPI_GRO_CB(skb)->same_flow. Note that if an skb
merely matches an existing entry but can't be merged with it, then
this shouldn't be set.
If gro_receive finds it pointless to hold the new skb for future merging,
it should set NAPI_GRO_CB(skb)->flush.
Held packets will be flushed by napi_gro_flush which is called by
napi_complete and napi_rx_complete.
Currently held packets are stored in a singly linked list just like LRO.
The list is limited to a maximum of 8 entries. In future, this may be
expanded to use a hash table to allow more flows to be held for merging.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-16 07:38:52 +00:00
|
|
|
struct napi_struct *p, *n;
|
2017-04-18 19:36:58 +00:00
|
|
|
struct bpf_prog *prog;
|
net: Add Generic Receive Offload infrastructure
This patch adds the top-level GRO (Generic Receive Offload) infrastructure.
This is pretty similar to LRO except that this is protocol-independent.
Instead of holding packets in an lro_mgr structure, they're now held in
napi_struct.
For drivers that intend to use this, they can set the NETIF_F_GRO bit and
call napi_gro_receive instead of netif_receive_skb or just call netif_rx.
The latter will call napi_receive_skb automatically. When napi_gro_receive
is used, the driver must either call napi_complete/napi_rx_complete, or
call napi_gro_flush in softirq context if the driver uses the primitives
__napi_complete/__napi_rx_complete.
Protocols will set the gro_receive and gro_complete function pointers in
order to participate in this scheme.
In addition to the packet, gro_receive will get a list of currently held
packets. Each packet in the list has a same_flow field which is non-zero
if it is a potential match for the new packet. For each packet that may
match, they also have a flush field which is non-zero if the held packet
must not be merged with the new packet.
Once gro_receive has determined that the new skb matches a held packet,
the held packet may be processed immediately if the new skb cannot be
merged with it. In this case gro_receive should return the pointer to
the existing skb in gro_list. Otherwise the new skb should be merged into
the existing packet and NULL should be returned, unless the new skb makes
it impossible for any further merges to be made (e.g., FIN packet) where
the merged skb should be returned.
Whenever the skb is merged into an existing entry, the gro_receive
function should set NAPI_GRO_CB(skb)->same_flow. Note that if an skb
merely matches an existing entry but can't be merged with it, then
this shouldn't be set.
If gro_receive finds it pointless to hold the new skb for future merging,
it should set NAPI_GRO_CB(skb)->flush.
Held packets will be flushed by napi_gro_flush which is called by
napi_complete and napi_rx_complete.
Currently held packets are stored in a singly linked list just like LRO.
The list is limited to a maximum of 8 entries. In future, this may be
expanded to use a hash table to allow more flows to be held for merging.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-16 07:38:52 +00:00
|
|
|
|
2015-11-18 14:31:03 +00:00
|
|
|
might_sleep();
|
2013-06-20 08:15:51 +00:00
|
|
|
netif_free_tx_queues(dev);
|
2014-01-17 06:23:28 +00:00
|
|
|
#ifdef CONFIG_SYSFS
|
2015-01-12 06:11:28 +00:00
|
|
|
kvfree(dev->_rx);
|
2010-11-09 10:47:38 +00:00
|
|
|
#endif
|
2008-07-17 07:34:19 +00:00
|
|
|
|
2011-08-11 19:30:52 +00:00
|
|
|
kfree(rcu_dereference_protected(dev->ingress_queue, 1));
|
2010-10-02 06:11:55 +00:00
|
|
|
|
2009-05-05 02:48:28 +00:00
|
|
|
/* Flush device addresses */
|
|
|
|
dev_addr_flush(dev);
|
|
|
|
|
2008-12-16 07:38:52 +00:00
|
|
|
list_for_each_entry_safe(p, n, &dev->napi_list, dev_list)
|
|
|
|
netif_napi_del(p);
|
|
|
|
|
net: percpu net_device refcount
We tried very hard to remove all possible dev_hold()/dev_put() pairs in
network stack, using RCU conversions.
There is still an unavoidable device refcount change for every dst we
create/destroy, and this can slow down some workloads (routers or some
app servers, mmap af_packet)
We can switch to a percpu refcount implementation, now dynamic per_cpu
infrastructure is mature. On a 64 cpus machine, this consumes 256 bytes
per device.
On x86, dev_hold(dev) code :
before
lock incl 0x280(%ebx)
after:
movl 0x260(%ebx),%eax
incl fs:(%eax)
Stress bench :
(Sending 160.000.000 UDP frames,
IP route cache disabled, dual E5540 @2.53GHz,
32bit kernel, FIB_TRIE)
Before:
real 1m1.662s
user 0m14.373s
sys 12m55.960s
After:
real 0m51.179s
user 0m15.329s
sys 10m15.942s
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-10-11 10:22:12 +00:00
|
|
|
free_percpu(dev->pcpu_refcnt);
|
|
|
|
dev->pcpu_refcnt = NULL;
|
|
|
|
|
2017-04-18 19:36:58 +00:00
|
|
|
prog = rcu_dereference_protected(dev->xdp_prog, 1);
|
|
|
|
if (prog) {
|
|
|
|
bpf_prog_put(prog);
|
|
|
|
static_key_slow_dec(&generic_xdp_needed);
|
|
|
|
}
|
|
|
|
|
2006-05-26 20:25:24 +00:00
|
|
|
/* Compatibility with error handling in drivers */
|
2005-04-16 22:20:36 +00:00
|
|
|
if (dev->reg_state == NETREG_UNINITIALIZED) {
|
2013-10-30 20:10:44 +00:00
|
|
|
netdev_freemem(dev);
|
2005-04-16 22:20:36 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
BUG_ON(dev->reg_state != NETREG_UNREGISTERED);
|
|
|
|
dev->reg_state = NETREG_RELEASED;
|
|
|
|
|
2002-04-09 19:14:34 +00:00
|
|
|
/* will free via device release */
|
|
|
|
put_device(&dev->dev);
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(free_netdev);
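The "percpu net_device refcount" note above describes trading one shared counter for one slot per CPU. The following is a minimal userspace sketch of that idea, not the kernel implementation: `toy_dev`, the `toy_dev_*` helpers and the plain array indexed by CPU number are all hypothetical stand-ins for the kernel's per-CPU allocator, and nothing here is atomic or preemption-safe.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NR_CPUS 64 /* illustrative CPU count, taken from the commit note's example */

/* Hypothetical stand-in for a device that owns one counter slot per CPU. */
struct toy_dev {
	int32_t *pcpu_refcnt; /* NR_CPUS slots */
};

static int toy_dev_init(struct toy_dev *dev)
{
	dev->pcpu_refcnt = calloc(NR_CPUS, sizeof(*dev->pcpu_refcnt));
	return dev->pcpu_refcnt ? 0 : -1;
}

/* Take a reference on the given CPU: only that CPU's slot is written. */
static void toy_dev_hold(struct toy_dev *dev, unsigned int cpu)
{
	assert(cpu < NR_CPUS);
	dev->pcpu_refcnt[cpu]++;
}

/* Drop a reference; this may happen on a different CPU than the hold,
 * so an individual slot can legitimately go negative. */
static void toy_dev_put(struct toy_dev *dev, unsigned int cpu)
{
	assert(cpu < NR_CPUS);
	dev->pcpu_refcnt[cpu]--;
}

/* The real count is only known after summing every slot. */
static int toy_dev_refcnt_read(const struct toy_dev *dev)
{
	int sum = 0;

	for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += dev->pcpu_refcnt[cpu];
	return sum;
}

/* Same order as the teardown in free_netdev(): free, then clear the pointer. */
static void toy_dev_free(struct toy_dev *dev)
{
	free(dev->pcpu_refcnt);
	dev->pcpu_refcnt = NULL;
}
```

With 4-byte slots, 64 CPUs cost 64 * 4 = 256 bytes per device, which matches the figure quoted in the commit note. The price of the cheap hold/put path is that reading the count becomes a loop over all slots.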
|
2007-02-09 14:24:36 +00:00
|
|
|
|
2008-09-30 09:23:58 +00:00
|
|
|
/**
|
|
|
|
* synchronize_net - Synchronize with packet receive processing
|
|
|
|
*
|
|
|
|
* Wait for packets currently being received to be done.
|
|
|
|
* Does not block later packets from starting.
|
|
|
|
*/
|
2007-02-09 14:24:36 +00:00
|
|
|
void synchronize_net(void)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
might_sleep();
|
2011-05-23 23:07:32 +00:00
|
|
|
if (rtnl_is_locked())
|
|
|
|
synchronize_rcu_expedited();
|
|
|
|
else
|
|
|
|
synchronize_rcu();
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2009-09-03 08:29:39 +00:00
|
|
|
EXPORT_SYMBOL(synchronize_net);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/**
|
2009-10-27 07:03:04 +00:00
|
|
|
* unregister_netdevice_queue - remove device from the kernel
|
2005-04-16 22:20:36 +00:00
|
|
|
* @dev: device
|
2009-10-27 07:03:04 +00:00
|
|
|
* @head: list
|
2009-11-23 04:43:13 +00:00
|
|
|
*
|
2005-04-16 22:20:36 +00:00
|
|
|
* This function shuts down a device interface and removes it
|
2007-12-11 10:28:03 +00:00
|
|
|
* from the kernel tables.
|
2009-10-27 07:03:04 +00:00
|
|
|
* If head is not NULL, the device is queued to be unregistered later.
|
2005-04-16 22:20:36 +00:00
|
|
|
*
|
|
|
|
* Callers must hold the rtnl semaphore. You may want
|
|
|
|
* unregister_netdev() instead of this.
|
|
|
|
*/
|
|
|
|
|
2009-10-27 07:03:04 +00:00
|
|
|
void unregister_netdevice_queue(struct net_device *dev, struct list_head *head)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
2007-12-13 03:21:56 +00:00
|
|
|
ASSERT_RTNL();
|
|
|
|
|
2009-10-27 07:03:04 +00:00
|
|
|
if (head) {
|
2009-10-30 14:51:13 +00:00
|
|
|
list_move_tail(&dev->unreg_list, head);
|
2009-10-27 07:03:04 +00:00
|
|
|
} else {
|
|
|
|
rollback_registered(dev);
|
|
|
|
/* Finish processing unregister after unlock */
|
|
|
|
net_set_todo(dev);
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
2009-10-27 07:03:04 +00:00
|
|
|
EXPORT_SYMBOL(unregister_netdevice_queue);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2009-10-27 07:04:19 +00:00
|
|
|
/**
|
|
|
|
* unregister_netdevice_many - unregister many devices
|
|
|
|
* @head: list of devices
|
2014-06-06 13:44:03 +00:00
|
|
|
*
|
|
|
|
* Note: As most callers use a stack allocated list_head,
|
|
|
|
* we force a list_del() to make sure the stack won't be corrupted later.
|
2009-10-27 07:04:19 +00:00
|
|
|
*/
|
|
|
|
void unregister_netdevice_many(struct list_head *head)
|
|
|
|
{
|
|
|
|
struct net_device *dev;
|
|
|
|
|
|
|
|
if (!list_empty(head)) {
|
|
|
|
rollback_registered_many(head);
|
|
|
|
list_for_each_entry(dev, head, unreg_list)
|
|
|
|
net_set_todo(dev);
|
2014-06-06 13:44:03 +00:00
|
|
|
list_del(head);
|
2009-10-27 07:04:19 +00:00
|
|
|
}
|
|
|
|
}
|
2009-10-27 07:06:49 +00:00
|
|
|
EXPORT_SYMBOL(unregister_netdevice_many);
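The note on unregister_netdevice_many() says the list head is usually a stack variable and is therefore unlinked before returning. A small userspace sketch of why that matters follows. The `list_node` type and helpers are hypothetical simplifications of the kernel list API (for instance, this unlink writes NULL where the kernel writes poison values).

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a circular doubly linked list node. */
struct list_node {
	struct list_node *next, *prev;
};

static void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail_node(struct list_node *node, struct list_node *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Unlink a node: its two neighbours end up pointing at each other. */
static void list_del_node(struct list_node *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->next = NULL;
	node->prev = NULL;
}

/*
 * Queue two long-lived entries on a head that lives on this function's
 * stack, then unlink the head before returning. Without the final
 * list_del_node(), a->prev and b->next would keep pointing into a dead
 * stack frame.
 */
static void queue_on_stack_head(struct list_node *a, struct list_node *b)
{
	struct list_node head;

	list_init(&head);
	list_add_tail_node(a, &head);
	list_add_tail_node(b, &head);

	/* ... the entries would be processed here ... */

	list_del_node(&head);
}
```

After the call, the two entries form a ring that no longer references the stack head, so later list operations on them cannot write into memory that has been reused.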
|
2009-10-27 07:04:19 +00:00
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/**
|
|
|
|
* unregister_netdev - remove device from the kernel
|
|
|
|
* @dev: device
|
|
|
|
*
|
|
|
|
* This function shuts down a device interface and removes it
|
2007-12-11 10:28:03 +00:00
|
|
|
* from the kernel tables.
|
2005-04-16 22:20:36 +00:00
|
|
|
*
|
|
|
|
* This is just a wrapper for unregister_netdevice that takes
|
|
|
|
* the rtnl semaphore. In general you want to use this and not
|
|
|
|
* unregister_netdevice.
|
|
|
|
*/
|
|
|
|
void unregister_netdev(struct net_device *dev)
|
|
|
|
{
|
|
|
|
rtnl_lock();
|
|
|
|
unregister_netdevice(dev);
|
|
|
|
rtnl_unlock();
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(unregister_netdev);
|
|
|
|
|
2007-09-12 11:53:49 +00:00
|
|
|
/**
|
|
|
|
* dev_change_net_namespace - move device to a different network namespace
|
|
|
|
* @dev: device
|
|
|
|
* @net: network namespace
|
|
|
|
* @pat: If not NULL name pattern to try if the current device name
|
|
|
|
* is already taken in the destination network namespace.
|
|
|
|
*
|
|
|
|
* This function shuts down a device interface and moves it
|
|
|
|
* to a new network namespace. On success 0 is returned, on
|
|
|
|
* failure a negative errno code is returned.
|
|
|
|
*
|
|
|
|
* Callers must hold the rtnl semaphore.
|
|
|
|
*/
|
|
|
|
|
|
|
|
int dev_change_net_namespace(struct net_device *dev, struct net *net, const char *pat)
|
|
|
|
{
|
|
|
|
int err;
|
|
|
|
|
|
|
|
ASSERT_RTNL();
|
|
|
|
|
|
|
|
/* Don't allow namespace local devices to be moved. */
|
|
|
|
err = -EINVAL;
|
|
|
|
if (dev->features & NETIF_F_NETNS_LOCAL)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
/* Ensure the device has been registered */
|
|
|
|
if (dev->reg_state != NETREG_REGISTERED)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
/* Get out if there is nothing to do */
|
|
|
|
err = 0;
|
2008-03-25 18:57:35 +00:00
|
|
|
if (net_eq(dev_net(dev), net))
|
2007-09-12 11:53:49 +00:00
|
|
|
goto out;
|
|
|
|
|
|
|
|
/* Pick the destination device name, and ensure
|
|
|
|
* we can use it in the destination network namespace.
|
|
|
|
*/
|
|
|
|
err = -EEXIST;
|
2009-11-18 02:36:59 +00:00
|
|
|
if (__dev_get_by_name(net, dev->name)) {
|
2007-09-12 11:53:49 +00:00
|
|
|
/* We get here if we can't use the current device name */
|
|
|
|
if (!pat)
|
|
|
|
goto out;
|
2012-09-13 20:58:27 +00:00
|
|
|
if (dev_get_valid_name(net, dev, pat) < 0)
|
2007-09-12 11:53:49 +00:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* And now a mini version of unregister_netdevice and register_netdevice.
|
|
|
|
*/
|
|
|
|
|
|
|
|
/* If device is running close it first. */
|
2007-10-10 09:49:09 +00:00
|
|
|
dev_close(dev);
|
2007-09-12 11:53:49 +00:00
|
|
|
|
|
|
|
/* And unlink it from device chain */
|
|
|
|
err = -ENODEV;
|
|
|
|
unlist_netdevice(dev);
|
|
|
|
|
|
|
|
synchronize_net();
|
|
|
|
|
|
|
|
/* Shutdown queueing discipline. */
|
|
|
|
dev_shutdown(dev);
|
|
|
|
|
|
|
|
/* Notify protocols, that we are about to destroy
|
2017-02-09 06:56:06 +00:00
|
|
|
* this device. They should clean all the things.
|
|
|
|
*
|
|
|
|
* Note that dev->reg_state stays at NETREG_REGISTERED.
|
|
|
|
* This is wanted because this way 8021q and macvlan know
|
|
|
|
* the device is just moving and can keep their slaves up.
|
|
|
|
*/
|
2007-09-12 11:53:49 +00:00
|
|
|
call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
|
net: dev: fix the incorrect hold of net namespace's lo device
When a net device is moved from one net namespace to another,
dev_change_net_namespace raises the NETDEV_DOWN event, so the
original namespace's dst entries that belonged to this device
are put on the dst_garbage list. dev_change_net_namespace then
sets the device's net to the new namespace.
If the device's driver is later unregistered, the
NETDEV_UNREGISTER_FINAL event fires and dst_ifdown is called. It
takes the device's dst entries from the dst_garbage list and
points their dev at the new namespace's lo device.
That is not what we want: these dst entries need to hold the
original namespace's lo device. Holding the wrong device triggers
an emergency-level message like the one below.
unregister_netdevice: waiting for lo to become free. Usage count = 1
So dev_change_net_namespace should raise NETDEV_UNREGISTER_FINAL
too. To make sure the dst entries are already on the dst_garbage
list, an rcu_barrier is needed before raising
NETDEV_UNREGISTER_FINAL.
With help from Eric Dumazet.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-23 15:36:55 +00:00
|
|
|
rcu_barrier();
|
|
|
|
call_netdevice_notifiers(NETDEV_UNREGISTER_FINAL, dev);
|
2013-10-23 23:02:42 +00:00
|
|
|
rtmsg_ifinfo(RTM_DELLINK, dev, ~0U, GFP_KERNEL);
|
2007-09-12 11:53:49 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Flush the unicast and multicast chains
|
|
|
|
*/
|
2010-04-01 21:22:09 +00:00
|
|
|
dev_uc_flush(dev);
|
2010-04-01 21:22:57 +00:00
|
|
|
dev_mc_flush(dev);
|
2007-09-12 11:53:49 +00:00
|
|
|
|
net: dev_change_net_namespace: send a KOBJ_REMOVED/KOBJ_ADD
When a new nic is created in namespace ns1, the kernel sends a KOBJ_ADD uevent
to ns1. When the nic is moved to ns2, we only send a KOBJ_MOVE to ns2, and
nothing to ns1.
This patch changes that behavior so that when moving a nic from ns1 to ns2, we
send a KOBJ_REMOVED to ns1 and KOBJ_ADD to ns2. (The KOBJ_MOVE is still
sent to ns2).
The effects of this can be seen when starting and stopping containers in
an upstart based host. Lxc will create a pair of veth nics, the kernel
sends KOBJ_ADD, and upstart starts network-instance jobs for each. When
one nic is moved to the container, because no KOBJ_REMOVED event is
received, the network-instance job for that veth never goes away. This
was reported at https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1065589
With this patch the network-instance jobs properly go away.
The other oddness solved here is that if a nic is passed into a running
upstart-based container, without this patch no network-instance job is
started in the container. But when the container creates a new nic
itself (ip link add new type veth) then network-interface jobs are
created. With this patch, behavior comes in line with a regular host.
v2: also send KOBJ_ADD to new netns. There will then be a
_MOVE event from the device_rename() call, but that should
be innocuous.
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-03 16:17:12 +00:00
|
|
|
/* Send a netdev-removed uevent to the old namespace */
|
|
|
|
kobject_uevent(&dev->dev.kobj, KOBJ_REMOVE);
|
2014-08-25 12:26:45 +00:00
|
|
|
netdev_adjacent_del_links(dev);
|
2012-12-03 16:17:12 +00:00
|
|
|
|
2007-09-12 11:53:49 +00:00
|
|
|
/* Actually switch the network namespace */
|
2008-03-25 12:47:49 +00:00
|
|
|
dev_net_set(dev, net);
|
2007-09-12 11:53:49 +00:00
|
|
|
|
|
|
|
/* If there is an ifindex conflict assign a new one */
|
2015-04-02 15:07:09 +00:00
|
|
|
if (__dev_get_by_index(net, dev->ifindex))
|
2007-09-12 11:53:49 +00:00
|
|
|
dev->ifindex = dev_new_index(net);
|
|
|
|
|
2012-12-03 16:17:12 +00:00
|
|
|
/* Send a netdev-add uevent to the new namespace */
|
|
|
|
kobject_uevent(&dev->dev.kobj, KOBJ_ADD);
|
2014-08-25 12:26:45 +00:00
|
|
|
netdev_adjacent_add_links(dev);
|
2012-12-03 16:17:12 +00:00
|
|
|
|
2007-09-27 05:02:53 +00:00
|
|
|
/* Fixup kobjects */
|
2010-05-05 00:36:49 +00:00
|
|
|
err = device_rename(&dev->dev, dev->name);
|
2007-09-27 05:02:53 +00:00
|
|
|
WARN_ON(err);
|
2007-09-12 11:53:49 +00:00
|
|
|
|
|
|
|
/* Add the device back in the hashes */
|
|
|
|
list_netdevice(dev);
|
|
|
|
|
|
|
|
/* Notify protocols, that a new device appeared. */
|
|
|
|
call_netdevice_notifiers(NETDEV_REGISTER, dev);
|
|
|
|
|
2009-12-12 22:11:15 +00:00
|
|
|
/*
|
|
|
|
* Prevent userspace races by waiting until the network
|
|
|
|
* device is fully setup before sending notifications.
|
|
|
|
*/
|
2013-10-23 23:02:42 +00:00
|
|
|
rtmsg_ifinfo(RTM_NEWLINK, dev, ~0U, GFP_KERNEL);
|
2009-12-12 22:11:15 +00:00
|
|
|
|
2007-09-12 11:53:49 +00:00
|
|
|
synchronize_net();
|
|
|
|
err = 0;
|
|
|
|
out:
|
|
|
|
return err;
|
|
|
|
}
|
2009-07-13 22:33:35 +00:00
|
|
|
EXPORT_SYMBOL_GPL(dev_change_net_namespace);
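The name-selection step at the top of dev_change_net_namespace() keeps the current name if it is free in the destination namespace, otherwise falls back to the caller's pattern, and gives up if there is none. A rough userspace sketch of that decision follows. It is an illustration only: `pick_name`, `name_taken` and `MAX_UNITS` are hypothetical, the destination namespace is modelled as a plain array of strings, and the fallback takes a simple prefix such as "dev" rather than a printf-style pattern like "dev%d".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NAME_LEN 16   /* same size as the kernel's IFNAMSIZ */
#define MAX_UNITS 100 /* hypothetical search bound for this sketch */

/* Returns 1 if 'name' already exists in the destination namespace. */
static int name_taken(const char *const *taken, size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(taken[i], name) == 0)
			return 1;
	return 0;
}

/*
 * Decide which name the device gets after the move.
 * Returns 0 and fills 'out' on success, -1 if no usable name exists.
 */
static int pick_name(const char *const *taken, size_t n,
		     const char *current, const char *prefix, char *out)
{
	/* Common case: the current name is free, keep it. */
	if (!name_taken(taken, n, current)) {
		snprintf(out, NAME_LEN, "%s", current);
		return 0;
	}

	/* Name clash and no fallback supplied: the move is refused. */
	if (!prefix)
		return -1;

	/* Try prefix0, prefix1, ... until a free name is found. */
	for (int unit = 0; unit < MAX_UNITS; unit++) {
		snprintf(out, NAME_LEN, "%s%d", prefix, unit);
		if (!name_taken(taken, n, out))
			return 0;
	}
	return -1;
}
```

For example, with "eth0", "dev0" and "dev1" already present in the destination, moving a device called "eth0" with the prefix "dev" yields "dev2", while moving it with no fallback fails.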
|
2007-09-12 11:53:49 +00:00
|
|
|
|
2016-11-03 14:50:04 +00:00
|
|
|
static int dev_cpu_dead(unsigned int oldcpu)
|
2005-04-16 22:20:36 +00:00
|
|
|
{
|
|
|
|
struct sk_buff **list_skb;
|
|
|
|
struct sk_buff *skb;
|
2016-11-03 14:50:04 +00:00
|
|
|
unsigned int cpu;
|
2017-06-13 11:24:55 +00:00
|
|
|
struct softnet_data *sd, *oldsd, *remsd = NULL;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
local_irq_disable();
|
|
|
|
cpu = smp_processor_id();
|
|
|
|
sd = &per_cpu(softnet_data, cpu);
|
|
|
|
oldsd = &per_cpu(softnet_data, oldcpu);
|
|
|
|
|
|
|
|
/* Find end of our completion_queue. */
|
|
|
|
list_skb = &sd->completion_queue;
|
|
|
|
while (*list_skb)
|
|
|
|
list_skb = &(*list_skb)->next;
|
|
|
|
/* Append completion queue from offline CPU. */
|
|
|
|
*list_skb = oldsd->completion_queue;
|
|
|
|
oldsd->completion_queue = NULL;
|
|
|
|
|
|
|
|
/* Append output queue from offline CPU. */
|
2010-04-26 23:06:24 +00:00
|
|
|
if (oldsd->output_queue) {
|
|
|
|
*sd->output_queue_tailp = oldsd->output_queue;
|
|
|
|
sd->output_queue_tailp = oldsd->output_queue_tailp;
|
|
|
|
oldsd->output_queue = NULL;
|
|
|
|
oldsd->output_queue_tailp = &oldsd->output_queue;
|
|
|
|
}
|
2015-01-16 01:04:22 +00:00
|
|
|
/* Append NAPI poll list from offline CPU, with one exception :
|
|
|
|
* process_backlog() must be called by cpu owning percpu backlog.
|
|
|
|
* We properly handle process_queue & input_pkt_queue later.
|
|
|
|
*/
|
|
|
|
while (!list_empty(&oldsd->poll_list)) {
|
|
|
|
struct napi_struct *napi = list_first_entry(&oldsd->poll_list,
|
|
|
|
struct napi_struct,
|
|
|
|
poll_list);
|
|
|
|
|
|
|
|
list_del_init(&napi->poll_list);
|
|
|
|
if (napi->poll == process_backlog)
|
|
|
|
napi->state = 0;
|
|
|
|
else
|
|
|
|
____napi_schedule(sd, napi);
|
2011-06-06 20:50:03 +00:00
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
raise_softirq_irqoff(NET_TX_SOFTIRQ);
|
|
|
|
local_irq_enable();
|
|
|
|
|
2017-06-09 08:54:58 +00:00
|
|
|
#ifdef CONFIG_RPS
|
|
|
|
remsd = oldsd->rps_ipi_list;
|
|
|
|
oldsd->rps_ipi_list = NULL;
|
|
|
|
#endif
|
|
|
|
/* send out pending IPI's on offline CPU */
|
|
|
|
net_rps_send_ipi(remsd);
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/* Process offline CPU's input_pkt_queue */
|
2010-05-20 18:37:59 +00:00
|
|
|
while ((skb = __skb_dequeue(&oldsd->process_queue))) {
|
2015-02-05 22:58:14 +00:00
|
|
|
netif_rx_ni(skb);
|
2010-05-20 18:37:59 +00:00
|
|
|
input_queue_head_incr(oldsd);
|
rfs: Receive Flow Steering
This patch implements receive flow steering (RFS). RFS steers
received packets for layer 3 and 4 processing to the CPU where
the application for the corresponding flow is running. RFS is an
extension of Receive Packet Steering (RPS).
The basic idea of RFS is that when an application calls recvmsg
(or sendmsg) the application's running CPU is stored in a hash
table that is indexed by the connection's rxhash which is stored in
the socket structure. The rxhash is passed in skb's received on
the connection from netif_receive_skb. For each received packet,
the associated rxhash is used to look up the CPU in the hash table,
if a valid CPU is set then the packet is steered to that CPU using
the RPS mechanisms.
The convolution of the simple approach is that it would potentially
allow OOO packets. If threads are thrashing around CPUs or multiple
threads are trying to read from the same sockets, a quickly changing
CPU value in the hash table could cause rampant OOO packets--
we consider this a non-starter.
To avoid OOO packets, this solution implements two types of hash
tables: rps_sock_flow_table and rps_dev_flow_table.
rps_sock_flow_table is a global hash table. Each entry is just a CPU
number and it is populated in recvmsg and sendmsg as described above.
This table contains the "desired" CPUs for flows.
rps_dev_flow_table is specific to each device queue. Each entry
contains a CPU and a tail queue counter. The CPU is the "current"
CPU for a matching flow. The tail queue counter holds the value
of a tail queue counter for the associated CPU's backlog queue at
the time of last enqueue for a flow matching the entry.
Each backlog queue has a queue head counter which is incremented
on dequeue, and so a queue tail counter is computed as queue head
count + queue length. When a packet is enqueued on a backlog queue,
the current value of the queue tail counter is saved in the hash
entry of the rps_dev_flow_table.
And now the trick: when selecting the CPU for RPS (get_rps_cpu)
the rps_sock_flow table and the rps_dev_flow table for the RX queue
are consulted. When the desired CPU for the flow (found in the
rps_sock_flow table) does not match the current CPU (found in the
rps_dev_flow table), the current CPU is changed to the desired CPU
if one of the following is true:
- The current CPU is unset (equal to RPS_NO_CPU)
- Current CPU is offline
- The current CPU's queue head counter >= queue tail counter in the
rps_dev_flow table. This checks if the queue tail has advanced
beyond the last packet that was enqueued using this table entry.
This guarantees that all packets queued using this entry have been
dequeued, thus preserving in order delivery.
Making each queue have its own rps_dev_flow table has two advantages:
1) the tail queue counters will be written on each receive, so
keeping the table local to the interrupting CPU is good for locality. 2)
this allows lockless access to the table-- the CPU number and queue
tail counter need to be accessed together under mutual exclusion
from netif_receive_skb, we assume that this is only called from
device napi_poll which is non-reentrant.
This patch implements RFS for TCP and connected UDP sockets.
It should be usable for other flow oriented protocols.
There are two configuration parameters for RFS. The
"rps_flow_entries" kernel init parameter sets the number of
entries in the rps_sock_flow_table, the per rxqueue sysfs entry
"rps_flow_cnt" contains the number of entries in the rps_dev_flow
table for the rxqueue. Both are rounded to power of two.
The obvious benefit of RFS (over just RPS) is that it achieves
CPU locality between the receive processing for a flow and the
applications processing; this can result in increased performance
(higher pps, lower latency).
The benefits of RFS are dependent on cache hierarchy, application
load, and other factors. On simple benchmarks, we don't necessarily
see improvement and sometimes see degradation. However, for more
complex benchmarks and for applications where cache pressure is
much higher this technique seems to perform very well.
Below are some benchmark results which show the potential benefit of
this patch. The netperf test has 500 instances of netperf TCP_RR
test with 1 byte req. and resp. The RPC test is a request/response
test similar in structure to the netperf RR test, with 100 threads on
each host, but it does more work in userspace than netperf.
e1000e on 8 core Intel
  No RFS or RPS:              104K tps at 30% CPU
  No RFS (best RPS config):   290K tps at 63% CPU
  RFS:                        303K tps at 61% CPU
RPC test      tps    CPU%   50/90/99% usec latency   Latency StdDev
  No RFS/RPS  103K   48%    757/900/3185             4472.35
  RPS only    174K   73%    415/993/2468             491.66
  RFS         223K   73%    379/651/1382             315.61
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2010-04-16 23:01:27 +00:00
|
|
|
}
|
2015-01-16 01:04:22 +00:00
|
|
|
while ((skb = skb_dequeue(&oldsd->input_pkt_queue))) {
|
2015-02-05 22:58:14 +00:00
|
|
|
netif_rx_ni(skb);
|
2010-05-20 18:37:59 +00:00
|
|
|
input_queue_head_incr(oldsd);
|
|
|
|
}
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2016-11-03 14:50:04 +00:00
|
|
|
return 0;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
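dev_cpu_dead() moves two singly linked queues from the offline CPU to the current one in two different ways: the completion queue is appended by walking to the terminating NULL link, while the output queue uses a cached tail pointer. The userspace sketch below contrasts the two techniques. The `node` and `tail_queue` types and all function names are hypothetical, and the interrupt masking the kernel does around this is omitted.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	int id;
	struct node *next;
};

/*
 * Append list '*from' to the end of list '*to' and empty '*from'.
 * 'link' always holds the address of the pointer that would need to
 * change, so an empty destination needs no special case.
 */
static void splice_tail(struct node **to, struct node **from)
{
	struct node **link = to;

	while (*link)
		link = &(*link)->next;
	*link = *from;
	*from = NULL;
}

/* Queue that remembers the address of its terminating link. */
struct tail_queue {
	struct node *head;
	struct node **tailp;
};

static void tail_queue_init(struct tail_queue *q)
{
	q->head = NULL;
	q->tailp = &q->head;
}

static void tail_queue_push(struct tail_queue *q, struct node *n)
{
	n->next = NULL;
	*q->tailp = n;
	q->tailp = &n->next;
}

/* Constant-time splice: both queues already know where their tail is. */
static void tail_queue_splice(struct tail_queue *to, struct tail_queue *from)
{
	if (!from->head)
		return;
	*to->tailp = from->head;
	to->tailp = from->tailp;
	tail_queue_init(from); /* source is left empty and reusable */
}
```

The empty check in tail_queue_splice() is needed for the same reason as the `if (oldsd->output_queue)` test in the function above: copying the tail pointer of an empty source would leave the destination's tail pointing into the source structure.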
|
|
|
|
|
2007-08-10 22:47:58 +00:00
|
|
|
/**
|
2008-10-23 08:11:29 +00:00
|
|
|
* netdev_increment_features - increment feature set by one
|
|
|
|
* @all: current feature set
|
|
|
|
* @one: new feature set
|
|
|
|
* @mask: mask feature set
|
2007-08-10 22:47:58 +00:00
|
|
|
*
|
|
|
|
* Computes a new feature set after adding a device with feature set
|
2008-10-23 08:11:29 +00:00
|
|
|
* @one to the master device with current feature set @all. Will not
|
|
|
|
* enable anything that is off in @mask. Returns the new feature set.
|
2007-08-10 22:47:58 +00:00
|
|
|
*/
|
2011-11-15 15:29:55 +00:00
|
|
|
netdev_features_t netdev_increment_features(netdev_features_t all,
|
|
|
|
netdev_features_t one, netdev_features_t mask)
|
2008-10-23 08:11:29 +00:00
|
|
|
{
|
2015-12-14 19:19:44 +00:00
|
|
|
if (mask & NETIF_F_HW_CSUM)
|
2015-12-14 19:19:43 +00:00
|
|
|
mask |= NETIF_F_CSUM_MASK;
|
2011-04-22 06:31:16 +00:00
|
|
|
mask |= NETIF_F_VLAN_CHALLENGED;
|
2007-08-10 22:47:58 +00:00
|
|
|
|
2015-12-14 19:19:43 +00:00
|
|
|
all |= one & (NETIF_F_ONE_FOR_ALL | NETIF_F_CSUM_MASK) & mask;
|
2011-04-22 06:31:16 +00:00
|
|
|
all &= one | ~NETIF_F_ALL_FOR_ALL;
|
2011-04-05 05:30:30 +00:00
|
|
|
|
2011-04-22 06:31:16 +00:00
|
|
|
/* If one device supports hw checksumming, set for all. */
|
2015-12-14 19:19:44 +00:00
|
|
|
if (all & NETIF_F_HW_CSUM)
|
|
|
|
all &= ~(NETIF_F_CSUM_MASK & ~NETIF_F_HW_CSUM);
|
2007-08-10 22:47:58 +00:00
|
|
|
|
|
|
|
return all;
|
|
|
|
}
|
2008-10-23 08:11:29 +00:00
|
|
|
EXPORT_SYMBOL(netdev_increment_features);
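The merge rules in netdev_increment_features() can be tried outside the kernel. What follows is a minimal userspace sketch, not the kernel's code: the F_* bit positions and the membership of F_ONE_FOR_ALL and F_ALL_FOR_ALL are hypothetical stand-ins chosen for illustration, and increment_features() is an assumed name that only mirrors the statements above.

```c
#include <stdint.h>

typedef uint64_t features_t;

/* Hypothetical bit positions; the real flags are defined by the kernel. */
#define F_SG              (1ULL << 0)
#define F_IP_CSUM         (1ULL << 1)
#define F_HW_CSUM         (1ULL << 2)
#define F_IPV6_CSUM       (1ULL << 3)
#define F_VLAN_CHALLENGED (1ULL << 4)
#define F_FSO             (1ULL << 5)

#define F_CSUM_MASK   (F_IP_CSUM | F_HW_CSUM | F_IPV6_CSUM)
/* Assumed grouping: set on the master if any one device has it. */
#define F_ONE_FOR_ALL (F_SG | F_VLAN_CHALLENGED)
/* Assumed grouping: kept on the master only if every device has it. */
#define F_ALL_FOR_ALL (F_FSO)

static features_t increment_features(features_t all, features_t one,
				     features_t mask)
{
	/* Allowing generic checksumming allows every checksum flavour. */
	if (mask & F_HW_CSUM)
		mask |= F_CSUM_MASK;
	/* VLAN_CHALLENGED is always allowed to propagate. */
	mask |= F_VLAN_CHALLENGED;

	/* Add what the new device offers, limited by the mask. */
	all |= one & (F_ONE_FOR_ALL | F_CSUM_MASK) & mask;
	/* Drop all-for-all features that the new device lacks. */
	all &= one | ~F_ALL_FOR_ALL;

	/* Generic checksumming supersedes the protocol-specific flags. */
	if (all & F_HW_CSUM)
		all &= ~(F_CSUM_MASK & ~F_HW_CSUM);

	return all;
}
```

With these toy values, starting from all = F_IP_CSUM and adding a device with one = F_HW_CSUM under mask = F_HW_CSUM leaves only F_HW_CSUM set, because the final step clears the narrower checksum flags once the generic one is present.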
|
2007-08-10 22:47:58 +00:00
|
|
|
|
2013-06-02 20:43:55 +00:00
|
|
|
static struct hlist_head * __net_init netdev_create_hash(void)
|
2007-09-16 22:40:33 +00:00
|
|
|
{
|
|
|
|
int i;
|
|
|
|
struct hlist_head *hash;
|
|
|
|
|
|
|
|
hash = kmalloc_array(NETDEV_HASHENTRIES, sizeof(*hash), GFP_KERNEL);
|
|
|
|
if (hash != NULL)
|
|
|
|
for (i = 0; i < NETDEV_HASHENTRIES; i++)
|
|
|
|
INIT_HLIST_HEAD(&hash[i]);
|
|
|
|
|
|
|
|
return hash;
|
|
|
|
}
|
|
|
|
|
2007-09-17 18:56:21 +00:00
|
|
|
/* Initialize per network namespace state */
|
2007-10-09 03:38:39 +00:00
|
|
|
static int __net_init netdev_init(struct net *net)
|
2007-09-17 18:56:21 +00:00
|
|
|
{
|
2012-07-18 09:06:07 +00:00
|
|
|
if (net != &init_net)
|
|
|
|
INIT_LIST_HEAD(&net->dev_base_head);
|
2007-09-17 18:56:21 +00:00
|
|
|
|
2007-09-16 22:40:33 +00:00
|
|
|
net->dev_name_head = netdev_create_hash();
|
|
|
|
if (net->dev_name_head == NULL)
|
|
|
|
goto err_name;
|
2007-09-17 18:56:21 +00:00
|
|
|
|
2007-09-16 22:40:33 +00:00
|
|
|
net->dev_index_head = netdev_create_hash();
|
|
|
|
if (net->dev_index_head == NULL)
|
|
|
|
goto err_idx;
|
2007-09-17 18:56:21 +00:00
|
|
|
|
|
|
|
return 0;
|
2007-09-16 22:40:33 +00:00
|
|
|
|
|
|
|
err_idx:
|
|
|
|
kfree(net->dev_name_head);
|
|
|
|
err_name:
|
|
|
|
return -ENOMEM;
|
2007-09-17 18:56:21 +00:00
|
|
|
}
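The two-table setup in netdev_init() uses the usual goto-based unwind: each failure label releases only what was allocated before it. A minimal userspace sketch of the same shape follows, assuming malloc()/free() in place of kmalloc()/kfree(); struct bucket, struct ns_tables, HASH_ENTRIES and the function names are hypothetical and exist only for this illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define HASH_ENTRIES 256	/* hypothetical; stands in for NETDEV_HASHENTRIES */

struct bucket {
	void *first;		/* stands in for struct hlist_head */
};

struct ns_tables {
	struct bucket *by_name;
	struct bucket *by_index;
};

/* Allocate one table and mark every bucket empty; NULL on failure. */
static struct bucket *create_hash(void)
{
	struct bucket *hash = malloc(sizeof(*hash) * HASH_ENTRIES);
	size_t i;

	if (hash != NULL)
		for (i = 0; i < HASH_ENTRIES; i++)
			hash[i].first = NULL;
	return hash;
}

static int ns_tables_init(struct ns_tables *t)
{
	t->by_name = create_hash();
	if (t->by_name == NULL)
		goto err_name;

	t->by_index = create_hash();
	if (t->by_index == NULL)
		goto err_idx;

	return 0;

err_idx:
	free(t->by_name);	/* undo only the step that succeeded */
err_name:
	return -ENOMEM;
}

/* Counterpart of the exit hook: release both tables. */
static void ns_tables_exit(struct ns_tables *t)
{
	free(t->by_name);
	free(t->by_index);
}
```

Ordering the labels in reverse of the allocations keeps the error path a straight fall-through, so adding a third table later only needs one more label.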
|
|
|
|
|
2008-09-30 09:23:58 +00:00
|
|
|
/**
|
|
|
|
* netdev_drivername - network driver for the device
|
|
|
|
* @dev: network device
|
|
|
|
*
|
|
|
|
* Determine network driver for device.
|
|
|
|
*/
|
2011-06-06 23:41:33 +00:00
|
|
|
const char *netdev_drivername(const struct net_device *dev)
|
2008-07-21 20:31:48 +00:00
|
|
|
{
|
2008-09-30 09:22:14 +00:00
|
|
|
const struct device_driver *driver;
|
|
|
|
const struct device *parent;
|
2011-06-06 23:41:33 +00:00
|
|
|
const char *empty = "";
|
2008-07-21 20:31:48 +00:00
|
|
|
|
|
|
|
parent = dev->dev.parent;
|
|
|
|
if (!parent)
|
2011-06-06 23:41:33 +00:00
|
|
|
return empty;
|
2008-07-21 20:31:48 +00:00
|
|
|
|
|
|
|
driver = parent->driver;
|
|
|
|
if (driver && driver->name)
|
2011-06-06 23:41:33 +00:00
|
|
|
return driver->name;
|
|
|
|
return empty;
|
2008-07-21 20:31:48 +00:00
|
|
|
}
|
|
|
|
|
2014-09-22 18:10:50 +00:00
|
|
|
static void __netdev_printk(const char *level, const struct net_device *dev,
|
|
|
|
struct va_format *vaf)
|
2010-06-27 01:02:35 +00:00
|
|
|
{
|
2012-09-13 03:12:19 +00:00
|
|
|
if (dev && dev->dev.parent) {
|
2014-09-22 18:10:50 +00:00
|
|
|
dev_printk_emit(level[1] - '0',
|
|
|
|
dev->dev.parent,
|
|
|
|
"%s %s %s%s: %pV",
|
|
|
|
dev_driver_string(dev->dev.parent),
|
|
|
|
dev_name(dev->dev.parent),
|
|
|
|
netdev_name(dev), netdev_reg_state(dev),
|
|
|
|
vaf);
|
2012-09-13 03:12:19 +00:00
|
|
|
} else if (dev) {
|
2014-09-22 18:10:50 +00:00
|
|
|
printk("%s%s%s: %pV",
|
|
|
|
level, netdev_name(dev), netdev_reg_state(dev), vaf);
|
2012-09-13 03:12:19 +00:00
|
|
|
} else {
|
2014-09-22 18:10:50 +00:00
|
|
|
printk("%s(NULL net_device): %pV", level, vaf);
|
2012-09-13 03:12:19 +00:00
|
|
|
}
|
2010-06-27 01:02:35 +00:00
|
|
|
}
|
|
|
|
|
2014-09-22 18:10:50 +00:00
|
|
|
void netdev_printk(const char *level, const struct net_device *dev,
|
|
|
|
const char *format, ...)
|
2010-06-27 01:02:35 +00:00
|
|
|
{
|
|
|
|
struct va_format vaf;
|
|
|
|
va_list args;
|
|
|
|
|
|
|
|
va_start(args, format);
|
|
|
|
|
|
|
|
vaf.fmt = format;
|
|
|
|
vaf.va = &args;
|
|
|
|
|
2014-09-22 18:10:50 +00:00
|
|
|
__netdev_printk(level, dev, &vaf);
|
2012-09-13 03:12:19 +00:00
|
|
|
|
2010-06-27 01:02:35 +00:00
|
|
|
va_end(args);
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(netdev_printk);
|
|
|
|
|
|
|
|
#define define_netdev_printk_level(func, level) \
|
2014-09-22 18:10:50 +00:00
|
|
|
void func(const struct net_device *dev, const char *fmt, ...) \
|
2010-06-27 01:02:35 +00:00
|
|
|
{ \
|
|
|
|
struct va_format vaf; \
|
|
|
|
va_list args; \
|
|
|
|
\
|
|
|
|
va_start(args, fmt); \
|
|
|
|
\
|
|
|
|
vaf.fmt = fmt; \
|
|
|
|
vaf.va = &args; \
|
|
|
|
\
|
2014-09-22 18:10:50 +00:00
|
|
|
__netdev_printk(level, dev, &vaf); \
|
2012-09-13 03:12:19 +00:00
|
|
|
\
|
2010-06-27 01:02:35 +00:00
|
|
|
va_end(args); \
|
|
|
|
} \
|
|
|
|
EXPORT_SYMBOL(func);
|
|
|
|
|
|
|
|
define_netdev_printk_level(netdev_emerg, KERN_EMERG);
|
|
|
|
define_netdev_printk_level(netdev_alert, KERN_ALERT);
|
|
|
|
define_netdev_printk_level(netdev_crit, KERN_CRIT);
|
|
|
|
define_netdev_printk_level(netdev_err, KERN_ERR);
|
|
|
|
define_netdev_printk_level(netdev_warn, KERN_WARNING);
|
|
|
|
define_netdev_printk_level(netdev_notice, KERN_NOTICE);
|
|
|
|
define_netdev_printk_level(netdev_info, KERN_INFO);
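The define_netdev_printk_level() macro stamps out one variadic wrapper per log level, each forwarding its arguments to a shared helper. The same pattern can be sketched in userspace as below. This is an illustration under stated assumptions, not the kernel API: log_emit(), DEFINE_LOG_LEVEL(), last_line and the "<3>"/"<6>" prefixes are hypothetical, and vsnprintf() stands in for the kernel's struct va_format and %pV handling.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical sink: keeps the most recent message for inspection. */
static char last_line[256];

/* Shared helper: prefix with level and device name, then format the rest. */
static void log_emit(const char *level, const char *name,
		     const char *fmt, va_list ap)
{
	int n = snprintf(last_line, sizeof(last_line), "%s%s: ", level,
			 name ? name : "(NULL net_device)");

	if (n < 0 || (size_t)n >= sizeof(last_line))
		return;
	vsnprintf(last_line + n, sizeof(last_line) - (size_t)n, fmt, ap);
}

/* Generate one variadic wrapper per level, as the macro above does. */
#define DEFINE_LOG_LEVEL(func, level)				\
static void func(const char *name, const char *fmt, ...)	\
{								\
	va_list args;						\
								\
	va_start(args, fmt);					\
	log_emit(level, name, fmt, args);			\
	va_end(args);						\
}

DEFINE_LOG_LEVEL(log_err, "<3>")
DEFINE_LOG_LEVEL(log_info, "<6>")
```

A call such as log_err("eth0", "link down after %d ms", 50) goes through the generated wrapper into log_emit(), so the prefixing logic is written once rather than per level.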
|
|
|
|
|
2007-10-09 03:38:39 +00:00
|
|
|
static void __net_exit netdev_exit(struct net *net)
|
2007-09-17 18:56:21 +00:00
|
|
|
{
|
|
|
|
kfree(net->dev_name_head);
|
|
|
|
kfree(net->dev_index_head);
|
|
|
|
}
|
|
|
|
|
2007-11-13 11:23:50 +00:00
|
|
|
static struct pernet_operations __net_initdata netdev_net_ops = {
|
2007-09-17 18:56:21 +00:00
|
|
|
.init = netdev_init,
|
|
|
|
.exit = netdev_exit,
|
|
|
|
};
|
|
|
|
|
2007-10-09 03:38:39 +00:00
|
|
|
static void __net_exit default_device_exit(struct net *net)
|
2007-09-12 11:53:49 +00:00
|
|
|
{
|
2009-11-29 22:25:30 +00:00
|
|
|
struct net_device *dev, *aux;
|
2007-09-12 11:53:49 +00:00
|
|
|
/*
|
2009-11-29 22:25:30 +00:00
|
|
|
* Push all migratable network devices back to the
|
2007-09-12 11:53:49 +00:00
|
|
|
* initial network namespace
|
|
|
|
*/
|
|
|
|
rtnl_lock();
|
2009-11-29 22:25:30 +00:00
|
|
|
for_each_netdev_safe(net, dev, aux) {
|
2007-09-12 11:53:49 +00:00
|
|
|
int err;
|
netns: Fix arbitrary net_device-s corruptions on net_ns stop.
When a net namespace is destroyed, some devices (those not explicitly
killed on ns stop) are moved back to init_net.
The problem is that this net_ns change has one point of failure:
__dev_alloc_name() may be called if a name collision occurs (and
this is easy to trigger). This allocator performs a likely-to-fail
GFP_ATOMIC allocation to find a suitable number. The other possible
error conditions (the device being ns-local or not registered) are
always false in this case.
So, when this call fails, the device is unregistered. But this is
*not* the right thing to do, since afterwards the device may be
released (and kfree-ed) improperly. E.g. bridges require more
actions (sysfs update, timer disarming, etc.), and some other devices
want to remove their private areas from lists, etc.
I.e. arbitrary use-after-free cases may occur.
The proposed fix is the following: since the only reason for
dev_change_net_namespace to fail is the name generation, we may
give it a unique fall-back name w/o %d-s in it - the dev<ifindex>
one, since ifindexes are still unique.
So make this change, raise the failure-case printk loglevel to
EMERG and replace the unregister_netdevice call with BUG().
[ Use snprintf() -DaveM ]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-05-08 08:24:25 +00:00
|
|
|
char fb_name[IFNAMSIZ];
|
2007-09-12 11:53:49 +00:00
|
|
|
|
|
|
|
/* Ignore unmoveable devices (i.e. loopback) */
|
|
|
|
if (dev->features & NETIF_F_NETNS_LOCAL)
|
|
|
|
continue;
|
|
|
|
|
2009-11-29 22:25:30 +00:00
|
|
|
/* Leave virtual devices for the generic cleanup */
|
|
|
|
if (dev->rtnl_link_ops)
|
|
|
|
continue;
|
2008-11-05 23:59:38 +00:00
|
|
|
|
2011-03-31 01:57:33 +00:00
|
|
|
/* Push remaining network devices to init_net */
|
2008-05-08 08:24:25 +00:00
|
|
|
snprintf(fb_name, IFNAMSIZ, "dev%d", dev->ifindex);
|
|
|
|
err = dev_change_net_namespace(dev, &init_net, fb_name);
|
2007-09-12 11:53:49 +00:00
|
|
|
if (err) {
|
2012-02-01 10:54:43 +00:00
|
|
|
pr_emerg("%s: failed to move %s to init_net: %d\n",
|
|
|
|
__func__, dev->name, err);
|
2008-05-08 08:24:25 +00:00
|
|
|
BUG();
|
2007-09-12 11:53:49 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
rtnl_unlock();
|
|
|
|
}
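The fallback name passed to dev_change_net_namespace() is built from the interface index, which is what keeps it unique. A minimal userspace sketch of that step follows, assuming only standard snprintf(); fallback_name() is a hypothetical helper, and the truncation check is an addition for illustration that the loop above does not perform.

```c
#include <stdio.h>

#define IFNAMSIZ 16	/* size of an interface-name buffer, including the NUL */

/* Returns 0 on success, -1 if the name would not fit in IFNAMSIZ bytes. */
static int fallback_name(char buf[IFNAMSIZ], int ifindex)
{
	int n = snprintf(buf, IFNAMSIZ, "dev%d", ifindex);

	return (n < 0 || n >= IFNAMSIZ) ? -1 : 0;
}
```

Because the name contains no %d pattern left to expand, no further allocation is needed to resolve it, which is the point of the fallback.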
|
|
|
|
|
2013-09-24 04:19:49 +00:00
|
|
|
static void __net_exit rtnl_lock_unregistering(struct list_head *net_list)
|
|
|
|
{
|
|
|
|
/* Return with the rtnl_lock held when there are no network
|
|
|
|
* devices unregistering in any network namespace in net_list.
|
|
|
|
*/
|
|
|
|
struct net *net;
|
|
|
|
bool unregistering;
|
2014-10-29 16:04:56 +00:00
|
|
|
DEFINE_WAIT_FUNC(wait, woken_wake_function);
|
2013-09-24 04:19:49 +00:00
|
|
|
|
2014-10-29 16:04:56 +00:00
|
|
|
add_wait_queue(&netdev_unregistering_wq, &wait);
|
2013-09-24 04:19:49 +00:00
|
|
|
for (;;) {
|
|
|
|
unregistering = false;
|
|
|
|
rtnl_lock();
|
|
|
|
list_for_each_entry(net, net_list, exit_list) {
|
|
|
|
if (net->dev_unreg_count > 0) {
|
|
|
|
unregistering = true;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!unregistering)
|
|
|
|
break;
|
|
|
|
__rtnl_unlock();
|
2014-10-29 16:04:56 +00:00
|
|
|
|
|
|
|
wait_woken(&wait, TASK_UNINTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
|
2013-09-24 04:19:49 +00:00
|
|
|
}
|
2014-10-29 16:04:56 +00:00
|
|
|
remove_wait_queue(&netdev_unregistering_wq, &wait);
|
2013-09-24 04:19:49 +00:00
|
|
|
}
|
|
|
|
|
2009-12-03 02:29:04 +00:00
|
|
|
static void __net_exit default_device_exit_batch(struct list_head *net_list)
|
|
|
|
{
|
|
|
|
/* At exit all network devices must be removed from a network
|
tree-wide: fix comment/printk typos
"gadget", "through", "command", "maintain", "maintain", "controller", "address",
"between", "initiali[zs]e", "instead", "function", "select", "already",
"equal", "access", "management", "hierarchy", "registration", "interest",
"relative", "memory", "offset", "already",
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-11-01 19:38:34 +00:00
|
|
|
* namespace. Do this in the reverse order of registration.
|
2009-12-03 02:29:04 +00:00
|
|
|
* Do this across as many network namespaces as possible to
|
|
|
|
* improve batching efficiency.
|
|
|
|
*/
|
|
|
|
struct net_device *dev;
|
|
|
|
struct net *net;
|
|
|
|
LIST_HEAD(dev_kill_list);
|
|
|
|
|
2013-09-24 04:19:49 +00:00
|
|
|
/* To prevent network device cleanup code from dereferencing
|
|
|
|
* loopback devices or network devices that have been freed,
|
|
|
|
* wait here for all pending unregistrations to complete,
|
|
|
|
* before unregistering the loopback device and allowing the
|
|
|
|
* network namespace to be freed.
|
|
|
|
*
|
|
|
|
* The netdev todo list containing all network device
|
|
|
|
* unregistrations that happen in default_device_exit_batch
|
|
|
|
* will run in the rtnl_unlock() at the end of
|
|
|
|
* default_device_exit_batch.
|
|
|
|
*/
|
|
|
|
rtnl_lock_unregistering(net_list);
|
2009-12-03 02:29:04 +00:00
|
|
|
list_for_each_entry(net, net_list, exit_list) {
|
|
|
|
for_each_netdev_reverse(net, dev) {
|
2014-06-26 07:58:25 +00:00
|
|
|
if (dev->rtnl_link_ops && dev->rtnl_link_ops->dellink)
|
2009-12-03 02:29:04 +00:00
|
|
|
dev->rtnl_link_ops->dellink(dev, &dev_kill_list);
|
|
|
|
else
|
|
|
|
unregister_netdevice_queue(dev, &dev_kill_list);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
unregister_netdevice_many(&dev_kill_list);
|
|
|
|
rtnl_unlock();
|
|
|
|
}
|
|
|
|
|
2007-11-13 11:23:50 +00:00
|
|
|
static struct pernet_operations __net_initdata default_device_ops = {
|
2007-09-12 11:53:49 +00:00
|
|
|
.exit = default_device_exit,
|
2009-12-03 02:29:04 +00:00
|
|
|
.exit_batch = default_device_exit_batch,
|
2007-09-12 11:53:49 +00:00
|
|
|
};
|
|
|
|
|
2005-04-16 22:20:36 +00:00
|
|
|
/*
|
|
|
|
* Initialize the DEV module. At boot time this walks the device list and
|
|
|
|
* unhooks any devices that fail to initialise (normally hardware not
|
|
|
|
* present) and leaves us with a valid list of present and active devices.
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This is called single threaded during boot, so no need
|
|
|
|
* to take the rtnl semaphore.
|
|
|
|
*/
|
|
|
|
static int __init net_dev_init(void)
|
|
|
|
{
|
|
|
|
int i, rc = -ENOMEM;
|
|
|
|
|
|
|
|
BUG_ON(!dev_boot_phase);
|
|
|
|
|
|
|
|
if (dev_proc_init())
|
|
|
|
goto out;
|
|
|
|
|
2007-09-27 05:02:53 +00:00
|
|
|
if (netdev_kobject_init())
|
2005-04-16 22:20:36 +00:00
|
|
|
goto out;
|
|
|
|
|
|
|
|
INIT_LIST_HEAD(&ptype_all);
|
2007-11-26 12:12:58 +00:00
|
|
|
for (i = 0; i < PTYPE_HASH_SIZE; i++)
|
2005-04-16 22:20:36 +00:00
|
|
|
INIT_LIST_HEAD(&ptype_base[i]);
|
|
|
|
|
2012-11-15 08:49:10 +00:00
|
|
|
INIT_LIST_HEAD(&offload_base);
|
|
|
|
|
2007-09-17 18:56:21 +00:00
|
|
|
if (register_pernet_subsys(&netdev_net_ops))
|
|
|
|
goto out;
|
2005-04-16 22:20:36 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Initialise the packet receive queues.
|
|
|
|
*/
|
|
|
|
|
2006-04-11 05:52:50 +00:00
|
|
|
for_each_possible_cpu(i) {
|
2016-08-26 19:50:39 +00:00
|
|
|
struct work_struct *flush = per_cpu_ptr(&flush_works, i);
|
2010-04-19 21:17:14 +00:00
|
|
|
struct softnet_data *sd = &per_cpu(softnet_data, i);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2016-08-26 19:50:39 +00:00
|
|
|
INIT_WORK(flush, flush_backlog);
|
|
|
|
|
2010-04-19 21:17:14 +00:00
|
|
|
skb_queue_head_init(&sd->input_pkt_queue);
|
2010-04-27 22:07:33 +00:00
|
|
|
skb_queue_head_init(&sd->process_queue);
|
2010-04-19 21:17:14 +00:00
|
|
|
INIT_LIST_HEAD(&sd->poll_list);
|
2010-04-26 23:06:24 +00:00
|
|
|
sd->output_queue_tailp = &sd->output_queue;
|
2010-03-24 19:13:54 +00:00
|
|
|
#ifdef CONFIG_RPS
|
2010-04-19 21:17:14 +00:00
|
|
|
sd->csd.func = rps_trigger_softirq;
|
|
|
|
sd->csd.info = sd;
|
|
|
|
sd->cpu = i;
|
2010-03-19 00:45:44 +00:00
|
|
|
#endif
|
2010-03-16 08:03:29 +00:00
|
|
|
|
2010-04-19 21:17:14 +00:00
|
|
|
sd->backlog.poll = process_backlog;
|
|
|
|
sd->backlog.weight = weight_p;
|
2005-04-16 22:20:36 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
dev_boot_phase = 0;
|
|
|
|
|
2008-11-08 06:54:20 +00:00
|
|
|
/* The loopback device is special: if any other network device
|
|
|
|
* is present in a network namespace, the loopback device must
|
|
|
|
* be present. Since we now dynamically allocate and free the
|
|
|
|
* loopback device, ensure this invariant is maintained by
|
|
|
|
* keeping the loopback device as the first device on the
|
|
|
|
* list of network devices. This ensures the loopback device
|
|
|
|
* is the first device that appears and the last network device
|
|
|
|
* that disappears.
|
|
|
|
*/
|
|
|
|
if (register_pernet_device(&loopback_net_ops))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (register_pernet_device(&default_device_ops))
|
|
|
|
goto out;
|
|
|
|
|
Remove argument from open_softirq which is always NULL
As git-grep shows, open_softirq() is always called with the last argument
being NULL
block/blk-core.c: open_softirq(BLOCK_SOFTIRQ, blk_done_softirq, NULL);
kernel/hrtimer.c: open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq, NULL);
kernel/rcuclassic.c: open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL);
kernel/rcupreempt.c: open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL);
kernel/sched.c: open_softirq(SCHED_SOFTIRQ, run_rebalance_domains, NULL);
kernel/softirq.c: open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
kernel/softirq.c: open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL);
kernel/timer.c: open_softirq(TIMER_SOFTIRQ, run_timer_softirq, NULL);
net/core/dev.c: open_softirq(NET_TX_SOFTIRQ, net_tx_action, NULL);
net/core/dev.c: open_softirq(NET_RX_SOFTIRQ, net_rx_action, NULL);
This observation has already been made by Matthew Wilcox in June 2002
(http://www.cs.helsinki.fi/linux/linux-kernel/2002-25/0687.html)
"I notice that none of the current softirq routines use the data element
passed to them."
and the situation hasn't changed since then. So it appears we can safely
remove that extra argument to save 128 (54) bytes of kernel data (text).
Signed-off-by: Carlos R. Mafra <crmafra@ift.unesp.br>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-05-15 14:15:37 +00:00
|
|
|
open_softirq(NET_TX_SOFTIRQ, net_tx_action);
|
|
|
|
open_softirq(NET_RX_SOFTIRQ, net_rx_action);
|
2005-04-16 22:20:36 +00:00
|
|
|
|
2016-11-03 14:50:04 +00:00
|
|
|
rc = cpuhp_setup_state_nocalls(CPUHP_NET_DEV_DEAD, "net/dev:dead",
|
|
|
|
NULL, dev_cpu_dead);
|
|
|
|
WARN_ON(rc < 0);
|
2005-04-16 22:20:36 +00:00
|
|
|
rc = 0;
|
|
|
|
out:
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
|
|
|
subsys_initcall(net_dev_init);
|