Commit Graph

1398 Commits

Author SHA1 Message Date
Sabrina Dubroca
952fcfd08c net: remove type_check from dev_get_nest_level()
The idea for type_check in dev_get_nest_level() was to count the number
of nested devices of the same type (currently, only macvlan or vlan
devices).
This prevented the false positive lockdep warning on configurations such
as:

eth0 <--- macvlan0 <--- vlan0 <--- macvlan1

However, this doesn't prevent a warning on a configuration such as:

eth0 <--- macvlan0 <--- vlan0
eth1 <--- vlan1 <--- macvlan1

In this case, all the locks end up with a nesting subclass of 1, so
lockdep thinks that there is still a deadlock:

- in the first case we have (macvlan_netdev_addr_lock_key, 1) and then
  take (vlan_netdev_xmit_lock_key, 1)
- in the second case, we have (vlan_netdev_xmit_lock_key, 1) and then
  take (macvlan_netdev_addr_lock_key, 1)

By removing the linktype check in dev_get_nest_level() and always
incrementing the nesting depth, lockdep considers this configuration
valid.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-13 15:15:54 -07:00
Jiri Kosina
59cc1f61f0 net: sched: convert qdisc linked list to hashtable
Convert the per-device linked list into a hashtable. The primary
motivation for this change is that currently, we're not tracking all the
qdiscs in hierarchy (e.g. excluding default qdiscs), as the lookup
performed over the linked list by qdisc_match_from_root() is rather
expensive.

The ultimate goal is to get rid of hidden qdiscs completely, which will
bring much more determinism in user experience.

Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-10 17:19:02 -07:00
Linus Torvalds
554828ee0d Merge branch 'salted-string-hash'
This changes the vfs dentry hashing to mix in the parent pointer at the
_beginning_ of the hash, rather than at the end.

That actually improves both the hash and the code generation, because we
can move more of the computation to the "static" part of the dcache
setup, and do less at lookup runtime.

It turns out that a lot of other hash users also really wanted to mix in
a base pointer as a 'salt' for the hash, and so the slightly extended
interface ends up working well for other cases too.

Users that want a string hash that is purely about the string pass in a
'salt' pointer of NULL.

* merge branch 'salted-string-hash':
  fs/dcache.c: Save one 32-bit multiply in dcache lookup
  vfs: make the string hashes salt the hash
2016-07-28 12:26:31 -07:00
Brenden Blanco
a7862b4584 net: add ndo to setup/query xdp prog in adapter rx
Add one new netdev op for drivers implementing the BPF_PROG_TYPE_XDP
filter. The single op is used for both setup/query of the xdp program,
modelled after ndo_setup_tc.

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-19 21:46:31 -07:00
Jesper Dangaard Brouer
1db19db7f5 net: tracepoint napi:napi_poll add work and budget
An important information for the napi_poll tracepoint is knowing
the work done (packets processed) by the napi_poll() call. Add
both the work done and budget, as they are related.

Handle trace_napi_poll() param change in dropwatch/drop_monitor
and in python perf script netdev-times.py in backward compat way,
as python fortunately supports optional parameter handling.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-09 18:05:02 -04:00
Jiri Pirko
18bfb924f0 net: introduce default neigh_construct/destroy ndo calls for L2 upper devices
L2 upper device needs to propagate neigh_construct/destroy calls down to
lower devices. Do this by defining default ndo functions and use them in
team, bond, bridge and vlan.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-05 09:06:28 -07:00
Jiri Pirko
7ce856aaaf mlxsw: spectrum: Add couple of lower device helper functions
Add functions that iterate over lower devices and find port device.
As a dependency add netdev_for_each_all_lower_dev and
netdev_for_each_all_lower_dev_rcu macro with
netdev_all_lower_get_next and netdev_all_lower_get_next_rcu shelpers.

Also, add functions to return mlxsw struct according to lower device
found and mlxsw_port struct with a reference to lower device.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-04 18:25:15 -07:00
Eric Dumazet
520ac30f45 net_sched: drop packets after root qdisc lock is released
Qdisc performance suffers when packets are dropped at enqueue()
time because drops (kfree_skb()) are done while qdisc lock is held,
delaying a dequeue() draining the queue.

Nominal throughput can be reduced by 50 % when this happens,
at a time we would like the dequeue() to proceed as fast as possible.

Even FQ is vulnerable to this problem, while one of FQ goals was
to provide some flow isolation.

This patch adds a 'struct sk_buff **to_free' parameter to all
qdisc->enqueue(), and in qdisc_drop() helper.

I measured a performance increase of up to 12 %, but this patch
is a prereq so that future batches in enqueue() can fly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-25 12:19:35 -04:00
Wei Tang
be4da0e340 net: the space is required after ','
The space is missing after ',', and this will introduce much more
noise when checking patch around.

Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-16 17:41:23 -07:00
Wei Tang
84d15ae57d net: do not initialise statics to 0
This patch fixes the checkpatch.pl error to dev.c:

ERROR: do not initialise statics to 0

Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-16 17:41:22 -07:00
Linus Torvalds
8387ff2577 vfs: make the string hashes salt the hash
We always mixed in the parent pointer into the dentry name hash, but we
did it late at lookup time.  It turns out that we can simplify that
lookup-time action by salting the hash with the parent pointer early
instead of late.

A few other users of our string hashes also wanted to mix in their own
pointers into the hash, and those are updated to use the same mechanism.

Hash users that don't have any particular initial salt can just use the
NULL pointer as a no-salt.

Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: George Spelvin <linux@sciencehorizons.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-10 20:21:46 -07:00
Daniel Borkmann
a70b506efe bpf: enforce recursion limit on redirects
Respect the stack's xmit_recursion limit for calls into dev_queue_xmit().
Currently, they are not handeled by the limiter when attached to clsact's
egress parent, for example, and a buggy program redirecting it to the
same device again could run into stack overflow eventually. It would be
good if we could notify an admin to give him a chance to react. We reuse
xmit_recursion instead of having one private to eBPF, so that the stack's
current recursion depth will be taken into account as well. Follow-up to
commit 3896d655f4 ("bpf: introduce bpf_clone_redirect() helper") and
27b29f6305 ("bpf: add bpf_redirect() helper").

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-10 18:00:57 -07:00
Hariprasad Shenai
40e4e713eb net: Reduce queue allocation to one in kdump kernel
When in kdump kernel, reduce memory usage by only using a single Queue
Set for multiqueue devices. So make netif_get_num_default_rss_queues()
return one, when in kdump kernel.

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-08 11:13:58 -07:00
Eric Dumazet
f9eb8aea2a net_sched: transform qdisc running bit into a seqcount
Instead of using a single bit (__QDISC___STATE_RUNNING)
in sch->__state, use a seqcount.

This adds lockdep support, but more importantly it will allow us
to sample qdisc/class statistics without having to grab qdisc root lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07 16:37:13 -07:00
Eric Dumazet
3bcb846ca4 net: get rid of spin_trylock() in net_tx_action()
Note: Tom Herbert posted almost same patch 3 months back, but for
different reasons.

The reasons we want to get rid of this spin_trylock() are :

1) Under high qdisc pressure, the spin_trylock() has almost no
chance to succeed.

2) We loop multiple times in softirq handler, eventually reaching
the max retry count (10), and we schedule ksoftirqd.

Since we want to adhere more strictly to ksoftirqd being waked up in
the future (https://lwn.net/Articles/687617/), better avoid spurious
wakeups.

3) calls to __netif_reschedule() dirty the cache line containing
q->next_sched, slowing down the owner of qdisc.

4) RT kernels can not use the spin_trylock() here.

With help of busylock, we get the qdisc spinlock fast enough, and
the trylock trick brings only performance penalty.

Depending on qdisc setup, I observed a gain of up to 19 % in qdisc
performance (1016600 pps instead of 853400 pps, using prio+tbf+fq_codel)

("mpstat -I SCPU 1" is much happier now)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Acked-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07 15:32:03 -07:00
Daniel Borkmann
7e2c3aea43 net: also make sch_handle_egress() drop monitor ready
Follow-up for 8a3a4c6e7b ("net: make sch_handle_ingress() drop
monitor ready") to also make the egress side drop monitor ready.

Also here only TC_ACT_SHOT is a clear indication that something
went wrong. Hence don't provide false positives to drop monitors
such as 'perf record -e skb:kfree_skb ...'.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-16 14:02:44 -04:00
David Ahern
74b20582ac net: l3mdev: Add hook in ip and ipv6
Currently the VRF driver uses the rx_handler to switch the skb device
to the VRF device. Switching the dev prior to the ip / ipv6 layer
means the VRF driver has to duplicate IP/IPv6 processing which adds
overhead and makes features such as retaining the ingress device index
more complicated than necessary.

This patch moves the hook to the L3 layer just after the first NF_HOOK
for PRE_ROUTING. This location makes exposing the original ingress device
trivial (next patch) and allows adding other NF_HOOKs to the VRF driver
in the future.

dev_queue_xmit_nit is exported so that the VRF driver can cycle the skb
with the switched device through the packet taps to maintain current
behavior (tcpdump can be used on either the vrf device or the enslaved
devices).

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-11 19:31:40 -04:00
Eric Dumazet
8a3a4c6e7b net: make sch_handle_ingress() drop monitor ready
TC_ACT_STOLEN is used when ingress traffic is mirred/redirected
to say ifb.

Packet is not dropped, but consumed.

Only TC_ACT_SHOT is a clear indication something went wrong.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-08 23:53:22 -04:00
Alexander Duyck
b1dc497b28 net: Fix netdev_fix_features so that TSO_MANGLEID is only available with TSO
This change makes it so that we will strip the TSO_MANGLEID bit if TSO is
not present.  This way we will also handle ECN correctly of TSO is not
present.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 13:32:27 -04:00
David S. Miller
cba6532100 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv4/ip_gre.c

Minor conflicts between tunnel bug fixes in net and
ipv6 tunnel cleanups in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-04 00:52:29 -04:00
Alexander Duyck
996e802187 net: Disable segmentation if checksumming is not supported
In the case of the mlx4 and mlx5 driver they do not support IPv6 checksum
offload for tunnels.  With this being the case we should disable GSO in
addition to the checksum offload features when we find that a device cannot
perform a checksum on a given packet type.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03 16:00:54 -04:00
Nikolay Aleksandrov
f4b05d27ec net: constify is_skb_forwardable's arguments
is_skb_forwardable is not supposed to change anything so constify its
arguments

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-29 16:13:36 -04:00
Jason Wang
3df97ba830 tuntap: calculate rps hash only when needed
There's no need to calculate rps hash if it was not enabled. So this
patch export rps_needed and check it before trying to get rps
hash. Tests (using pktgen to inject packets to guest) shows this can
improve pps about 13% (when rps is disabled).

Before:
~1150000 pps
After:
~1300000 pps

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
----
Changes from V1:
- Fix build when CONFIG_RPS is not set
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-28 16:38:54 -04:00
Eric Dumazet
02a1d6e7a6 net: rename NET_{ADD|INC}_STATS_BH()
Rename NET_INC_STATS_BH() to __NET_INC_STATS()
and NET_ADD_STATS_BH() to __NET_ADD_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:24 -04:00
Alexander Duyck
7f348a6076 net: Add support for IP ID mangling TSO in cases that require encapsulation
This patch adds support for NETIF_F_TSO_MANGLEID if a given tunnel supports
NETIF_F_TSO.  This way if needed a device can then later enable the TSO
with IP ID mangling and the tunnels on top of that device can then also
make use of the IP ID mangling as well.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-21 15:11:07 -04:00
Eric Dumazet
d21fd63ea3 net: validate_xmit_skb() changes
skbs given to validate_xmit_skb() should not have a next
pointer anymore.

Also if a packet is dropped, increment dev->tx_dropped
__dev_queue_xmit() no longer has to change tx_dropped in this case.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 21:40:24 -04:00
Alexander Duyck
802ab55adc GSO: Support partial segmentation offload
This patch adds support for something I am referring to as GSO partial.
The basic idea is that we can support a broader range of devices for
segmentation if we use fixed outer headers and have the hardware only
really deal with segmenting the inner header.  The idea behind the naming
is due to the fact that everything before csum_start will be fixed headers,
and everything after will be the region that is handled by hardware.

With the current implementation it allows us to add support for the
following GSO types with an inner TSO_MANGLEID or TSO6 offload:
NETIF_F_GSO_GRE
NETIF_F_GSO_GRE_CSUM
NETIF_F_GSO_IPIP
NETIF_F_GSO_SIT
NETIF_F_UDP_TUNNEL
NETIF_F_UDP_TUNNEL_CSUM

In the case of hardware that already supports tunneling we may be able to
extend this further to support TSO_TCPV4 without TSO_MANGLEID if the
hardware can support updating inner IPv4 headers.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:23:41 -04:00
Alexander Duyck
1530545ed6 GRO: Add support for TCP with fixed IPv4 ID field, limit tunnel IP ID values
This patch does two things.

First it allows TCP to aggregate TCP frames with a fixed IPv4 ID field.  As
a result we should now be able to aggregate flows that were converted from
IPv6 to IPv4.  In addition this allows us more flexibility for future
implementations of segmentation as we may be able to use a fixed IP ID when
segmenting the flow.

The second thing this does is that it places limitations on the outer IPv4
ID header in the case of tunneled frames.  Specifically it forces the IP ID
to be incrementing by 1 unless the DF bit is set in the outer IPv4 header.
This way we can avoid creating overlapping series of IP IDs that could
possibly be fragmented if the frame goes through GRO and is then
resegmented via GSO.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:23:41 -04:00
Alexander Duyck
cbc53e08a7 GSO: Add GSO type for fixed IPv4 ID
This patch adds support for TSO using IPv4 headers with a fixed IP ID
field.  This is meant to allow us to do a lossless GRO in the case of TCP
flows that use a fixed IP ID such as those that convert IPv6 header to IPv4
headers.

In addition I am adding a feature that for now I am referring to TSO with
IP ID mangling.  Basically when this flag is enabled the device has the
option to either output the flow with incrementing IP IDs or with a fixed
IP ID regardless of what the original IP ID ordering was.  This is useful
in cases where the DF bit is set and we do not care if the original IP ID
value is maintained.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 16:23:40 -04:00
Eric Dumazet
743b03a832 net: remove netdevice gso_min_segs
After introduction of ndo_features_check(), we believe that very
specific checks for rare features should not be done in core
networking stack.

No driver uses gso_min_segs yet, so we revert this feature and save
few instructions per tx packet in fast path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14 00:37:08 -04:00
David S. Miller
ae95d71261 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2016-04-09 17:41:41 -04:00
Alexander Duyck
a0ca153f98 GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU
This patch fixes an issue I found in which we were dropping frames if we
had enabled checksums on GRE headers that were encapsulated by either FOU
or GUE.  Without this patch I was barely able to get 1 Gb/s of throughput.
With this patch applied I am now at least getting around 6 Gb/s.

The issue is due to the fact that with FOU or GUE applied we do not provide
a transport offset pointing to the GRE header, nor do we offload it in
software as the GRE header is completely skipped by GSO and treated like a
VXLAN or GENEVE type header.  As such we need to prevent the stack from
generating it and also prevent GRE from generating it via any interface we
create.

Fixes: c3483384ee ("gro: Allow tunnel stacking in the case of FOU/GUE")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07 16:56:33 -04:00
Aaron Conole
4da46cebbd net/core/dev: Warn on a too-short GRO frame
When signaling that a GRO frame is ready to be processed, the network stack
correctly checks length and aborts processing when a frame is less than 14
bytes. However, such a condition is really indicative of a broken driver,
and should be loudly signaled, rather than silently dropped as the case is
today.

Convert the condition to use net_warn_ratelimited() to ensure the stack
loudly complains about such broken drivers.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-05 19:58:39 -04:00
Luis de Bethencourt
ed49e65037 net: add description for len argument of dev_get_phys_port_name
When the function dev_get_phys_port_name was added it missed a description
for it's len argument. Adding it.

Fixes: db24a9044e ("net: add support for phys_port_name")
Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-21 13:28:31 -04:00
Jesse Gross
fac8e0f579 tunnels: Don't apply GRO to multiple layers of encapsulation.
When drivers express support for TSO of encapsulated packets, they
only mean that they can do it for one layer of encapsulation.
Supporting additional levels would mean updating, at a minimum,
more IP length fields and they are unaware of this.

No encapsulation device expresses support for handling offloaded
encapsulated packets, so we won't generate these types of frames
in the transmit path. However, GRO doesn't have a check for
multiple levels of encapsulation and will attempt to build them.

UDP tunnel GRO actually does prevent this situation but it only
handles multiple UDP tunnels stacked on top of each other. This
generalizes that solution to prevent any kind of tunnel stacking
that would cause problems.

Fixes: bf5a755f ("net-gre-gro: Add GRE support to the GRO stack")
Signed-off-by: Jesse Gross <jesse@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-20 16:33:40 -04:00
David S. Miller
b633353115 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/phy/bcm7xxx.c
	drivers/net/phy/marvell.c
	drivers/net/vxlan.c

All three conflicts were cases of simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-23 00:09:14 -05:00
Nikolay Aleksandrov
cfdd28beb3 net: make netdev_for_each_lower_dev safe for device removal
When I used netdev_for_each_lower_dev in commit bad5316232 ("vrf:
remove slave queue and private slave struct") I thought that it acts
like netdev_for_each_lower_private and can be used to remove the current
device from the list while walking, but unfortunately it acts more like
netdev_for_each_lower_private_rcu and doesn't allow it. The difference
is where the "iter" points to, right now it points to the current element
and that makes it impossible to remove it. Change the logic to be
similar to netdev_for_each_lower_private and make it point to the "next"
element so we can safely delete the current one. VRF is the only such
user right now, there's no change for the read-only users.

Here's what can happen now:
[98423.249858] general protection fault: 0000 [#1] SMP
[98423.250175] Modules linked in: vrf bridge(O) stp llc nfsd auth_rpcgss
oid_registry nfs_acl nfs lockd grace sunrpc crct10dif_pclmul
crc32_pclmul crc32c_intel ghash_clmulni_intel jitterentropy_rng
sha256_generic hmac drbg ppdev aesni_intel aes_x86_64 glue_helper lrw
gf128mul ablk_helper cryptd evdev serio_raw pcspkr virtio_balloon
parport_pc parport i2c_piix4 i2c_core virtio_console acpi_cpufreq button
9pnet_virtio 9p 9pnet fscache ipv6 autofs4 ext4 crc16 mbcache jbd2 sg
virtio_blk virtio_net sr_mod cdrom e1000 ata_generic ehci_pci uhci_hcd
ehci_hcd usbcore usb_common virtio_pci ata_piix libata floppy
virtio_ring virtio scsi_mod [last unloaded: bridge]
[98423.255040] CPU: 1 PID: 14173 Comm: ip Tainted: G           O
4.5.0-rc2+ #81
[98423.255386] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.8.1-20150318_183358- 04/01/2014
[98423.255777] task: ffff8800547f5540 ti: ffff88003428c000 task.ti:
ffff88003428c000
[98423.256123] RIP: 0010:[<ffffffff81514f3e>]  [<ffffffff81514f3e>]
netdev_lower_get_next+0x1e/0x30
[98423.256534] RSP: 0018:ffff88003428f940  EFLAGS: 00010207
[98423.256766] RAX: 0002000100000004 RBX: ffff880054ff9000 RCX:
0000000000000000
[98423.257039] RDX: ffff88003428f8b8 RSI: ffff88003428f950 RDI:
ffff880054ff90c0
[98423.257287] RBP: ffff88003428f940 R08: 0000000000000000 R09:
0000000000000000
[98423.257537] R10: 0000000000000001 R11: 0000000000000000 R12:
ffff88003428f9e0
[98423.257802] R13: ffff880054a5fd00 R14: ffff88003428f970 R15:
0000000000000001
[98423.258055] FS:  00007f3d76881700(0000) GS:ffff88005d000000(0000)
knlGS:0000000000000000
[98423.258418] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[98423.258650] CR2: 00007ffe5951ffa8 CR3: 0000000052077000 CR4:
00000000000406e0
[98423.258902] Stack:
[98423.259075]  ffff88003428f960 ffffffffa0442636 0002000100000004
ffff880054ff9000
[98423.259647]  ffff88003428f9b0 ffffffff81518205 ffff880054ff9000
ffff88003428f978
[98423.260208]  ffff88003428f978 ffff88003428f9e0 ffff88003428f9e0
ffff880035b35f00
[98423.260739] Call Trace:
[98423.260920]  [<ffffffffa0442636>] vrf_dev_uninit+0x76/0xa0 [vrf]
[98423.261156]  [<ffffffff81518205>]
rollback_registered_many+0x205/0x390
[98423.261401]  [<ffffffff815183ec>] unregister_netdevice_many+0x1c/0x70
[98423.261641]  [<ffffffff8153223c>] rtnl_delete_link+0x3c/0x50
[98423.271557]  [<ffffffff815335bb>] rtnl_dellink+0xcb/0x1d0
[98423.271800]  [<ffffffff811cd7da>] ? __inc_zone_state+0x4a/0x90
[98423.272049]  [<ffffffff815337b4>] rtnetlink_rcv_msg+0x84/0x200
[98423.272279]  [<ffffffff810cfe7d>] ? trace_hardirqs_on+0xd/0x10
[98423.272513]  [<ffffffff8153370b>] ? rtnetlink_rcv+0x1b/0x40
[98423.272755]  [<ffffffff81533730>] ? rtnetlink_rcv+0x40/0x40
[98423.272983]  [<ffffffff8155d6e7>] netlink_rcv_skb+0x97/0xb0
[98423.273209]  [<ffffffff8153371a>] rtnetlink_rcv+0x2a/0x40
[98423.273476]  [<ffffffff8155ce8b>] netlink_unicast+0x11b/0x1a0
[98423.273710]  [<ffffffff8155d2f1>] netlink_sendmsg+0x3e1/0x610
[98423.273947]  [<ffffffff814fbc98>] sock_sendmsg+0x38/0x70
[98423.274175]  [<ffffffff814fc253>] ___sys_sendmsg+0x2e3/0x2f0
[98423.274416]  [<ffffffff810d841e>] ? do_raw_spin_unlock+0xbe/0x140
[98423.274658]  [<ffffffff811e1bec>] ? handle_mm_fault+0x26c/0x2210
[98423.274894]  [<ffffffff811e19cd>] ? handle_mm_fault+0x4d/0x2210
[98423.275130]  [<ffffffff81269611>] ? __fget_light+0x91/0xb0
[98423.275365]  [<ffffffff814fcd42>] __sys_sendmsg+0x42/0x80
[98423.275595]  [<ffffffff814fcd92>] SyS_sendmsg+0x12/0x20
[98423.275827]  [<ffffffff81611bb6>] entry_SYSCALL_64_fastpath+0x16/0x7a
[98423.276073] Code: c3 31 c0 5d c3 0f 1f 84 00 00 00 00 00 66 66 66 66
90 48 8b 06 55 48 81 c7 c0 00 00 00 48 89 e5 48 8b 00 48 39 f8 74 09 48
89 06 <48> 8b 40 e8 5d c3 31 c0 5d c3 0f 1f 84 00 00 00 00 00 66 66 66
[98423.279639] RIP  [<ffffffff81514f3e>] netdev_lower_get_next+0x1e/0x30
[98423.279920]  RSP <ffff88003428f940>

CC: David Ahern <dsa@cumulusnetworks.com>
CC: David S. Miller <davem@davemloft.net>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Vlad Yasevich <vyasevic@redhat.com>
Fixes: bad5316232 ("vrf: remove slave queue and private slave struct")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-19 15:29:26 -05:00
Phil Sutter
a813104d92 IFF_NO_QUEUE: Fix for drivers not calling ether_setup()
My implementation around IFF_NO_QUEUE driver flag assumed that leaving
tx_queue_len untouched (specifically: not setting it to zero) by drivers
would make it possible to assign a regular qdisc to them without having
to worry about setting tx_queue_len to a useful value. This was only
partially true: I overlooked that some drivers don't call ether_setup()
and therefore not initialize tx_queue_len to the default value of 1000.
Consequently, removing the workarounds in place for that case in qdisc
implementations which cared about it (namely, pfifo, bfifo, gred, htb,
plug and sfb) leads to problems with these specific interface types and
qdiscs.

Luckily, there's already a sanitization point for drivers setting
tx_queue_len to zero, which can be reused to assign the fallback value
most qdisc implementations used, which is 1.

Fixes: 348e3435cb ("net: sched: drop all special handling of tx_queue_len == 0")
Tested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 14:56:53 -05:00
Jesper Dangaard Brouer
15fad714be net: bulk free SKBs that were delay free'ed due to IRQ context
The network stack defers SKBs free, in-case free happens in IRQ or
when IRQs are disabled. This happens in __dev_kfree_skb_irq() that
writes SKBs that were free'ed during IRQ to the softirq completion
queue (softnet_data.completion_queue).

These SKBs are naturally delayed, and cleaned up during NET_TX_SOFTIRQ
in function net_tx_action().  Take advantage of this a use the skb
defer and flush API, as we are already in softirq context.

For modern drivers this rarely happens. Although most drivers do call
dev_kfree_skb_any(), which detects the situation and calls
__dev_kfree_skb_irq() when needed.  This due to netpoll can call from
IRQ context.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 11:59:09 -05:00
Jesper Dangaard Brouer
795bb1c00d net: bulk free infrastructure for NAPI context, use napi_consume_skb
Discovered that network stack were hitting the kmem_cache/SLUB
slowpath when freeing SKBs.  Doing bulk free with kmem_cache_free_bulk
can speedup this slowpath.

NAPI context is a bit special, lets take advantage of that for bulk
free'ing SKBs.

In NAPI context we are running in softirq, which gives us certain
protection.  A softirq can run on several CPUs at once.  BUT the
important part is a softirq will never preempt another softirq running
on the same CPU.  This gives us the opportunity to access per-cpu
variables in softirq context.

Extend napi_alloc_cache (before only contained page_frag_cache) to be
a struct with a small array based stack for holding SKBs.  Introduce a
SKB defer and flush API for accessing this.

Introduce napi_consume_skb() as replacement for e.g. dev_consume_skb_any()
when running in NAPI context.  A small trick to handle/detect if we
are called from netpoll is to see if budget is 0.  In that case, we
need to invoke dev_consume_skb_irq().

Joint work with Alexander Duyck.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11 11:59:09 -05:00
Jarod Wilson
6e7333d315 net: add rx_nohandler stat counter
This adds an rx_nohandler stat counter, along with a sysfs statistics
node, and copies the counter out via netlink as well.

CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jiri Pirko <jiri@mellanox.com>
CC: Daniel Borkmann <daniel@iogearbox.net>
CC: Tom Herbert <tom@herbertland.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 02:59:51 -05:00
Jarod Wilson
9256645af0 net/core: relax BUILD_BUG_ON in netdev_stats_to_stats64
The netdev_stats_to_stats64 function copies the deprecated
net_device_stats format stats into rtnl_link_stats64 for legacy support
purposes, but with the BUILD_BUG_ON as it was, it wasn't possible to
extend rtnl_link_stats64 without also extending net_device_stats. Relax
the BUILD_BUG_ON to only require that rtnl_link_stats64 is larger, and
zero out all the stat counters that aren't present in net_device_stats.

CC: Eric Dumazet <edumazet@google.com>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-06 02:59:50 -05:00
Jesse Gross
ce87fc6ce3 gro: Make GRO aware of lightweight tunnels.
GRO is currently not aware of tunnel metadata generated by lightweight
tunnels and stored in the dst. This leads to two possible problems:
 * Incorrectly merging two frames that have different metadata.
 * Leaking of allocated metadata from merged frames.

This avoids those problems by comparing the tunnel information before
merging, similar to how we handle other metadata (such as vlan tags),
and releasing any state when we are done.

Reported-by: John <john.phillips5@hpe.com>
Fixes: 2e15ea39 ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: Jesse Gross <jesse@kernel.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-20 18:48:38 -08:00
Konstantin Khlebnikov
9207f9d45b net: preserve IP control block during GSO segmentation
Skb_gso_segment() uses skb control block during segmentation.
This patch adds 32-bytes room for previous control block which
will be copied into all resulting segments.

This patch fixes kernel crash during fragmenting forwarded packets.
Fragmentation requires valid IP CB in skb for clearing ip options.
Also patch removes custom save/restore in ovs code, now it's redundant.

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Link: http://lkml.kernel.org/r/CALYGNiP-0MZ-FExV2HutTvE9U-QQtkKSoE--KN=JQE5STYsjAA@mail.gmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-15 14:35:24 -05:00
Daniel Borkmann
1f211a1b92 net, sched: add clsact qdisc
This work adds a generalization of the ingress qdisc as a qdisc holding
only classifiers. The clsact qdisc works on ingress, but also on egress.
In both cases, it's execution happens without taking the qdisc lock, and
the main difference for the egress part compared to prior version of [1]
is that this can be applied with _any_ underlying real egress qdisc (also
classless ones).

Besides solving the use-case of [1], that is, allowing for more programmability
on assigning skb->priority for the mqprio case that is supported by most
popular 10G+ NICs, it also opens up a lot more flexibility for other tc
applications. The main work on classification can already be done at clsact
egress time if the use-case allows and state stored for later retrieval
f.e. again in skb->priority with major/minors (which is checked by most
classful qdiscs before consulting tc_classify()) and/or in other skb fields
like skb->tc_index for some light-weight post-processing to get to the
eventual classid in case of a classful qdisc. Another use case is that
the clsact egress part allows to have a central egress counterpart to
the ingress classifiers, so that classifiers can easily share state (e.g.
in cls_bpf via eBPF maps) for ingress and egress.

Currently, default setups like mq + pfifo_fast would require for this to
use, for example, prio qdisc instead (to get a tc_classify() run) and to
duplicate the egress classifier for each queue. With clsact, it allows
for leaving the setup as is, it can additionally assign skb->priority to
put the skb in one of pfifo_fast's bands and it can share state with maps.
Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
w/o the need to perform a skb_dst_force() to hold on to it any longer. In
lwt case, we can also use this facility to setup dst metadata via cls_bpf
(bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
that (case of IFF_NO_QUEUE devices, for example).

The realization can be done without any changes to the scheduler core
framework. All it takes is that we have two a-priori defined minors/child
classes, where we can mux between ingress and egress classifier list
(dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
dev->_tx to avoid extra cacheline miss for moderate loads). The egress
part is a bit similar modelled to handle_ing() and patched to a noop in
case the functionality is not used. Both handlers are now called
sch_handle_ingress() and sch_handle_egress(), code sharing among the two
doesn't seem practical as there are various minor differences in both
paths, so that making them conditional in a single handler would rather
slow things down.

Full compatibility to ingress qdisc is provided as well. Since both
piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
per netdevice, and thus ingress qdisc specific behaviour can be retained
for user space. This means, either a user does 'tc qdisc add dev foo ingress'
and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
alternative, where both, ingress and egress classifier can be configured
as in the below example. ingress qdisc supports attaching classifier to any
minor number whereas clsact has two fixed minors for muxing between the
lists, therefore to not break user space setups, they are better done as
two separate qdiscs.

I decided to extend the sch_ingress module with clsact functionality so
that commonly used code can be reused, the module is being aliased with
sch_clsact so that it can be auto-loaded properly. Alternative would have been
to add a flag when initializing ingress to alter its behaviour plus aliasing
to a different name (as it's more than just ingress). However, the first would
end up, based on the flag, choosing the new/old behaviour by calling different
function implementations to handle each anyway, the latter would require to
register ingress qdisc once again under different alias. So, this really begs
to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
by its own that share callbacks used by both.

Example, adding qdisc:

   # tc qdisc add dev foo clsact
   # tc qdisc show dev foo
   qdisc mq 0: root
   qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc clsact ffff: parent ffff:fff1

Adding filters (deleting, etc works analogous by specifying ingress/egress):

   # tc filter add dev foo ingress bpf da obj bar.o sec ingress
   # tc filter add dev foo egress  bpf da obj bar.o sec egress
   # tc filter show dev foo ingress
   filter protocol all pref 49152 bpf
   filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
   # tc filter show dev foo egress
   filter protocol all pref 49152 bpf
   filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action

A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
show an empty list for clsact. Either using the parent names (ingress/egress)
or specifying the full major/minor will then show the related filter lists.

Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.

  [1] http://patchwork.ozlabs.org/patch/512949/

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 22:13:15 -05:00
Tom Herbert
6ae23ad362 net: Add driver helper functions to determine checksum offloadability
Add skb_csum_offload_chk driver helper function to determine if a
device with limited checksum offload capabilities is able to offload the
checksum for a given packet.

This patch includes:
  - The skb_csum_offload_chk function. Returns true if checksum is
    offloadable, else false. Optionally, in the case that the checksum
    is not offloable, the function can call skb_checksum_help to resolve
    the checksum. skb_csum_offload_chk also returns whether the checksum
    refers to an encapsulated checksum.
  - Definition of skb_csum_offl_spec structure that caller uses to
    indicate rules about what it can offload (e.g. IPv4/v6, TCP/UDP only,
    whether encapsulated checksums can be offloaded, whether checksum with
    IPv6 extension headers can be offloaded).
  - Ancilary functions called skb_csum_offload_chk_help,
    skb_csum_off_chk_help_cmn, skb_csum_off_chk_help_cmn_v4_only.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 16:50:21 -05:00
Tom Herbert
c8cd0989bd net: Eliminate NETIF_F_GEN_CSUM and NETIF_F_V[46]_CSUM
These netif flags are unnecessary convolutions. It is more
straightforward to just use NETIF_F_HW_CSUM, NETIF_F_IP_CSUM,
and NETIF_F_IPV6_CSUM directly.

This patch also:
    - Cleans up can_checksum_protocol
    - Simplifies netdev_intersect_features

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 16:50:20 -05:00
Tom Herbert
a188222b6e net: Rename NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK
The name NETIF_F_ALL_CSUM is a misnomer. This does not correspond to the
set of features for offloading all checksums. This is a mask of the
checksum offload related features bits. It is incorrect to set both
NETIF_F_HW_CSUM and NETIF_F_IP_CSUM or NETIF_F_IPV6 at the same time for
features of a device.

This patch:
  - Changes instances of NETIF_F_ALL_CSUM to NETIF_F_CSUM_MASK (where
    NETIF_F_ALL_CSUM is being used as a mask).
  - Changes bonding, sfc/efx, ipvlan, macvlan, vlan, and team drivers to
    use NEITF_F_HW_CSUM in features list instead of NETIF_F_ALL_CSUM.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-15 16:50:08 -05:00
Tejun Heo
2a56a1fec2 net: wrap sock->sk_cgrp_prioidx and ->sk_classid inside a struct
Introduce sock->sk_cgrp_data which is a struct sock_cgroup_data.
->sk_cgroup_prioidx and ->sk_classid are moved into it.  The struct
and its accessors are defined in cgroup-defs.h.  This is to prepare
for overloading the fields with a cgroup pointer.

This patch mostly performs equivalent conversions but the followings
are noteworthy.

* Equality test before updating classid is removed from
  sock_update_classid().  This shouldn't make any noticeable
  difference and a similar test will be implemented on the helper side
  later.

* sock_update_netprioidx() now takes struct sock_cgroup_data and can
  be moved to netprio_cgroup.h without causing include dependency
  loop.  Moved.

* The dummy version of sock_update_netprioidx() converted to a static
  inline function while at it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-08 22:02:33 -05:00
Jiri Pirko
b618aaa91b net: constify netif_is_* helpers net_device param
As suggested by Eric, these helpers should have const dev param.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-05 18:16:27 -05:00
Jiri Pirko
04d482660a net: introduce change lower state notifier
When lower device like bonding slave, team/bridge port, etc changes its
state, it is useful for others to notice this change. Currently this is
implemented specificly for bonding as NETDEV_BONDING_INFO notifier. This
patch aims to replace this specific usage and make this more generic to
be used for all upper-lower devices.

Introduce NETDEV_CHANGELOWERSTATE netdev notifier type and
netdev_lower_state_changed() helper.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03 11:49:26 -05:00
Jiri Pirko
29bf24afb2 net: add possibility to pass information about upper device via notifier
Sometimes the drivers and other code would find it handy to know some
internal information about upper device being changed. So allow upper-code
to pass information down to notifier listeners during linking.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03 11:49:25 -05:00
Jiri Pirko
6dffb0447c net: propagate upper priv via netdev_master_upper_dev_link
Eliminate netdev_master_upper_dev_link_private and pass priv directly as
a parameter of netdev_master_upper_dev_link.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03 11:49:25 -05:00
Ido Schimmel
b03804e7c3 net: Check CHANGEUPPER notifier return value
switchdev drivers reflect the newly requested topology to hardware when
CHANGEUPPER is received, after software links were already formed.
However, the operation can fail and user will not be notified, as the
return value of the notifier is not checked.

Add this check and rollback software links if necessary.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-03 11:49:23 -05:00
Eric Dumazet
e2f9dc3bd2 net: avoid NULL deref in napi_get_frags()
napi_alloc_skb() can return NULL.
We should not crash should this happen.

Fixes: 93f93a4404 ("net: move skb_mark_napi_id() into core networking stack")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20 16:43:14 -05:00
Eric Dumazet
93d05d4a32 net: provide generic busy polling to all NAPI drivers
NAPI drivers no longer need to observe a particular protocol
to benefit from busy polling (CONFIG_NET_RX_BUSY_POLL=y)

napi_hash_add() and napi_hash_del() are automatically called
from core networking stack, respectively from
netif_napi_add() and netif_napi_del()

This patch depends on free_netdev() and netif_napi_del() being
called from process context, which seems to be the norm.

Drivers might still prefer to call napi_hash_del() on their
own, since they might combine all the rcu grace periods into
a single one, knowing their NAPI structures lifetime, while
core networking stack has no idea of a possible combining.

Once this patch proves to not bring serious regressions,
we will cleanup drivers to either remove napi_hash_del()
or provide appropriate rcu grace periods combining.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-18 16:17:42 -05:00
Eric Dumazet
34cbe27e81 net: napi_hash_del() returns a boolean status
napi_hash_del() will soon be used from both drivers (if they want)
or core networking stack.

Callers are responsibles to ensure an RCU grace period is respected
before freeing napi structure : napi_hash_del() can signal if
this RCU grace period is needed or not.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-18 16:17:42 -05:00
Eric Dumazet
6180d9de61 net: move napi_hash[] into read mostly section
We do not often add/delete a napi context.
Moving napi_hash[] into read mostly section avoids potential false sharing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-18 16:17:42 -05:00
Eric Dumazet
d64b5e85bf net: add netif_tx_napi_add()
netif_tx_napi_add() is a variant of netif_napi_add()

It should be used by drivers that use a napi structure
to exclusively poll TX.

We do not want to add this kind of napi in napi_hash[] in following
patches, adding generic busy polling to all NAPI drivers.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-18 16:17:41 -05:00
Eric Dumazet
93f93a4404 net: move skb_mark_napi_id() into core networking stack
We would like to automatically provide busy polling support
to all NAPI drivers, without them having to implement anything.

skb_mark_napi_id() can be called from napi_gro_receive() and
napi_get_frags().

Few drivers are still calling skb_mark_napi_id() because
they use netif_receive_skb(). They should eventually call
napi_gro_receive() instead. I will leave this to drivers
maintainers.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-18 16:17:41 -05:00
Eric Dumazet
ce6aea93f7 net: network drivers no longer need to implement ndo_busy_poll()
Instead of having to implement complex ndo_busy_poll() method,
drivers can simply rely on NAPI poll logic.

Busy polling gains are mainly coming from polling itself,
not on exact details on how we poll the device.

ndo_busy_poll() if implemented can avoid touching
napi state, but it adds extra synchronization between
normal napi->poll() and busy poll handler, slowing down
the common path (non busy polling) with extra atomic operations.
In practice few drivers ever got busy poll because of the complexity.

We could go one step further, and make busy polling
available for all NAPI drivers, but this would require
that all netif_napi_del() calls are done in process context
so that we can call synchronize_rcu().
Full audit would be required.

Before this is done, a driver still needs to call :

- skb_mark_napi_id() for each skb provided to the stack.
- napi_hash_add() and napi_hash_del() to allocate a napi_id per napi struct.
- Make sure RCU grace period is respected after napi_hash_del() before
  memory containing napi structure is freed.

Followup patch implements busy poll for mlx5 driver as an example.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-18 16:17:39 -05:00
Eric Dumazet
2a028ecb76 net: allow BH servicing in sk_busy_loop()
Instead of blocking BH in whole sk_busy_loop(), block them
only around ->ndo_busy_poll() calls.

This has many benefits.

1) allow tunneled traffic to use busy poll as well as native traffic.
   Tunnels handlers usually call netif_rx() and depend on net_rx_action()
   being run (from sofirq handler)

2) allow RFS/RPS being used (sending IPI to other cpus if needed)

3) use the 'lets burn cpu cycles' budget to do useful work
   (like TX completions, timers, RCU callbacks...)

4) reduce BH latencies, making busy poll a better citizen.

Tested:

Tested with SIT tunnel

lpaa5:~# echo 0 >/proc/sys/net/core/busy_read
lpaa5:~# ./netperf -H 2002:af6:786::1 -t TCP_RR
MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:786::1 () port 0 AF_INET6 : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    37373.93
16384  87380

Now enable busy poll on both hosts

lpaa5:~# echo 70 >/proc/sys/net/core/busy_read
lpaa6:~# echo 70 >/proc/sys/net/core/busy_read

lpaa5:~# ./netperf -H 2002:af6:786::1 -t TCP_RR
MIGRATED TCP REQUEST/RESPONSE TEST from ::0 (::) port 0 AF_INET6 to 2002:af6:786::1 () port 0 AF_INET6 : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    58314.77
16384  87380

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-18 16:17:38 -05:00
Eric Dumazet
02d62e86fe net: un-inline sk_busy_loop()
There is really little gain from inlining this big function.
We'll soon make it even bigger in following patches.

This means we no longer need to export napi_by_id()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-18 16:17:38 -05:00
Eric Dumazet
52bd2d62ce net: better skb->sender_cpu and skb->napi_id cohabitation
skb->sender_cpu and skb->napi_id share a common storage,
and we had various bugs about this.

We had to call skb_sender_cpu_clear() in some places to
not leave a prior skb->napi_id and fool netdev_pick_tx()

As suggested by Alexei, we could split the space so that
these errors can not happen.

0 value being reserved as the common (not initialized) value,
let's reserve [1 .. NR_CPUS] range for valid sender_cpu,
and [NR_CPUS+1 .. ~0U] for valid napi_id.

This will allow proper busy polling support over tunnels.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-18 16:17:37 -05:00
Nikolay Aleksandrov
17b85d29e8 net/core: revert "net: fix __netdev_update_features return.." and add comment
This reverts commit 00ee592717 ("net: fix __netdev_update_features return
on ndo_set_features failure")
and adds a comment explaining why it's okay to return a value other than
0 upon error. Some drivers might actually change flags and return an
error so it's better to fire a spurious notification rather than miss
these.

CC: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-17 15:25:45 -05:00
Bjørn Mork
88ad4175b2 net/core: use netdev name in warning if no parent
A recent flaw in the netdev feature setting resulted in warnings
like this one from VLAN interfaces:

 WARNING: CPU: 1 PID: 4975 at net/core/dev.c:2419 skb_warn_bad_offload+0xbc/0xcb()
 : caps=(0x00000000001b5820, 0x00000000001b5829) len=2782 data_len=0 gso_size=1348 gso_type=16 ip_summed=3

The ":" is supposed to be preceded by a driver name, but in this
case it is an empty string since the device has no parent.

There are many types of network devices without a parent. The
anonymous warnings for these devices can be hard to debug.  Log
the network device name instead in these cases to assist further
debugging.

This is mostly similar to how __netdev_printk() handles orphan
devices.

Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-16 16:21:48 -05:00
Nikolay Aleksandrov
00ee592717 net: fix __netdev_update_features return on ndo_set_features failure
If ndo_set_features fails __netdev_update_features() will return -1 but
this is wrong because it is expected to return 0 if no features were
changed (see netdev_update_features()), which will cause a netdev
notifier to be called without any actual changes. Fix this by returning
0 if ndo_set_features fails.

Fixes: 6cb6a27c45 ("net: Call netdev_features_change() from netdev_update_features()")
CC: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-16 14:56:03 -05:00
Nikolay Aleksandrov
5f8dc33e8e net: fix feature changes on devices without ndo_set_features
When __netdev_update_features() was updated to ensure some features are
disabled on new lower devices, an error was introduced for devices which
don't have the ndo_set_features() method set. Before we'll just set the
new features, but now we return an error and don't set them. Fix this by
returning the old behaviour and setting err to 0 when ndo_set_features
is not present.

Fixes: e7868a85e1 ("net/core: ensure features get disabled on new lower devs")
CC: Jarod Wilson <jarod@redhat.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Ido Schimmel <idosch@mellanox.com>
CC: Sander Eikelenboom <linux@eikelenboom.it>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Reviewed-by: Jarod Wilson <jarod@redhat.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Dave Young <dyoung@redhat.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-16 14:56:03 -05:00
Jarod Wilson
e7868a85e1 net/core: ensure features get disabled on new lower devs
With moving netdev_sync_lower_features() after the .ndo_set_features
calls, I neglected to verify that devices added *after* a flag had been
disabled on an upper device were properly added with that flag disabled as
well. This currently happens, because we exit __netdev_update_features()
when we see dev->features == features for the upper dev. We can retain the
optimization of leaving without calling .ndo_set_features with a bit of
tweaking and a goto here.

Fixes: fd867d51f8 ("net/core: generic support for disabling netdev features down stack")
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Nikolay Aleksandrov <razor@blackwall.org>
CC: Michal Kubecek <mkubecek@suse.cz>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: netdev@vger.kernel.org
Reported-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-04 21:56:00 -05:00
Jarod Wilson
5ba3f7d61a net/core: fix for_each_netdev_feature
As pointed out by Nikolay and further explained by Geert, the initial
for_each_netdev_feature macro was broken, as feature would get set outside
of the block of code it was intended to run in, thus only ever working for
the first feature bit in the mask. While less pretty this way, this is
tested and confirmed functional with multiple feature bits set in
NETIF_F_UPPER_DISABLES.

[root@dell-per730-01 ~]# ethtool -K bond0 lro off
...
[  242.761394] bond0: Disabling feature 0x0000000000008000 on lower dev p5p2.
[  243.552178] bnx2x 0000:06:00.1 p5p2: using MSI-X  IRQs: sp 74  fp[0] 76 ... fp[7] 83
[  244.353978] bond0: Disabling feature 0x0000000000008000 on lower dev p5p1.
[  245.147420] bnx2x 0000:06:00.0 p5p1: using MSI-X  IRQs: sp 62  fp[0] 64 ... fp[7] 71

[root@dell-per730-01 ~]# ethtool -K bond0 gro off
...
[  251.925645] bond0: Disabling feature 0x0000000000004000 on lower dev p5p2.
[  252.713693] bnx2x 0000:06:00.1 p5p2: using MSI-X  IRQs: sp 74  fp[0] 76 ... fp[7] 83
[  253.499085] bond0: Disabling feature 0x0000000000004000 on lower dev p5p1.
[  254.290922] bnx2x 0000:06:00.0 p5p1: using MSI-X  IRQs: sp 62  fp[0] 64 ... fp[7] 71

Fixes: fd867d51f ("net/core: generic support for disabling netdev features down stack")
CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Nikolay Aleksandrov <razor@blackwall.org>
CC: Michal Kubecek <mkubecek@suse.cz>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Geert Uytterhoeven <geert@linux-m68k.org>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-03 11:29:57 -05:00
Jarod Wilson
fd867d51f8 net/core: generic support for disabling netdev features down stack
There are some netdev features, which when disabled on an upper device,
such as a bonding master or a bridge, must be disabled and cannot be
re-enabled on underlying devices.

This is a rework of an earlier more heavy-handed appraoch, which simply
disables and prevents re-enabling of netdev features listed in a new
define in include/net/netdev_features.h, NETIF_F_UPPER_DISABLES. Any upper
device that disables a flag in that feature mask, the disabling will
propagate down the stack, and any lower device that has any upper device
with one of those flags disabled should not be able to enable said flag.

Initially, only LRO is included for proof of concept, and because this
code effectively does the same thing as dev_disable_lro(), though it will
also activate from the ethtool path, which was one of the goals here.

[root@dell-per730-01 ~]# ethtool -k bond0 |grep large
large-receive-offload: on
[root@dell-per730-01 ~]# ethtool -k p5p1 |grep large
large-receive-offload: on
[root@dell-per730-01 ~]# ethtool -K bond0 lro off
[root@dell-per730-01 ~]# ethtool -k bond0 |grep large
large-receive-offload: off
[root@dell-per730-01 ~]# ethtool -k p5p1 |grep large
large-receive-offload: off

dmesg dump:

[ 1033.277986] bond0: Disabling feature 0x0000000000008000 on lower dev p5p2.
[ 1034.067949] bnx2x 0000:06:00.1 p5p2: using MSI-X  IRQs: sp 74  fp[0] 76 ... fp[7] 83
[ 1034.753612] bond0: Disabling feature 0x0000000000008000 on lower dev p5p1.
[ 1035.591019] bnx2x 0000:06:00.0 p5p1: using MSI-X  IRQs: sp 62  fp[0] 64 ... fp[7] 71

This has been successfully tested with bnx2x, qlcnic and netxen network
cards as slaves in a bond interface. Turning LRO on or off on the master
also turns it on or off on each of the slaves, new slaves are added with
LRO in the same state as the master, and LRO can't be toggled on the
slaves.

Also, this should largely remove the need for dev_disable_lro(), and most,
if not all, of its call sites can be replaced by simply making sure
NETIF_F_LRO isn't included in the relevant device's feature flags.

Note that this patch is driven by bug reports from users saying it was
confusing that bonds and slaves had different settings for the same
features, and while it won't be 100% in sync if a lower device doesn't
support a feature like LRO, I think this is a good step in the right
direction.

CC: "David S. Miller" <davem@davemloft.net>
CC: Eric Dumazet <edumazet@google.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Nikolay Aleksandrov <razor@blackwall.org>
CC: Michal Kubecek <mkubecek@suse.cz>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 23:41:31 -05:00
David S. Miller
ba3e2084f2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv6/xfrm6_output.c
	net/openvswitch/flow_netlink.c
	net/openvswitch/vport-gre.c
	net/openvswitch/vport-vxlan.c
	net/openvswitch/vport.c
	net/openvswitch/vport.h

The openvswitch conflicts were overlapping changes.  One was
the egress tunnel info fix in 'net' and the other was the
vport ->send() op simplification in 'net-next'.

The xfrm6_output.c conflicts was also a simplification
overlapping a bug fix.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-24 06:54:12 -07:00
Pravin B Shelar
fc4099f172 openvswitch: Fix egress tunnel info.
While transitioning to netdev based vport we broke OVS
feature which allows user to retrieve tunnel packet egress
information for lwtunnel devices.  Following patch fixes it
by introducing ndo operation to get the tunnel egress info.
Same ndo operation can be used for lwtunnel devices and compat
ovs-tnl-vport devices. So after adding such device operation
we can remove similar operation from ovs-vport.

Fixes: 614732eaa1 ("openvswitch: Use regular VXLAN net_device device").
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 19:39:25 -07:00
Jiri Pirko
573c7ba006 net: introduce pre-change upper device notifier
This newly introduced netdevice notifier is called before actual change
upper happens. That provides a possibility for notifier handlers to
know upper change will happen and react to it, including possibility to
forbid the change. That is valuable for drivers which can check if the
upper device linkage is supported and forbid that in case it is not.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-16 07:15:05 -07:00
Eric Dumazet
004a5d0140 net: use sk_fullsock() in __netdev_pick_tx()
SYN_RECV & TIMEWAIT sockets are not full blown, they do not have a
sk_dst_cache pointer.

Fixes: ca6fb06518 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-05 02:45:25 -07:00
David S. Miller
4963ed48f2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv4/arp.c

The net/ipv4/arp.c conflict was one commit adding a new
local variable while another commit was deleting one.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-26 16:08:27 -07:00
Michal Kubeček
6ea29da1d0 net: remove unused argument of __netdev_find_adj()
The __netdev_find_adj() helper does not use its first argument, only the
device to find and list to walk through.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 12:35:15 -07:00
Neil Horman
2d8bff1269 netpoll: Close race condition between poll_one_napi and napi_disable
Drivers might call napi_disable while not holding the napi instance poll_lock.
In those instances, its possible for a race condition to exist between
poll_one_napi and napi_disable.  That is to say, poll_one_napi only tests the
NAPI_STATE_SCHED bit to see if there is work to do during a poll, and as such
the following may happen:

CPU0				CPU1
ndo_tx_timeout			napi_poll_dev
 napi_disable			 poll_one_napi
  test_and_set_bit (ret 0)
				  test_bit (ret 1)
   reset adapter		   napi_poll_routine

If the adapter gets a tx timeout without a napi instance scheduled, its possible
for the adapter to think it has exclusive access to the hardware  (as the napi
instance is now scheduled via the napi_disable call), while the netpoll code
thinks there is simply work to do.  The result is parallel hardware access
leading to corrupt data structures in the driver, and a crash.

Additionaly, there is another, more critical race between netpoll and
napi_disable.  The disabled napi state is actually identical to the scheduled
state for a given napi instance.  The implication being that, if a napi instance
is disabled, a netconsole instance would see the napi state of the device as
having been scheduled, and poll it, likely while the driver was dong something
requiring exclusive access.  In the case above, its fairly clear that not having
the rings in a state ready to be polled will cause any number of crashes.

The fix should be pretty easy.  netpoll uses its own bit to indicate that that
the napi instance is in a state of being serviced by netpoll (NAPI_STATE_NPSVC).
We can just gate disabling on that bit as well as the sched bit.  That should
prevent netpoll from conducting a napi poll if we convert its set bit to a
test_and_set_bit operation to provide mutual exclusion

Change notes:
V2)
	Remove a trailing whtiespace
	Resubmit with proper subject prefix

V3)
	Clean up spacing nits

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: jmaxwell@redhat.com
Tested-by: jmaxwell@redhat.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:32:50 -07:00
Alexei Starovoitov
27b29f6305 bpf: add bpf_redirect() helper
Existing bpf_clone_redirect() helper clones skb before redirecting
it to RX or TX of destination netdev.
Introduce bpf_redirect() helper that does that without cloning.

Benchmarked with two hosts using 10G ixgbe NICs.
One host is doing line rate pktgen.
Another host is configured as:
$ tc qdisc add dev $dev ingress
$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
   action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
so it receives the packet on $dev and immediately xmits it on $dev + 1
The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
that does bpf_clone_redirect() and performance is 2.0 Mpps

$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
   action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
which is using bpf_redirect() - 2.4 Mpps

and using cls_bpf with integrated actions as:
$ tc filter add dev $dev root pref 10 \
  bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
performance is 2.5 Mpps

To summarize:
u32+act_bpf using clone_redirect - 2.0 Mpps
u32+act_bpf using redirect - 2.4 Mpps
cls_bpf using redirect - 2.5 Mpps

For comparison linux bridge in this setup is doing 2.1 Mpps
and ixgbe rx + drop in ip_rcv - 7.8 Mpps

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 21:09:07 -07:00
Eric W. Biederman
0c4b51f005 netfilter: Pass net into okfn
This is immediately motivated by the bridge code that chains functions that
call into netfilter.  Without passing net into the okfns the bridge code would
need to guess about the best expression for the network namespace to process
packets in.

As net is frequently one of the first things computed in continuation functions
after netfilter has done it's job passing in the desired network namespace is in
many cases a code simplification.

To support this change the function dst_output_okfn is introduced to
simplify passing dst_output as an okfn.  For the moment dst_output_okfn
just silently drops the struct net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:37 -07:00
Eric W. Biederman
04eb44890e bridge: Add br_netif_receive_skb remove netif_receive_skb_sk
netif_receive_skb_sk is only called once in the bridge code, replace
it with a bridge specific function that calls netif_receive_skb.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:37 -07:00
Eric W. Biederman
2b4aa3cec4 net: Remove dev_queue_xmit_sk
A function with weird arguments that it will never use to accomdate a
netfilter callback prototype is absolutely in the core of the
networking stack.  Frankly it does not make sense and it causes a lot
of confusion as to why arguments that are never used are being passed
to the function.

As I am preparing to make a second change to arguments to the okfn even
the names stops making sense.

As I have removed the two callers of this function remove this confusion
from the networking stack.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-17 17:18:35 -07:00
Phil Sutter
f84bb1eac0 net: fix IFF_NO_QUEUE for drivers using alloc_netdev
Printing a warning in alloc_netdev_mqs() if tx_queue_len is zero and
IFF_NO_QUEUE not set is not appropriate since drivers may use one of the
alloc_netdev* macros instead of alloc_etherdev*, thereby not
intentionally leaving tx_queue_len uninitialized. Instead check here if
tx_queue_len is zero and set IFF_NO_QUEUE, so the value of tx_queue_len
can be ignored in net/sched_generic.c.

Fixes: 906470c ("net: warn if drivers set tx_queue_len = 0")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 17:14:29 -07:00
Jiri Pirko
0e4ead9d7b net: introduce change upper device notifier change info
Add info that is passed along with NETDEV_CHANGEUPPER event.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 16:28:34 -07:00
Daniel Borkmann
3b3ae88026 net: sched: consolidate tc_classify{,_compat}
For classifiers getting invoked via tc_classify(), we always need an
extra function call into tc_classify_compat(), as both are being
exported as symbols and tc_classify() itself doesn't do much except
handling of reclassifications when tp->classify() returned with
TC_ACT_RECLASSIFY.

CBQ and ATM are the only qdiscs that directly call into tc_classify_compat(),
all others use tc_classify(). When tc actions are being configured
out in the kernel, tc_classify() effectively does nothing besides
delegating.

We could spare this layer and consolidate both functions. pktgen on
single CPU constantly pushing skbs directly into the netif_receive_skb()
path with a dummy classifier on ingress qdisc attached, improves
slightly from 22.3Mpps to 23.1Mpps.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-27 14:18:48 -07:00
Phil Sutter
906470c19d net: warn if drivers set tx_queue_len = 0
Due to the introduction of IFF_NO_QUEUE, there is a better way for
drivers to indicate that no qdisc should be attached by default. Though,
the old convention can't be dropped since ignoring that setting would
break drivers still using it. Instead, add a warning so out-of-tree
driver maintainers get a chance to adjust their code before we finally
get rid of any special handling of tx_queue_len == 0.

Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-18 11:55:08 -07:00
subashab@codeaurora.org
b469139e81 dev: Spelling fix in comments
Fix the following typo
- unchainged -> unchanged

Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-27 00:54:57 -07:00
Thomas Graf
f38a9eb1f7 dst: Metadata destinations
Introduces a new dst_metadata which enables to carry per packet metadata
between forwarding and processing elements via the skb->dst pointer.

The structure is set up to be a union. Thus, each separate type of
metadata requires its own dst instance. If demand arises to carry
multiple types of metadata concurrently, metadata dst entries can be
made stackable.

The metadata dst entry is refcnt'ed as expected for now but a non
reference counted use is possible if the reference is forced before
queueing the skb.

In order to allow allocating dsts with variable length, the existing
dst_alloc() is split into a dst_alloc() and dst_init() function. The
existing dst_init() function to initialize the subsystem is being
renamed to dst_subsys_init() to make it clear what is what.

The check before ip_route_input() is changed to ignore metadata dsts
and drop the dst inside the routing function thus allowing to interpret
metadata in a later commit.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-21 10:39:05 -07:00
Scott Feldman
0c4f691ff6 net: don't reforward packets already forwarded by offload device
Just before queuing skb for xmit on port, check if skb has been marked by
switchdev port driver as already fordwarded by device.  If so, drop skb.  A
non-zero skb->offload_fwd_mark field is set by the switchdev port
driver/device on ingress to indicate the skb has already been forwarded by
the device to egress ports with matching dev->skb_mark.  The switchdev port
driver would assign a non-zero dev->offload_skb_mark for each device port
netdev during registration, for example.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-20 18:32:44 -07:00
Anuradha Karuppiah
d746d707a8 net core: Add protodown support.
This patch introduces the proto_down flag that can be used by user space
applications to notify switch drivers that errors have been detected on the
device.

The switch driver can react to protodown notification by doing a phys down
on the associated switch port.

Signed-off-by: Anuradha Karuppiah <anuradhak@cumulusnetworks.com>
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-15 21:39:40 -07:00
David S. Miller
638d3c6381 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/bridge/br_mdb.c

Minor conflict in br_mdb.c, in 'net' we added a memset of the
on-stack 'ip' variable whereas in 'net-next' we assign a new
member 'vid'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-13 17:28:09 -07:00
Julian Anastasov
2c17d27c36 net: call rcu_read_lock early in process_backlog
Incoming packet should be either in backlog queue or
in RCU read-side section. Otherwise, the final sequence of
flush_backlog() and synchronize_net() may miss packets
that can run without device reference:

CPU 1                  CPU 2
                       skb->dev: no reference
                       process_backlog:__skb_dequeue
                       process_backlog:local_irq_enable

on_each_cpu for
flush_backlog =>       IPI(hardirq): flush_backlog
                       - packet not found in backlog

                       CPU delayed ...
synchronize_net
- no ongoing RCU
read-side sections

netdev_run_todo,
rcu_barrier: no
ongoing callbacks
                       __netif_receive_skb_core:rcu_read_lock
                       - too late
free dev
                       process packet for freed dev

Fixes: 6e583ce524 ("net: eliminate refcounting in backlog queue")
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-10 18:16:36 -07:00
Julian Anastasov
e9e4dd3267 net: do not process device backlog during unregistration
commit 381c759d99 ("ipv4: Avoid crashing in ip_error")
fixes a problem where processed packet comes from device
with destroyed inetdev (dev->ip_ptr). This is not expected
because inetdev_destroy is called in NETDEV_UNREGISTER
phase and packets should not be processed after
dev_close_many() and synchronize_net(). Above fix is still
required because inetdev_destroy can be called for other
reasons. But it shows the real problem: backlog can keep
packets for long time and they do not hold reference to
device. Such packets are then delivered to upper levels
at the same time when device is unregistered.
Calling flush_backlog after NETDEV_UNREGISTER_FINAL still
accounts all packets from backlog but before that some packets
continue to be delivered to upper levels long after the
synchronize_net call which is supposed to wait the last
ones. Also, as Eric pointed out, processed packets, mostly
from other devices, can continue to add new packets to backlog.

Fix the problem by moving flush_backlog early, after the
device driver is stopped and before the synchronize_net() call.
Then use netif_running check to make sure we do not add more
packets to backlog. We have to do it in enqueue_to_backlog
context when the local IRQ is disabled. As result, after the
flush_backlog and synchronize_net sequence all packets
should be accounted.

Thanks to Eric W. Biederman for the test script and his
valuable feedback!

Reported-by: Vittorio Gambaletta <linuxbugs@vittgam.net>
Fixes: 6e583ce524 ("net: eliminate refcounting in backlog queue")
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-10 18:16:36 -07:00
Nicolas Dichtel
95ec655bc4 Revert "dev: set iflink to 0 for virtual interfaces"
This reverts commit e1622baf54.

The side effect of this commit is to add a '@NONE' after each virtual
interface name with a 'ip link'. It may break existing scripts.

Reported-by: Olivier Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-08 15:52:33 -07:00
Eric Dumazet
d339727c2b net: graceful exit from netif_alloc_netdev_queues()
User space can crash kernel with

ip link add ifb10 numtxqueues 100000 type ifb

We must replace a BUG_ON() by proper test and return -EINVAL for
crazy values.

Fixes: 60877a32bc ("net: allow large number of tx queues")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-08 15:46:17 -07:00
Eric Dumazet
24ea591d22 net: sched: extend percpu stats helpers
qdisc_bstats_update_cpu() and other helpers were added to support
percpu stats for qdisc.

We want to add percpu stats for tc action, so this patch add common
helpers.

qdisc_bstats_update_cpu() is renamed to qdisc_bstats_cpu_update()
qdisc_qstats_drop_cpu() is renamed to qdisc_qstats_cpu_drop()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-08 13:50:41 -07:00
David S. Miller
941742f497 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-06-08 20:06:56 -07:00
Willem de Bruijn
bbbf2df003 net: replace last open coded skb_orphan_frags with function call
Commit 70008aa50e ("skbuff: convert to skb_orphan_frags") replaced
open coded tests of SKBTX_DEV_ZEROCOPY and skb_copy_ubufs with calls
to helper function skb_orphan_frags. Apply that to the last remaining
open coded site.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-08 12:15:13 -07:00
David S. Miller
bdef7de4b8 net: Add priority to packet_offload objects.
When we scan a packet for GRO processing, we want to see the most
common packet types in the front of the offload_base list.

So add a priority field so we can handle this properly.

IPv4/IPv6 get the highest priority with the implicit zero priority
field.

Next comes ethernet with a priority of 10, and then we have the MPLS
types with a priority of 15.

Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Suggested-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-06-01 14:56:09 -07:00
Daniel Borkmann
e7582bab5d net: dev: reduce both ingress hook ifdefs
Reduce ifdef pollution slightly, no functional change. We can simply
remove the extra alternative definition of handle_ing() and nf_ingress().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-21 16:58:53 -04:00
Florian Westphal
3365495c18 net: core: set qdisc pkt len before tc_classify
commit d2788d3488 ("net: sched: further simplify handle_ing")
removed the call to qdisc_enqueue_root().

However, after this removal we no longer set qdisc pkt length.
This breaks traffic policing on ingress.

This is the minimum fix: set qdisc pkt length before tc_classify.

Only setting the length does remove support for 'stab' on ingress, but
as Alexei pointed out:
 "Though it was allowed to add qdisc_size_table to ingress, it's useless.
  Nothing takes advantage of recomputed qdisc_pkt_len".

Jamal suggested to use qdisc_pkt_len_init(), but as Eric mentioned that
would result in qdisc_pkt_len_init to no longer get inlined due to the
additional 2nd call site.

ingress policing is rare and GRO doesn't really work that well with police
on ingress, as we see packets > mtu and drop skbs that  -- without
aggregation -- would still have fitted the policier budget.
Thus to have reliable/smooth ingress policing GRO has to be turned off.

Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Fixes: d2788d3488 ("net: sched: further simplify handle_ing")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 22:44:40 -04:00
Pablo Neira
e687ad60af netfilter: add netfilter ingress hook after handle_ing() under unique static key
This patch adds the Netfilter ingress hook just after the existing tc ingress
hook, that seems to be the consensus solution for this.

Note that the Netfilter hook resides under the global static key that enables
ingress filtering. Nonetheless, Netfilter still also has its own static key for
minimal impact on the existing handle_ing().

* Without this patch:

Result: OK: 6216490(c6216338+d152) usec, 100000000 (60byte,0frags)
  16086246pps 7721Mb/sec (7721398080bps) errors: 100000000

    42.46%  kpktgend_0   [kernel.kallsyms]   [k] __netif_receive_skb_core
    25.92%  kpktgend_0   [kernel.kallsyms]   [k] kfree_skb
     7.81%  kpktgend_0   [pktgen]            [k] pktgen_thread_worker
     5.62%  kpktgend_0   [kernel.kallsyms]   [k] ip_rcv
     2.70%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_internal
     2.34%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_sk
     1.44%  kpktgend_0   [kernel.kallsyms]   [k] __build_skb

* With this patch:

Result: OK: 6214833(c6214731+d101) usec, 100000000 (60byte,0frags)
  16090536pps 7723Mb/sec (7723457280bps) errors: 100000000

    41.23%  kpktgend_0      [kernel.kallsyms]  [k] __netif_receive_skb_core
    26.57%  kpktgend_0      [kernel.kallsyms]  [k] kfree_skb
     7.72%  kpktgend_0      [pktgen]           [k] pktgen_thread_worker
     5.55%  kpktgend_0      [kernel.kallsyms]  [k] ip_rcv
     2.78%  kpktgend_0      [kernel.kallsyms]  [k] netif_receive_skb_internal
     2.06%  kpktgend_0      [kernel.kallsyms]  [k] netif_receive_skb_sk
     1.43%  kpktgend_0      [kernel.kallsyms]  [k] __build_skb

* Without this patch + tc ingress:

        tc filter add dev eth4 parent ffff: protocol ip prio 1 \
                u32 match ip dst 4.3.2.1/32

Result: OK: 9269001(c9268821+d179) usec, 100000000 (60byte,0frags)
  10788648pps 5178Mb/sec (5178551040bps) errors: 100000000

    40.99%  kpktgend_0   [kernel.kallsyms]  [k] __netif_receive_skb_core
    17.50%  kpktgend_0   [kernel.kallsyms]  [k] kfree_skb
    11.77%  kpktgend_0   [cls_u32]          [k] u32_classify
     5.62%  kpktgend_0   [kernel.kallsyms]  [k] tc_classify_compat
     5.18%  kpktgend_0   [pktgen]           [k] pktgen_thread_worker
     3.23%  kpktgend_0   [kernel.kallsyms]  [k] tc_classify
     2.97%  kpktgend_0   [kernel.kallsyms]  [k] ip_rcv
     1.83%  kpktgend_0   [kernel.kallsyms]  [k] netif_receive_skb_internal
     1.50%  kpktgend_0   [kernel.kallsyms]  [k] netif_receive_skb_sk
     0.99%  kpktgend_0   [kernel.kallsyms]  [k] __build_skb

* With this patch + tc ingress:

        tc filter add dev eth4 parent ffff: protocol ip prio 1 \
                u32 match ip dst 4.3.2.1/32

Result: OK: 9308218(c9308091+d126) usec, 100000000 (60byte,0frags)
  10743194pps 5156Mb/sec (5156733120bps) errors: 100000000

    42.01%  kpktgend_0   [kernel.kallsyms]   [k] __netif_receive_skb_core
    17.78%  kpktgend_0   [kernel.kallsyms]   [k] kfree_skb
    11.70%  kpktgend_0   [cls_u32]           [k] u32_classify
     5.46%  kpktgend_0   [kernel.kallsyms]   [k] tc_classify_compat
     5.16%  kpktgend_0   [pktgen]            [k] pktgen_thread_worker
     2.98%  kpktgend_0   [kernel.kallsyms]   [k] ip_rcv
     2.84%  kpktgend_0   [kernel.kallsyms]   [k] tc_classify
     1.96%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_internal
     1.57%  kpktgend_0   [kernel.kallsyms]   [k] netif_receive_skb_sk

Note that the results are very similar before and after.

I can see gcc gets the code under the ingress static key out of the hot path.
Then, on that cold branch, it generates the code to accomodate the netfilter
ingress static key. My explanation for this is that this reduces the pressure
on the instruction cache for non-users as the new code is out of the hot path,
and it comes with minimal impact for tc ingress users.

Using gcc version 4.8.4 on:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
[...]
L1d cache:             16K
L1i cache:             64K
L2 cache:              2048K
L3 cache:              8192K

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 01:10:05 -04:00
Pablo Neira
1cf51900f8 net: add CONFIG_NET_INGRESS to enable ingress filtering
This new config switch enables the ingress filtering infrastructure that is
controlled through the ingress_needed static key. This prepares the
introduction of the Netfilter ingress hook that resides under this unique
static key.

Note that CONFIG_SCH_INGRESS automatically selects this, that should be no
problem since this also depends on CONFIG_NET_CLS_ACT.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14 01:10:05 -04:00
Jiri Pirko
638b2a699f net: move netdev_pick_tx and dependencies to net/core/dev.c
next to its user. No relation to flow_dissector so it makes no sense to
have it in flow_dissector.c

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:46 -04:00
Jiri Pirko
5605c76240 net: move __skb_tx_hash to dev.c
__skb_tx_hash function has no relation to flow_dissect so just move it
to dev.c

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 15:19:46 -04:00
David S. Miller
b04096ff33 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Four minor merge conflicts:

1) qca_spi.c renamed the local variable used for the SPI device
   from spi_device to spi, meanwhile the spi_set_drvdata() call
   got moved further up in the probe function.

2) Two changes were both adding new members to codel params
   structure, and thus we had overlapping changes to the
   initializer function.

3) 'net' was making a fix to sk_release_kernel() which is
   completely removed in 'net-next'.

4) In net_namespace.c, the rtnl_net_fill() call for GET operations
   had the command value fixed, meanwhile 'net-next' adjusted the
   argument signature a bit.

This also matches example merge resolutions posted by Stephen
Rothwell over the past two days.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-13 14:31:43 -04:00
Denys Vlasenko
a2029240e5 net: deinline netif_tx_stop_all_queues(), remove WARN_ON in netif_tx_stop_queue()
These functions compile to 60 bytes of machine code each.
With this .config: http://busybox.net/~vda/kernel_config
there are 617 calls of netif_tx_stop_queue()
and 49 calls of netif_tx_stop_all_queues() in vmlinux.

To fix this, remove WARN_ON in netif_tx_stop_queue()
as suggested by davem, and deinline netif_tx_stop_all_queues().

Change in code size is about 20k:

   text      data      bss       dec     hex filename
82426986 22255416 20627456 125309858 77813a2 vmlinux.before
82406248 22255416 20627456 125289120 777c2a0 vmlinux

gcc-4.7.2 still creates deinlined version of netif_tx_stop_queue
sometimes:

$ nm --size-sort vmlinux | grep netif_tx_stop_queue | wc -l
190

ffffffff81b558a8 <netif_tx_stop_queue>:
ffffffff81b558a8:       55                      push   %rbp
ffffffff81b558a9:       48 89 e5                mov    %rsp,%rbp
ffffffff81b558ac:       f0 80 8f e0 01 00 00    lock orb $0x1,0x1e0(%rdi)
ffffffff81b558b3:       01
ffffffff81b558b4:       5d                      pop    %rbp
ffffffff81b558b5:       c3                      retq

This needs additional fixing.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Alexei Starovoitov <alexei.starovoitov@gmail.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Joe Perches <joe@perches.com>
CC: David S. Miller <davem@davemloft.net>
CC: Jiri Pirko <jpirko@redhat.com>
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
CC: netfilter-devel@vger.kernel.org
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-12 23:05:35 -04:00
Daniel Borkmann
d2788d3488 net: sched: further simplify handle_ing
Ingress qdisc has no other purpose than calling into tc_classify()
that executes attached classifier(s) and action(s).

It has a 1:1 relationship to dev->ingress_queue. After having commit
087c1a601a ("net: sched: run ingress qdisc without locks") removed
the central ingress lock, one major contention point is gone.

The extra indirection layers however, are not necessary for calling
into ingress qdisc. pktgen calling locally into netif_receive_skb()
with a dummy u32, single CPU result on a Supermicro X10SLM-F, Xeon
E3-1240: before ~21,1 Mpps, after patch ~22,9 Mpps.

We can redirect the private classifier list to the netdev directly,
without changing any classifier API bits (!) and execute on that from
handle_ing() side. The __QDISC_STATE_DEACTIVATE test can be removed,
ingress qdisc doesn't have a queue and thus dev_deactivate_queue()
is also not applicable, ingress_cl_list provides similar behaviour.
In other words, ingress qdisc acts like TCQ_F_BUILTIN qdisc.

One next possible step is the removal of the dev's ingress (dummy)
netdev_queue, and to only have the list member in the netdevice
itself.

Note, the filter chain is RCU protected and individual filter elements
are being kfree'd by sched subsystem after RCU grace period. RCU read
lock is being held by __netif_receive_skb_core().

Joint work with Alexei Starovoitov.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11 11:10:35 -04:00
Daniel Borkmann
c9e99fd078 net: sched: consolidate handle_ing and ing_filter
Given quite some code has been removed from ing_filter(), we can just
consolidate that function into handle_ing() and get rid of a few
instructions at the same time.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-11 11:10:34 -04:00
Vlad Yasevich
d66bf7dd27 net: core: Correct an over-stringent device loop detection.
The code in __netdev_upper_dev_link() has an over-stringent
loop detection logic that actually prevents valid configurations
from working correctly.

In particular, the logic returns an error if an upper device
is already in the list of all upper devices for a given dev.
This particular check seems to be a overzealous as it disallows
perfectly valid configurations.  For example:
  # ip l a link eth0 name eth0.10 type vlan id 10
  # ip l a dev br0 typ bridge
  # ip l s eth0.10 master br0
  # ip l s eth0 master br0  <--- Will fail

If you switch the last two commands (add eth0 first), then both
will succeed.  If after that, you remove eth0 and try to re-add
it, it will fail!

It appears to be enough to simply check adj_list to keeps things
safe.

I've tried stacking multiple devices multiple times in all different
combinations, and either rx_handler registration prevented the stacking
of the device linking cought the error.

Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Veaceslav Falico <vfalico@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-04 14:57:59 -04:00
Jamal Hadi Salim
c19ae86a51 tc: remove unused redirect ttl
improves ingress+u32 performance from 22.4 Mpps to 22.9 Mpps

Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Florian Westphal <fw@strlen.de>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-04 12:16:12 -04:00
Alexei Starovoitov
087c1a601a net: sched: run ingress qdisc without locks
TC classifiers/actions were converted to RCU by John in the series:
http://thread.gmane.org/gmane.linux.network/329739/focus=329739
and many follow on patches.
This is the last patch from that series that finally drops
ingress spin_lock.

Single cpu ingress+u32 performance goes from 22.9 Mpps to 24.5 Mpps.

In two cpu case when both cores are receiving traffic on the same
device and go into the same ingress+u32 the performance jumps
from 4.5 + 4.5 Mpps to 23.5 + 23.5 Mpps

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-03 23:42:03 -04:00
Eric Dumazet
a31196b07f net: rfs: fix crash in get_rps_cpus()
Commit 567e4b7973 ("net: rfs: add hash collision detection") had one
mistake :

RPS_NO_CPU is no longer the marker for invalid cpu in set_rps_cpu()
and get_rps_cpu(), as @next_cpu was the result of an AND with
rps_cpu_mask

This bug showed up on a host with 72 cpus :
next_cpu was 0x7f, and the code was trying to access percpu data of an
non existent cpu.

In a follow up patch, we might get rid of compares against nr_cpu_ids,
if we init the tables with 0. This is silly to test for a very unlikely
condition that exists only shortly after table initialization, as
we got rid of rps_reset_sock_flow() and similar functions that were
writing this RPS_NO_CPU magic value at flow dismantle : When table is
old enough, it never contains this value anymore.

Fixes: 567e4b7973 ("net: rfs: add hash collision detection")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-26 16:07:57 -04:00
Johannes Berg
8b86a61da3 net: remove unused 'dev' argument from netif_needs_gso()
In commit 04ffcb255f ("net: Add ndo_gso_check") Tom originally
added the 'dev' argument to be able to call ndo_gso_check().

Then later, when generalizing this in commit 5f35227ea3
("net: Generalize ndo_gso_check to ndo_features_check")
Jesse removed the call to ndo_gso_check() in netif_needs_gso()
by calling the new ndo_features_check() in a different place.
This made the 'dev' argument unused.

Remove the unused argument and go back to the code as before.

Cc: Tom Herbert <therbert@google.com>
Cc: Jesse Gross <jesse@nicira.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-17 13:29:41 -04:00
Daniel Borkmann
4577139b2d net: use jump label patching for ingress qdisc in __netif_receive_skb_core
Even if we make use of classifier and actions from the egress
path, we're going into handle_ing() executing additional code
on a per-packet cost for ingress qdisc, just to realize that
nothing is attached on ingress.

Instead, this can just be blinded out as a no-op entirely with
the use of a static key. On input fast-path, we already make
use of static keys in various places, e.g. skb time stamping,
in RPS, etc. It makes sense to not waste time when we're assured
that no ingress qdisc is attached anywhere.

Enabling/disabling of that code path is being done via two
helpers, namely net_{inc,dec}_ingress_queue(), that are being
invoked under RTNL mutex when a ingress qdisc is being either
initialized or destructed.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-13 13:34:40 -04:00
David Miller
7026b1ddb6 netfilter: Pass socket pointer down through okfn().
On the output paths in particular, we have to sometimes deal with two
socket contexts.  First, and usually skb->sk, is the local socket that
generated the frame.

And second, is potentially the socket used to control a tunneling
socket, such as one the encapsulates using UDP.

We do not want to disassociate skb->sk when encapsulating in order
to fix this, because that would break socket memory accounting.

The most extreme case where this can cause huge problems is an
AF_PACKET socket transmitting over a vxlan device.  We hit code
paths doing checks that assume they are dealing with an ipv4
socket, but are actually operating upon the AF_PACKET one.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-07 15:25:55 -04:00
David S. Miller
c85d6975ef Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/mellanox/mlx4/cmd.c
	net/core/fib_rules.c
	net/ipv4/fib_frontend.c

The fib_rules.c and fib_frontend.c conflicts were locking adjustments
in 'net' overlapping addition and removal of code in 'net-next'.

The mlx4 conflict was a bug fix in 'net' happening in the same
place a constant was being replaced with a more suitable macro.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-06 22:34:15 -04:00
hannes@stressinduktion.org
f60e5990d9 ipv6: protect skb->sk accesses from recursive dereference inside the stack
We should not consult skb->sk for output decisions in xmit recursion
levels > 0 in the stack. Otherwise local socket settings could influence
the result of e.g. tunnel encapsulation process.

ipv6 does not conform with this in three places:

1) ip6_fragment: we do consult ipv6_npinfo for frag_size

2) sk_mc_loop in ipv6 uses skb->sk and checks if we should
   loop the packet back to the local socket

3) ip6_skb_dst_mtu could query the settings from the user socket and
   force a wrong MTU

Furthermore:
In sk_mc_loop we could potentially land in WARN_ON(1) if we use a
PF_PACKET socket ontop of an IPv6-backed vxlan device.

Reuse xmit_recursion as we are currently only interested in protecting
tunnel devices.

Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-06 16:12:49 -04:00
Nicolas Dichtel
e1622baf54 dev: set iflink to 0 for virtual interfaces
Virtual interfaces are supposed to set an iflink value != of their ifindex.
It was not the case for some of them, like vxlan, bond or bridge.
Let's set iflink to 0 when dev->rtnl_link_ops is set.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 14:05:01 -04:00
Nicolas Dichtel
7a66bbc96c net: remove iflink field from struct net_device
Now that all users of iflink have the ndo_get_iflink handler available, it's
possible to remove this field.

By default, dev_get_iflink() returns the ifindex of the interface.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 14:05:01 -04:00
Nicolas Dichtel
a54acb3a6f dev: introduce dev_get_iflink()
The goal of this patch is to prepare the removal of the iflink field. It
introduces a new ndo function, which will be implemented by virtual interfaces.

There is no functional change into this patch. All readers of iflink field
now call dev_get_iflink().

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-02 14:04:59 -04:00
Jiri Pirko
fbcb217059 net: rename dev to orig_dev in deliver_ptype_list_skb
Unlike other places, this function uses name "dev" for what should be
"orig_dev", which might be a bit confusing. So fix this.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-31 16:37:43 -04:00
Toshiaki Makita
e38f30256b net: Introduce passthru_features_check
As there are a number of (especially virtual) devices that don't
need the multiple vlan check, introduce passthru_features_check() for
convenience.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-29 13:33:22 -07:00
Toshiaki Makita
8cb65d0008 net: Move check for multiple vlans to drivers
To allow drivers to handle the features check for multiple tags,
move the check to ndo_features_check().
As no drivers currently handle multiple tagged TSO, introduce
dflt_features_check() and call it if the driver does not have
ndo_features_check().

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-29 13:33:22 -07:00
Toshiaki Makita
f5a7fb88e1 vlan: Introduce helper functions to check if skb is tagged
Separate the two checks for single vlan and multiple vlans in
netif_skb_features().  This allows us to move the check for multiple
vlans to another function later.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-29 13:33:22 -07:00
WANG Cong
08b4b8ea79 net: clear skb->priority when forwarding to another netns
skb->priority can be set for two purposes:

1) With respect to IP TOS field, which is computed by a mask.
Ususally used for priority qdisc's (pfifo, prio etc.), on TX
side (we only have ingress qdisc on RX side).

2) Used as a classid or flowid, works in the same way with tc
classid. What's more, this can even override the classid
of tc filters.

For case 1), it has been respected within its netns, I don't
see any point of keeping it for another netns, especially
when packets will be forwarded to Rx path (no matter from TX
path or RX path).

For case 2) we care, our applications run inside a netns,
and we classify the packets by our own filters outside,
If some application sets this priority, it could bypass
our filters, therefore clear it when moving out of a netns,
it makes no sense to bypass tc filters out of its netns.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:43:08 -04:00
David S. Miller
99c4a26a15 net: Fix high overhead of vlan sub-device teardown.
When a networking device is taken down that has a non-trivial number
of VLAN devices configured under it, we eat a full synchronize_net()
for every such VLAN device.

This is because of the call chain:

	NETDEV_DOWN notifier
	--> vlan_device_event()
		--> dev_change_flags()
		--> __dev_change_flags()
		--> __dev_close()
		--> __dev_close_many()
		--> dev_deactivate_many()
			--> synchronize_net()

This is kind of rediculous because we already have infrastructure for
batching doing operation X to a list of net devices so that we only
incur one sync.

So make use of that by exporting dev_close_many() and adjusting it's
interfaace so that the caller can fully manage the batch list.  Use
this in vlan_device_event() and all the overhead goes away.

Reported-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:52:56 -04:00
David Ahern
db24a9044e net: add support for phys_port_name
Similar to port id allow netdevices to specify port names and export
the name via sysfs. Drivers can implement the netdevice operation to
assist udev in having sane default names for the devices using the
rule:

$ cat /etc/udev/rules.d/80-net-setup-link.rules
SUBSYSTEM=="net", ACTION=="add", ATTR{phys_port_name}!="",
NAME="$attr{phys_port_name}"

Use of phys_name versus phys_id was suggested-by Jiri Pirko.

Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:30:35 -04:00
Eric W. Biederman
efd7ef1c19 net: Kill hold_net release_net
hold_net and release_net were an idea that turned out to be useless.
The code has been disabled since 2008.  Kill the code it is long past due.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-12 14:39:40 -04:00
Matthew Thode
a4176a9391 net: reject creation of netdev names with colons
colons are used as a separator in netdev device lookup in dev_ioctl.c

Specific functions are SIOCGIFTXQLEN SIOCETHTOOL SIOCSIFNAME

Signed-off-by: Matthew Thode <mthode@mthode.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-21 21:45:25 -05:00
Masanari Iida
4a26e453d9 net/core: Fix warning while make xmldocs caused by dev.c
This patch fix following warning wile make xmldocs.

  Warning(.//net/core/dev.c:5345): No description found
  for parameter 'bonding_info'
  Warning(.//net/core/dev.c:5345): Excess function parameter
  'netdev_bonding_info' description in 'netdev_bonding_info_change'

This warning starts to appear after following patch was added
into Linus's tree during merger period.

commit 61bd3857ff
net/core: Add event for a change in slave state

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-14 20:34:49 -08:00
Tom Herbert
15e2396d4e net: Infrastructure for CHECKSUM_PARTIAL with remote checsum offload
This patch adds infrastructure so that remote checksum offload can
set CHECKSUM_PARTIAL instead of calling csum_partial and writing
the modfied checksum field.

Add skb_remcsum_adjust_partial function to set an skb for using
CHECKSUM_PARTIAL with remote checksum offload.  Changed
skb_remcsum_process and skb_gro_remcsum_process to take a boolean
argument to indicate if checksum partial can be set or the
checksum needs to be modified using the normal algorithm.

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-11 15:12:12 -08:00
David S. Miller
2573beec56 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09 14:35:57 -08:00
Eric Dumazet
567e4b7973 net: rfs: add hash collision detection
Receive Flow Steering is a nice solution but suffers from
hash collisions when a mix of connected and unconnected traffic
is received on the host, when flow hash table is populated.

Also, clearing flow in inet_release() makes RFS not very good
for short lived flows, as many packets can follow close().
(FIN , ACK packets, ...)

This patch extends the information stored into global hash table
to not only include cpu number, but upper part of the hash value.

I use a 32bit value, and dynamically split it in two parts.

For host with less than 64 possible cpus, this gives 6 bits for the
cpu number, and 26 (32-6) bits for the upper part of the hash.

Since hash bucket selection use low order bits of the hash, we have
a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
enough.

If the hash found in flow table does not match, we fallback to RPS (if
it is enabled for the rxqueue).

This means that a packet for an non connected flow can avoid the
IPI through a unrelated/victim CPU.

This also means we no longer have to clear the table at socket
close time, and this helps short lived flows performance.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-08 16:53:57 -08:00
Eric Dumazet
91e83133e7 net: use netif_rx_ni() from process context
Hotpluging a cpu might be rare, yet we have to use proper
handlers when taking over packets found in backlog queues.

dev_cpu_callback() runs from process context, thus we should
call netif_rx_ni() to properly invoke softirq handler.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-07 22:43:52 -08:00
David S. Miller
6e03f896b5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/vxlan.c
	drivers/vhost/net.c
	include/linux/if_vlan.h
	net/core/dev.c

The net/core/dev.c conflict was the overlap of one commit marking an
existing function static whilst another was adding a new function.

In the include/linux/if_vlan.h case, the type used for a local
variable was changed in 'net', whereas the function got rewritten
to fix a stacked vlan bug in 'net-next'.

In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next'
overlapped with an endainness fix for VHOST 1.0 in 'net'.

In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter
in 'net-next' whereas in 'net' there was a bug fix to pass in the
correct network namespace pointer in calls to this function.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 14:33:28 -08:00
Eric Dumazet
2ce1ee1780 net: remove some sparse warnings
netdev_adjacent_add_links() and netdev_adjacent_del_links()
are static.

queue->qdisc has __rcu annotation, need to use RCU_INIT_POINTER()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05 00:41:17 -08:00
Moni Shoua
61bd3857ff net/core: Add event for a change in slave state
Add event which provides an indication on a change in the state
of a bonding slave. The event handler should cast the pointer to the
appropriate type (struct netdev_bonding_info) in order to get the
full info about the slave.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04 16:14:24 -08:00
Toshiaki Makita
d4bcef3fbe net: Fix vlan_get_protocol for stacked vlan
vlan_get_protocol() could not get network protocol if a skb has a 802.1ad
vlan tag or multiple vlans, which caused incorrect checksum calculation
in several drivers.

Fix vlan_get_protocol() to retrieve network protocol instead of incorrect
vlan protocol.

As the logic is the same as skb_network_protocol(), create a common helper
function __vlan_get_protocol() and call it from existing functions.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-30 18:03:47 -08:00
Salam Noureddine
7866a62104 dev: add per net_device packet type chains
When many pf_packet listeners are created on a lot of interfaces the
current implementation using global packet type lists scales poorly.
This patch adds per net_device packet type lists to fix this problem.

The patch was originally written by Eric Biederman for linux-2.6.29.
Tested on linux-3.16.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-29 14:41:39 -08:00
David S. Miller
95f873f2ff Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	arch/arm/boot/dts/imx6sx-sdb.dts
	net/sched/cls_bpf.c

Two simple sets of overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-27 16:59:56 -08:00
Eric Dumazet
ac64da0b83 net: rps: fix cpu unplug
softnet_data.input_pkt_queue is protected by a spinlock that
we must hold when transferring packets from victim queue to an active
one. This is because other cpus could still be trying to enqueue packets
into victim queue.

A second problem is that when we transfert the NAPI poll_list from
victim to current cpu, we absolutely need to special case the percpu
backlog, because we do not want to add complex locking to protect
process_queue : Only owner cpu is allowed to manipulate it, unless cpu
is offline.

Based on initial patch from Prasad Sodagudi & Subash Abhinov
Kasiviswanathan.

This version is better because we do not slow down packet processing,
only make migration safer.

Reported-by: Prasad Sodagudi <psodagud@codeaurora.org>
Reported-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-16 01:02:42 -05:00
Jiri Pirko
df8a39defa net: rename vlan_tx_* helpers since "tx" is misleading there
The same macros are used for rx as well. So rename it.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-13 17:51:08 -05:00
Pankaj Gupta
1059590254 net: allow large number of rx queues
netif_alloc_rx_queues() uses kcalloc() to allocate memory
for "struct netdev_queue *_rx" array.
If we are doing large rx queue allocation kcalloc() might
fail, so this patch does a fallback to vzalloc().
Similar implementation is done for tx queue allocation in
netif_alloc_netdev_queues().

We avoid failure of high order memory allocation
with the help of vzalloc(), this allows us to do large
rx and tx queue allocation which in turn helps us to
increase the number of queues in tun.

As vmalloc() adds overhead on a critical network path,
__GFP_REPEAT flag is used with kzalloc() to do this fallback
only when really needed.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Gibson <dgibson@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-12 17:05:05 -05:00
Jesse Gross
5f35227ea3 net: Generalize ndo_gso_check to ndo_features_check
GSO isn't the only offload feature with restrictions that
potentially can't be expressed with the current features mechanism.
Checksum is another although it's a general issue that could in
theory apply to anything. Even if it may be possible to
implement these restrictions in other ways, it can result in
duplicate code or inefficient per-packet behavior.

This generalizes ndo_gso_check so that drivers can remove any
features that don't make sense for a given packet, similar to
netif_skb_features(). It also converts existing driver
restrictions to the new format, completing the work that was
done to support tunnel protocols since the issues apply to
checksums as well.

By actually removing features from the set that are used to do
offloading, it solves another problem with the existing
interface. In these cases, GSO would run with the original set
of features and not do anything because it appears that
segmentation is not required.

CC: Tom Herbert <therbert@google.com>
CC: Joe Stringer <joestringer@nicira.com>
CC: Eric Dumazet <edumazet@google.com>
CC: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by:  Tom Herbert <therbert@google.com>
Fixes: 04ffcb255f ("net: Add ndo_gso_check")
Tested-by: Hayes Wang <hayeswang@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-26 17:20:56 -05:00
Jay Vosburgh
2c26d34bbc net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding
When using VXLAN tunnels and a sky2 device, I have experienced
checksum failures of the following type:

[ 4297.761899] eth0: hw csum failure
[...]
[ 4297.765223] Call Trace:
[ 4297.765224]  <IRQ>  [<ffffffff8172f026>] dump_stack+0x46/0x58
[ 4297.765235]  [<ffffffff8162ba52>] netdev_rx_csum_fault+0x42/0x50
[ 4297.765238]  [<ffffffff8161c1a0>] ? skb_push+0x40/0x40
[ 4297.765240]  [<ffffffff8162325c>] __skb_checksum_complete+0xbc/0xd0
[ 4297.765243]  [<ffffffff8168c602>] tcp_v4_rcv+0x2e2/0x950
[ 4297.765246]  [<ffffffff81666ca0>] ? ip_rcv_finish+0x360/0x360

	These are reliably reproduced in a network topology of:

container:eth0 == host(OVS VXLAN on VLAN) == bond0 == eth0 (sky2) -> switch

	When VXLAN encapsulated traffic is received from a similarly
configured peer, the above warning is generated in the receive
processing of the encapsulated packet.  Note that the warning is
associated with the container eth0.

        The skbs from sky2 have ip_summed set to CHECKSUM_COMPLETE, and
because the packet is an encapsulated Ethernet frame, the checksum
generated by the hardware includes the inner protocol and Ethernet
headers.

	The receive code is careful to update the skb->csum, except in
__dev_forward_skb, as called by dev_forward_skb.  __dev_forward_skb
calls eth_type_trans, which in turn calls skb_pull_inline(skb, ETH_HLEN)
to skip over the Ethernet header, but does not update skb->csum when
doing so.

	This patch resolves the problem by adding a call to
skb_postpull_rcsum to update the skb->csum after the call to
eth_type_trans.

Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-26 16:16:51 -05:00
Toshiaki Makita
796f2da81b net: Fix stacked vlan offload features computation
When vlan tags are stacked, it is very likely that the outer tag is stored
in skb->vlan_tci and skb->protocol shows the inner tag's vlan_proto.
Currently netif_skb_features() first looks at skb->protocol even if there
is the outer tag in vlan_tci, thus it incorrectly retrieves the protocol
encapsulated by the inner vlan instead of the inner vlan protocol.
This allows GSO packets to be passed to HW and they end up being
corrupted.

Fixes: 58e998c6d2 ("offloading: Force software GSO for multiple vlan tags.")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-24 00:08:33 -05:00
Pravin B Shelar
d0edc7bf39 mpls: Fix config check for mpls.
Fixes MPLS GSO for case when mpls is compiled as kernel module.

Fixes: 0d89d2035f ("MPLS: Add limited GSO support").
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-23 23:57:30 -05:00
Herbert Xu
ceb8d5bf17 net: Rearrange loop in net_rx_action
This patch rearranges the loop in net_rx_action to reduce the
amount of jumping back and forth when reading the code.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-23 23:20:21 -05:00
Herbert Xu
6bd373ebba net: Always poll at least one device in net_rx_action
We should only perform the softnet_break check after we have polled
at least one device in net_rx_action.  Otherwise a zero or negative
setting of netdev_budget can lock up the whole system.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-23 23:20:21 -05:00