The driver is written as if it can adapt to a low memory situation allocating
less RX skbs and TX aligned buffers than the respective RX/TX ring sizes. In
reality though the driver would malfunction in this case. Stop being overly
smart and just fail in such situation -- this is achieved by moving the memory
allocation from ravb_ring_format() to ravb_ring_init().
We leave dma_map_single() calls in place but make their failure non-fatal
by marking the corresponding RX descriptors with zero data size which should
prevent DMA to an invalid addresses.
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add debugfs entry 'use_backdoor' to enable backdoor access to read sge
context. By default, we read sge context's via firmware. In case of FW
issues, one can enable backdoor access via debugfs to dump sge context
for debugging purpose.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix warning: logical ‘or’ of collectively exhaustive tests is always true
Change the internal delay check from an 'or' condition to an 'and'
condition.
Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Dan Murphy <dmurphy@ti.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If user did not specify an oif, try and get it from the via address.
If failed to get device, return with -ENODEV.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some architectures like POWER can have a NUMA node_possible_map that
contains sparse entries. This causes memory corruption with openvswitch
since it allocates flow_cache with a multiple of num_possible_nodes() and
assumes the node variable returned by for_each_node will index into
flow->stats[node].
Use nr_node_ids to allocate a maximal sparse array instead of
num_possible_nodes().
The crash was noticed after 3af229f2 was applied as it changed the
node_possible_map to match node_online_map on boot.
Fixes: 3af229f207
Signed-off-by: Chris J Arges <chris.j.arges@canonical.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree says:
====================
sfc: support for cascaded multicast filtering
Recent versions of firmware for SFC9100 adapters add support for filter
chaining, in which packets matching multiple filters are delivered to all
filters' recipients, rather than only the highest match-priority filter as was
previously the case.
This patch series enables this feature and redesigns the filter handling code
to make use of it; in particular, subscribing to a multicast address on one
function no longer prevents traffic to that address reaching another function
which is in promiscuous or allmulti mode.
If the firmware does not support filter chaining, the driver will fall back to
the old behaviour.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Separate functions for inserting individual and promisc filters; explicit
fallback logic in efx_ef10_filter_sync_rx_mode(), in order not to overload
the 'promisc' flag as also meaning "fall back to promisc".
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the workaround to support cascaded multicast filters ("workaround_26807") is
enabled, the broadcast filter and individual multicast filters are not inserted
when in promiscuous or allmulti mode.
There is a race while inserting and removing filters when entering and leaving
promiscuous mode. When changing promiscuous state with cascaded multicast
filters, the old multicast filters are removed before inserting the new filters
to avoid duplicating packets; this can lead to dropped packets until all
filters have been inserted.
The efx_nic:mc_promisc flag is added to record the presence of a multicast
promiscuous filter; this gives a simple way to tell if the promiscuous state is
changing.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change is only re-factoring; there are no changes to functionality
except for a slight elaboration of an error message (on mismatch filter
insertion failure).
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a function is in promiscuous mode and another function has a broadcast or
multicast filter inserted, the function in promiscuous mode won't see that
broadcast or multicast traffic.
Most notably this breaks broadcast, which means ARP doesn't work. Less
show-stoppingly, a function listening on a multicast address that's also in
promiscuous mode will not see that multicast traffic if another function is
also listening on that multicast address.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When enabling the workaround for cascaded multicast filters, the MC
can reset other functions if they have already inserted filters.
In that case, the workaround has been enabled, but print an info
message in the log recording that other functions had to be reset.
As other functions were reset, the MC will have incremented its boot
count, so also increment the warm_boot_count on the function which
enabled the workaround, as that function won't have received an MC
reboot event and does not need to reset.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The initial use of this will be to check a flag reporting if an FLR was
performed on other functions when enabling cascaded multicast filters.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
GET_WORKAROUNDS was only introduced in May 2014, not all firmware
will have it. So call sites need to handle ENOSYS.
In this case we're probing the bug26807 workaround, which is not
implemented in any firmware that doesn't have GET_WORKAROUNDS.
So interpret ENOSYS as 'false'.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After creating event queue 0, check to see if the workaround is enabled,
and enable it if necessary. This will be called during PCI probe and
also when coming back up after a reset. The nic_data->workaround_26807
will be used in the future to control the filter insertion behaviour
based on this workaround.
Only the primary PF can enable this workaround, so tolerate an EPERM
error and continue. Otherwise, if any step in the checking and enabling
of the workaround fails, the event queue must be removed.
We check that workaround is implemented before trying to enable it,
and store the current workaround setting before trying to change it.
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Schichan says:
====================
BPF JIT fixes for ARM
These patches are fixing bugs in the ARM JIT and should probably find
their way to a stable kernel. All 60 test_bpf tests in Linux 4.1 release
are now passing OK (was 54 out of 60 before).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes BPF_ANC | SKF_AD_VLAN_TAG and BPF_ANC | SKF_AD_VLAN_TAG_PRESENT
have the same behaviour as the in kernel VM and makes the test_bpf LD_VLAN_TAG
and LD_VLAN_TAG_PRESENT tests pass.
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously, the JIT would reject negative offsets known during code
generation and mishandle negative offsets provided at runtime.
Fix that by calling bpf_internal_load_pointer_neg_helper()
appropriately in the jit_get_skb_{b,h,w} slow path helpers and by forcing
the execution flow to the slow path helpers when the offset is
negative.
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
To check whether the load should take the fast path or not, the code
would check that (r_skb_hlen - load_order) is greater than the offset
of the access using an "Unsigned higher or same" condition. For
halfword accesses and an skb length of 1 at offset 0, that test is
valid, as we end up comparing 0xffffffff(-1) and 0, so the fast path
is taken and the filter allows the load to wrongly succeed. A similar
issue exists for word loads at offset 0 and an skb length of less than
4.
Fix that by using the condition "Signed greater than or equal"
condition for the fast path code for load orders greater than 0.
Signed-off-by: Nicolas Schichan <nschichan@freebox.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Morton reported following warning on one ARM build
with gcc-4.4 :
net/ipv4/inet_hashtables.c: In function 'inet_ehash_locks_alloc':
net/ipv4/inet_hashtables.c:617: warning: division by zero
Even guarded with a test on sizeof(spinlock_t), compiler does not
like current construct on a !CONFIG_SMP build.
Remove the warning by using a temporary variable.
Fixes: 095dc8e0c3 ("tcp: fix/cleanup inet_ehash_locks_alloc()")
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit d999297c3d
("tipc: reduce locking scope during packet reception") we introduced
a new function tipc_link_proto_rcv(). This function contains a bug,
so that it sometimes by error sends out a non-zero link priority value
in created protocol messages.
The bug may lead to an extra link reset at initial link establising
with older nodes. This will never happen more than once, whereafter
the link will work as intended.
We fix this bug in this commit.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli says:
====================
net: enable inband link state negotiation only when explicitly requested
Changes in v5:
- removed an invalid use of the link_update callback in the SF2 driver
was appeared after merging "net: phy: fixed_phy: handle link-down case"
- reworded the commit message for patch 2 to make it clear what it fixes and
why this is required
Initial cover letter from Stas:
Hello.
Currently the link status auto-negotiation is enabled
for any SGMII link with fixed-link DT binding.
The regression was reported:
https://lkml.org/lkml/2015/7/8/865
Apparently not all HW that implements SGMII protocol, generates the
inband status for the auto-negotiation to work.
More details here:
https://lkml.org/lkml/2015/7/10/206
The following patches reverts to the old behavior by default,
which is to not enable the auto-negotiation for fixed-link.
The new DT property is added that allows to explicitly request
the auto-negotiation.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 898b2970e2 ("mvneta: implement SGMII-based in-band link state
signaling") implemented the link parameters auto-negotiation unconditionally.
Unfortunately it appears that some HW that implements SGMII protocol,
doesn't generate the inband status, so it is not possible to auto-negotiate
anything with such HW.
This patch enables the auto-negotiation only if explicitly requested with
the 'managed' DT property.
This patch fixes the following regression:
https://lkml.org/lkml/2015/7/8/865
Signed-off-by: Stas Sergeev <stsp@users.sourceforge.net>
CC: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
CC: netdev@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the PHY management type is selected by the MAC driver arbitrary.
The decision is based on the presence of the "fixed-link" node and on a
will of the driver's authors.
This caused a regression recently, when mvneta driver suddenly started
to use the in-band status for auto-negotiation on fixed links.
It appears the auto-negotiation may not work when expected by the MAC driver.
Sebastien Rannou explains:
<< Yes, I confirm that my HW does not generate an in-band status. AFAIK, it's
a PHY that aggregates 4xSGMIIs to 1xQSGMII ; the MAC side of the PHY (with
inband status) is connected to the switch through QSGMII, and in this context
we are on the media side of the PHY. >>
https://lkml.org/lkml/2015/7/10/206
This patch introduces the new string property 'managed' that allows
the user to set the management type explicitly.
The supported values are:
"auto" - default. Uses either MDIO or nothing, depending on the presence
of the fixed-link node
"in-band-status" - use in-band status
Signed-off-by: Stas Sergeev <stsp@users.sourceforge.net>
CC: Rob Herring <robh+dt@kernel.org>
CC: Pawel Moll <pawel.moll@arm.com>
CC: Mark Rutland <mark.rutland@arm.com>
CC: Ian Campbell <ijc+devicetree@hellion.org.uk>
CC: Kumar Gala <galak@codeaurora.org>
CC: Florian Fainelli <f.fainelli@gmail.com>
CC: Grant Likely <grant.likely@linaro.org>
CC: devicetree@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
fixed_phy_register() currently hardcodes the fixed PHY link to 1, and
expects to find a "speed" parameter to provide correct information
towards the fixed PHY consumer.
In a subsequent change, where we allow "managed" (e.g: (RS)GMII in-band
status auto-negotiation) fixed PHYs, none of these parameters can be
provided since they will be auto-negotiated, hence, we just provide a
zero-initialized fixed_phy_status to fixed_phy_register() which makes it
fail when we call fixed_phy_update_regs() since status.speed = 0 which
makes us hit the "default" label and error out.
Without this change, we would also see potentially inconsistent
speed/duplex parameters for fixed PHYs when the link is DOWN.
CC: netdev@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Stas Sergeev <stsp@users.sourceforge.net>
[florian: add more background to why this is correct and desirable]
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SF2 driver currently overrides speed settings for its port
configured using a fixed PHY, this is both unnecessary and incorrect,
because we keep feedback to the hardware parameters that we read from
the PHY device, which in the case of a fixed PHY cannot possibly change
speed.
This is a required change to allow the fixed PHY code to allow
registering a PHY with a link configured as DOWN by default and avoid
some sort of circular dependency where we require the link_update
callback to run to program the hardware, and we then utilize the fixed
PHY parameters to program the hardware with the same settings.
Fixes: 246d7f773c ("net: dsa: add Broadcom SF2 switch driver")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit a2673b6e04.
Kinglong Mee reports a memory leak with that patch, and Jan Kara confirms:
"Thanks for report! You are right that my patch introduces a race
between fsnotify kthread and fsnotify_destroy_group() which can result
in leaking inotify event on group destruction.
I haven't yet decided whether the right fix is not to queue events for
dying notification group (as that is pointless anyway) or whether we
should just fix the original problem differently... Whenever I look
at fsnotify code mark handling I get lost in the maze of locks, lists,
and subtle differences between how different notification systems
handle notification marks :( I'll think about it over night"
and after thinking about it, Jan says:
"OK, I have looked into the code some more and I found another
relatively simple way of fixing the original oops. It will be IMHO
better than trying to fixup this issue which has more potential for
breakage. I'll ask Linus to revert the fsnotify fix he already merged
and send a new fix"
Reported-by: Kinglong Mee <kinglongmee@gmail.com>
Requested-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fix device ID check for AR956x
iwlwifi:
* bug fixes specific for 8000 series
* fix a crash in time events
* fix a crash in PCIe transport
* fix BT Coex code that prevented association on certain
devices (3160).
* revert the new RBD allocation model because it introduced
a bug when running on weak VM setups.
* new device IDs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQEcBAABAgAGBQJVrQ65AAoJEG4XJFUm622bArgH/jlGm44aPLVTtTfc3Qi/yH1m
pVZ+F6Z4FhFM8Ln/skL/PIWPbxmcwMQ9IYiDI+1y0obr5RaNGZbh5EBwLcNQzAII
L9aO7vGGQRHewJj3LAY4ovkT7xYT6Kra4iZuXrozeq8CJN2/0l4Yv2uPkwPtszIf
Gp1QGgCbvUzaPdIIevx4bMyLcC5h58y7Thg2+kxSSo/VFJxGh2DFFnLuJx5RVS1D
r7fBWH5BzUPP1sh84Gt+0IpjyoxqpWiI//Wqg2Hkt6zdis3fixDvK8Wm08EewZdj
Wf63HgOzQeL9vE6IHg3WuUiR8QOn51+oqDWCtbrRemBsywZ9rc4vesMAREjcw2Q=
=ygb2
-----END PGP SIGNATURE-----
Merge tag 'wireless-drivers-for-davem-2015-07-20' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers
Kalle Valo says:
====================
ath9k:
* fix device ID check for AR956x
iwlwifi:
* bug fixes specific for 8000 series
* fix a crash in time events
* fix a crash in PCIe transport
* fix BT Coex code that prevented association on certain
devices (3160).
* revert the new RBD allocation model because it introduced
a bug when running on weak VM setups.
* new device IDs
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The sk_classid member is only required when CONFIG_CGROUP_NET_CLASSID is
enabled. #ifdefify it to reduce the size of struct sock on 32 bit
systems, at least.
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Some dead defines dropped from the Samsung driver, was
targeted for -rc2 but got delayed
- Drop the strict mode from abx500, this was too strict
- Fix the R-Car sparse IRQs code to work as intended
- Fix the IRQ code for the pinctrl-single GPIO backend to not
enforce threaded IRQs
- Clear the latched events/IRQs for the Broadcom BCM2835
driver
- Fix up debugfs for the Freescale imx1 driver
- Fix a typo bug in the Schmitt Trigger setup in the LPC18xx
driver
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVrpAVAAoJEEEQszewGV1zeq8P/1hIdJfHYpVb5whr4Cxq2JFh
RHKFCBGI75JDj+K7dBjJkflBxnb158rFA7QxEumEFp2VnWFUzlFJeirGDM9KArXO
Wxsp+Lm9oO8U7T1dUXhsEZJTmVNXSiNcYbuaYOkxtuVn4YlVSS/XB3T8dcXPzKRG
3BHuKnOA5qpcvM9FaA1O1UiPwR/wc/SrtX38+c1Wt0dXJO+Tgj9PtiiK6iUQHskZ
rbsxXZEBTP2mcmBBXNtMXbAh9qnL88uG44zSEv1nTDr/jHVYftIVnTdQ07ICT3S9
mCKEloeZuvHPIkttZ9Ddlj5Jf5PbaqvJllSHhE9FPGEjkOgAtfNdf0zN+Zbqhj0F
aZAHtknYRsOXFDKAHJckUvXlumFrOSd/8vDIeaVwC807Lz190syBdgUbKVBtzZYf
r7+HC1y3XIyLk2M2ZiQLwaYJPr5DJqxNgxMm7Wg/E0mmwScPhvMhrYKNJQvSu2f2
hE/l0XigFxaY7JYAj49ltjaCOKXy02IMGTcT7MAYS9mSWeI8XFI+xPN2ZjiUkQLS
4nLG4oC9FfCndcAEYf4f/86L9F1k+5ysH+DsEbkB6aCjz1D3Lijb+IzoRJTH9CE9
jRyQbhtaC3kPJb7Ucsr4RBVCLOevu8E6xiBp0mdeeSc9a2mHZcrE1IVTU033oNOp
GDhPSA4vZApj0YJqZdTw
=pDEn
-----END PGP SIGNATURE-----
Merge tag 'pinctrl-v4.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
"Here are some overly ripe pin control fixes for the v4.2 series.
They got delayed because of various crap commits and having to clean
and rinse the patch stack a few times. Now they are however looking
good.
- some dead defines dropped from the Samsung driver, was targeted for
-rc2 but got delayed
- drop the strict mode from abx500, this was too strict
- fix the R-Car sparse IRQs code to work as intended
- fix the IRQ code for the pinctrl-single GPIO backend to not enforce
threaded IRQs
- clear the latched events/IRQs for the Broadcom BCM2835 driver
- fix up debugfs for the Freescale imx1 driver
- fix a typo bug in the Schmitt Trigger setup in the LPC18xx driver"
* tag 'pinctrl-v4.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: lpc18xx: fix schmitt trigger setup
Subject: pinctrl: imx1-core: Fix debug output in .pin_config_set callback
pinctrl: bcm2835: Clear the event latch register when disabling interrupts
pinctrl: single: ensure pcs irq will not be forced threaded
sh-pfc: fix sparse GPIOs for R-Car SoCs
pinctrl: abx500: remove strict mode
pinctrl: samsung: Remove old unused defines
Pull UDF fix from Jan Kara:
"A fix for UDF corruption when certain disk-format feature is enabled"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Don't corrupt unalloc spacetable when writing it
function __get_dynamic_array_len() is broken. This only changes the
sample code, and I'm pushing this now instead of later because I don't
want others using the broken code as an example when using it for real.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVrpUnAAoJEEjnJuOKh9ld9TgIAM9lgZ9KsAcfWaYHotr6hd7r
cbpAN5L30H5iuwC6rN+gWe4roiIn9csIgR+LHcloBrjmRujQXY4FegeuMCwNZiQO
+J6SXuGxweZ+9kloMSw3RvQw8rp1hIIUwbvkybHNbFJq/w4m4nOraEhrVxxELCGd
iNoOFULVjEUyHoHtttsHzaSnxl3p2TXjHxk4RlZUL3kcxlbmeG2zsQhZF8F0Gw9/
/Q/fMzMGctl8Xj9SWwy6FIyF9DkqXDhSR0adzw/Hd03n1pMj9+YtBKNtnRAbt42E
V+y9hjaiulUuWD7U7q1FVQp1w3ksCX8E3fNooVKeUjA2Zkrbu1EYqpskwXpAtiw=
=Z2MY
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.2-rc2-fix2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing sample code fix from Steven Rostedt:
"He Kuang noticed that the sample code using the trace_event helper
function __get_dynamic_array_len() is broken.
This only changes the sample code, and I'm pushing this now instead of
later because I don't want others using the broken code as an example
when using it for real"
* tag 'trace-v4.2-rc2-fix2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix sample output of dynamic arrays
With commit c03abd8463 ("net: ethernet: cpsw: don't requests IRQs
we don't use") common isr and napi are separated into separate tx isr
and rx isr/napi, but still in rx napi tx events are handled. So removing
the tx event handling in rx napi.
Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Graf says:
====================
Lightweight & flow based encapsulation
This series combines the work previously posted by Roopa, Robert and
myself. It's according to what we discussed at NFWS. The motivation
of this series is to:
* Consolidate code between OVS and the rest of the kernel and get
rid of OVS vports and instead represent them as pure net_devices.
* Introduce a lightweight tunneling mechanism which enables flow
based encapsulation to improve scalability on both RX and TX.
* Do the above in an encapsulation unspecific way so that the
encapsulation type is eventually abstracted away from the user.
* Use the same forwarding decision for both native forwarding and
encapsulation thus allowing to switch between native IPv6 and
UDP encapsulation based on endpoint without requiring additional
logic
The fundamental changes introduces in this series are:
* A new RTA_ENCAP Netlink attribute for routes carrying encapsulation
instructions. Depending on the specified type, the instructions
apply to UDP encapsulations, MPLS and possible other in the future.
* Depending on the encapsulation type, the output function of the
dst is directly overwritten or the dst merely attaches metadata and
relies on a subsequent net_device to apply it to the packet. The
latter is typically used if an inner and outer IP header exist which
require two subsequent routing lookups to be performed.
* A new metadata_dst structure which can be attached to skbs to
carry metadata in between subsystems. This new metadata transport
is used to provide a single interface for VXLAN, routing and OVS
to communicate through metadata.
The OVS interfaces remain as-is but will transparently create a real
VXLAN net_device in the background. iproute2 is extended with a new
use cases:
VXLAN:
ip route add 40.1.1.1/32 encap vxlan id 10 dst 50.1.1.2 dev vxlan0
MPLS:
ip route add 10.1.1.0/30 encap mpls 200 via inet 10.1.1.1 dev swp1
Performance implications:
The additional memory allocation in the receive path should have
performance implications although it is not observable in standard
throughput tests if GRO is properly done. The correct net_device
model outweights the additional cost of the allocation. Furthermore,
this implication can be relaxed by reintroducing a direct unqueued
path from a software device to a consumer like bridge or OVS if
needed.
$ netperf -t TCP_STREAM -H 15.1.1.201
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
15.1.1.201 (15.1.1.201) port 0 AF_INET : demo
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
87380 16384 16384 10.00 9118.17
Changes since v1:
* Properly initialize tun_id as reported by Julian
* Drop dupliate netif_keep_dst() as reported by Alexei
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This gets rid of all OVS specific VXLAN code in the receive and
transmit path by using a VXLAN net_device to represent the vport.
Only a small shim layer remains which takes care of handling the
VXLAN specific OVS Netlink configuration.
Unexports vxlan_sock_add(), vxlan_sock_release(), vxlan_xmit_skb()
since they are no longer needed.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This allows to get rid of the get_name() vport ops later on.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the first step in representing all OVS vports as regular
struct net_devices. Move the net_device pointer into the vport
structure itself to get rid of struct vport_netdev.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Utilize the new metadata dst to attach encapsulation instructions to
the skb. The existing egress_tun_info via the OVS_CB() is left in
place until all tunnel vports have been converted to the new method.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This factors out the device configuration out of the RTNL newlink
API which allows for in-kernel creation of VXLAN net_devices.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This add the ability to select a routing table based on the tunnel
id which allows to maintain separate routing tables for each virtual
tunnel network.
ip rule add from all tunnel-id 100 lookup 100
ip rule add from all tunnel-id 200 lookup 200
A new static key controls the collection of metadata at tunnel level
upon demand.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This introduces a new IP tunnel lightweight tunnel type which allows
to specify IP tunnel instructions per route. Only IPv4 is supported
at this point.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new flowi_tunnel structure which is a subset of ip_tunnel_key to
allow routes to match on tunnel metadata. For now, the tunnel id is
added to flowi_tunnel which allows for routes to be bound to specific
virtual tunnels.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allows putting a VXLAN device into a new flow-based mode in which
skbs with a ip_tunnel_info dst metadata attached will be encapsulated
according to the instructions stored in there with the VXLAN device
defaults taken into consideration.
Similar on the receive side, if the VXLAN_F_COLLECT_METADATA flag is
set, the packet processing will populate a ip_tunnel_info struct for
each packet received and attach it to the skb using the new metadata
dst. The metadata structure will contain the outer header and tunnel
header fields which have been stripped off. Layers further up in the
stack such as routing, tc or netfitler can later match on these fields
and perform forwarding. It is the responsibility of upper layers to
ensure that the flag is set if the metadata is needed. The flag limits
the additional cost of metadata collecting based on demand.
This prepares the VXLAN device to be steered by the routing and other
subsystems which allows to support encapsulation for a large number
of tunnel endpoints and tunnel ids through a single net_device which
improves the scalability.
It also allows for OVS to leverage this mode which in turn allows for
the removal of the OVS specific VXLAN code.
Because the skb is currently scrubed in vxlan_rcv(), the attachment of
the new dst metadata is postponed until after scrubing which requires
the temporary addition of a new member to vxlan_metadata. This member
is removed again in a later commit after the indirect VXLAN receive API
has been removed.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If output device wants to see the dst, inherit the dst of the
original skb and pass it on to generate the ARP request.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduces a new dst_metadata which enables to carry per packet metadata
between forwarding and processing elements via the skb->dst pointer.
The structure is set up to be a union. Thus, each separate type of
metadata requires its own dst instance. If demand arises to carry
multiple types of metadata concurrently, metadata dst entries can be
made stackable.
The metadata dst entry is refcnt'ed as expected for now but a non
reference counted use is possible if the reference is forced before
queueing the skb.
In order to allow allocating dsts with variable length, the existing
dst_alloc() is split into a dst_alloc() and dst_init() function. The
existing dst_init() function to initialize the subsystem is being
renamed to dst_subsys_init() to make it clear what is what.
The check before ip_route_input() is changed to ignore metadata dsts
and drop the dst inside the routing function thus allowing to interpret
metadata in a later commit.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip_route_input() unconditionally overwrites the dst. Hide the original
dst attached to the skb by calling skb_dst_set(skb, NULL) prior to
ip_route_input().
Reported-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename the tunnel metadata data structures currently internal to
OVS and make them generic for use by all IP tunnels.
Both structures are kernel internal and will stay that way. Their
members are exposed to user space through individual Netlink
attributes by OVS. It will therefore be possible to extend/modify
these structures without affecting user ABI.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implementation uses lwtunnel infrastructure to register
hooks for mpls tunnel encaps.
It picks cues from iptunnel_encaps infrastructure and previous
mpls iptunnel RFC patches from Eric W. Biederman and Robert Shearman
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>