Florian Westphal says:
====================
sched, cbq: remove OVL_STRATEGY/POLICE support
iproute2 does not implement any options that result in the
TCA_CBQ_OVL_STRATEGY/TCA_CBQ_POLICE attributes being set/used.
This series removes these two attributes from cbq and makes kernel reject
them via EOPNOTSUPP in case they are present.
The two followup changes then remove several features from qdisc
infrastructure that are then no longer used/needed. These are:
- The 'drop' method provided by most qdiscs
- the 'reshape_fail' function used by some qdiscs
- the __parent member in struct Qdisc
I tested this with allmod and allyesconfig builds and also with
a brief cbq script:
tc qdisc add dev eth0 root handle 1:0 cbq bandwidth 10Mbit avpkt 1000 cell 8
tc class add dev eth0 parent 1:0 classid 1:1 est 1sec 8sec cbq bandwidth 10Mbit rate 5Mbit prio 1 allot 1514 maxburst 20 cell 8 avpkt 1000 bounded split 1:0 defmap 3f
tc class add dev eth0 parent 1:0 classid 1:2 est 1sec 8sec cbq bandwidth 10Mbit rate 5Mbit prio 1 allot 1514 maxburst 20 cell 8 avpkt 1000 bounded split 1:0 defmap 3f
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip tos 0x10 0xff classid 1:1 police rate 2Mbit burst 10K reclassify
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip tos 0x0c 0xff classid 1:2
tc filter add dev eth0 parent 1:0 protocol ip prio 2 u32 match ip tos 0x10 0xff classid 1:2
tc filter add dev eth0 parent 1:0 protocol ip prio 3 u32 match ip tos 0x0 0x0 classid 1:2
No changes since v1 except patch #5 to fix up struct Qdisc layout.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Earlier commits removed two members from struct Qdisc which places
next_sched/gso_skb into a different cacheline than ->state.
This restores the struct layout to what it was before the removal.
Move the two members, then add an annotation so they all reside in the
same cacheline.
This adds a 16 byte hole after cpu_qstats.
The hole could be closed but as it doesn't decrease total struct size just
do it this way.
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
after removal of TCA_CBQ_OVL_STRATEGY from cbq scheduler, there are no
more callers of ->drop() outside of other ->drop functions, i.e.
nothing calls them.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the removal of TCA_CBQ_POLICE in cbq scheduler qdisc->reshape_fail
is always NULL, i.e. qdisc_rehape_fail is now the same as qdisc_drop.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
iproute2 doesn't implement any cbq option that results in this attribute
being sent to kernel.
To make use of it, user would have to
- patch iproute2
- add a class
- attach a qdisc to the class (default pfifo doesn't work as
q->handle is 0 and cbq_set_police() is a no-op in this case)
- re-'add' the same class (tc class change ...) again
- user must also specifiy a defmap (e.g. 'split 1:0 defmap 3f'), since
this 'police' feature relies on its presence
- the added qdisc must be one of bfifo, pfifo or netem
If all of these conditions are met and _some_ leaf qdiscs, namely
p/bfifo, netem, plug or tbf would drop a packet, kernel calls back into
cbq, which will attempt to re-queue the skb into a different class
as indicated by the parents' defmap entry for TC_PRIO_BESTEFFORT.
[ i.e. we behave as if tc_classify returned TC_ACT_RECLASSIFY ].
This feature, which isn't documented or implemented in iproute2,
and isn't implemented consistently (most qdiscs like sfq, codel, etc
drop right away instead of attempting this reclassification) is the
sole reason for the reshape_fail and __parent member in Qdisc struct.
So remove TCA_CBQ_POLICE support from the kernel, reject it via EOPNOTSUPP
so userspace knows we don't support it, and then remove no-longer needed
infrastructure in followup commit.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
since initial revision of cbq in 2004 iproute 2 has never implemented
support for TCA_CBQ_OVL_STRATEGY, which is what needs to be set to
activate the class->drop() call (TC_CBQ_OVL_DROP strategy must be
set by userspace value must be set by userspace).
David Miller says:
It seems really safe to kill this thing off, flag an error if someone
tries to set the attribute, and therefore kill off all of the
non-default cbq_ovl_*() functions.
A followup commit can then remove all .drop qdisc methods since this
removed the only caller.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Kabylake shows up as PCI ID 0xa171. And Kabylake-LP as 0x9d71.
Since these are similar to Skylake add these to SKL_PLUS macro
Signed-off-by: Vinod Koul <vinod.koul@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
When we need to create a new aggregate to enqueue the skb we call kzalloc.
If that fails we returned ENOBUFS without freeing the skb.
Spotted during code review.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ip6 GRE tap device should not be forced to down state to change
the mac address and should allow live address change for tap device
similar to ipv4 gre.
Signed-off-by: Shweta Choudaha <schoudah@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ip6 GRE tap device should not be forced to down state to change
the mac address and should allow live address change for tap device
similar to ipv4 gre.
Signed-off-by: Shweta Choudaha <schoudah@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski says:
====================
incremental cls_u32 hardware offload fixes
These are incremental changes from v1 of cls_u32 fixes.
First patch is reposted in its entirety, patch 2 is an
incremental change from patch 2 of the original series.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Return an error if user requested skip-sw and the underlaying
hardware cannot handle tc offloads (or offloads are disabled).
This patch fixes the knode handling.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Errors reported by u32_replace_hw_hnode() were not propagated.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
merged before 4.7rc1, plus two new fixes that have come in since then.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCgAGBQJXVd6tAAoJELXWKTbR/J7oLGEP/0Y9PXDu3UzEXyyFhNC80L2D
S+UiW0SnZvcc0uWGts75timIq1CfodZbtN2ePymTLgyDWCOyUxdE0YhTh2NjwPjU
THmDiXia2RfkKYn/wU2ahHjCPIbyGt1ryjEOc/XflvfWGbNwgeLYY3PlzfxCej3F
rJKefcNarS5RJO90/HLJJwH2ZiDlLomMIqjBLco0al7kv5jYdf1mxJ0pzISWTDk2
10G7QM9s496t0weJ2RJxhTuylelomzZZ6+RUBAoUKNaqFrEunV6f1sjWX9vQZD0E
9zMQ+bj02jKa6yyVRyjS8t0SvdbUxXMWVrd9eU0hGa4TRANaZtTRsm4/1DKvD6+5
lKlw6fDzCoWkjkJSvDEu01GvWktFszO4exLU7MDzXXMmG2CU3Mo+0lA0KynAPjaV
CmiseVgGKB1VJZXVYfrXGdYYrqpCPZD04ZARvSEL8FeEGXCp2ggoLYOfIauSys0P
AVzQymAWSrR31uO7QI7hgos0k4lxSdNrGUjD5HivlJMBH4SeEvhQ5pSTBMamnGTV
qsezZeKg68kqF/JsSUmru9rQTrULFVpyHl/6SMmBj5KKwz5oHpCEMCmoSLxfI7lf
XkC2T8JrH5AVDvrGGKZxKxhxcw2wzbt8zGmkT9mDjnUVZdPdXDWOLS4AkJ5HZ1hf
N03d++EsGS/1cTwT70kE
=52r8
-----END PGP SIGNATURE-----
Merge tag 'drm-vc4-fixes-2016-06-06' of github.com:anholt/linux into drm-fixes
This pull request brings in vblank/pageflip fixes I had hoped to see
merged before 4.7rc1, plus two new fixes that have come in since then.
* tag 'drm-vc4-fixes-2016-06-06' of github.com:anholt/linux:
drm/vc4: Make pageflip completion handling more robust.
drm/vc4: Fix ioctl permissions for render nodes.
drm/vc4: Return -EBUSY if there's already a pending flip event.
drm/vc4: Fix drm_vblank_put/get imbalance in page flip path.
drm/vc4: Fix get_vblank_counter with proper no-op for Linux 4.4+
Fixes for two issues reported by KASAN, a display engine hang due to
incorrect BIOS table parsing, and incorrect LTC interrupt handling on
Maxwell which could lead to a never-ending interrupt storm.
* 'linux-4.7' of git://github.com/skeggsb/linux:
drm/nouveau/disp/sor/gm107: training pattern registers are like gm200
drm/nouveau/disp/sor/gf119: both links use the same training register
drm/nouveau/core: swap the order of imem/fb
drm/nouveau/fbcon: fix out-of-bounds memory accesses
drm/nouveau/gr/gf100-: update sm error decoding from gk20a nvgpu headers
drm/nouveau/ltc/gm107-: fix typo in the address of NV_PLTCG_LTC0_LTS0_INTR
drm/nouveau/bios/disp: fix handling of "match any protocol" entries
Using flat regmap cache instead of RB-tree to avoid the following
lockdep warning on driver load:
WARNING: CPU: 0 PID: 1 at kernel/locking/lockdep.c:2755 lockdep_trace_alloc+0x15c/0x160()
DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
The RB-tree regmap cache needs to allocate new space on first
writes. However, allocations in an atomic context (e.g. when a
spinlock is held) are not allowed. The function regmap_write
calls map->lock, which acquires a spinlock in the fast_io case.
Since the FSL DCU driver uses MMIO, the regmap bus of type
regmap_mmio is being used which has fast_io set to true.
Use flat regmap cache and specify max register to be large
enouth to cover all registers available in LS1021a and Vybrids
register space.
Signed-off-by: Stefan Agner <stefan@agner.ch>
Cc: Mark Brown <broonie@kernel.org>
Cc: stable@vger.kernel.org
David Ahern says:
====================
net: vrf: Improve use of FIB rules
Currently, VRFs require 1 oif and 1 iif rule per address family per
VRF. As the number of VRF devices increases it brings scalability
issues with the increasing rule list. All of the VRF rules have the
same format with the exception of the specific table id to direct the
lookup. Since the table id is available from the oif or iif in the
loopup, the VRF rules can be consolidated to a single rule that pulls
the table from the VRF device.
This solution still allows a user to insert their own rules for VRFs,
including rules with additional attributes. Accordingly, it is backwards
compatible with existing setups and allows other policy routing as
desired.
Hopefully v5 is the charm; my e-waste can is getting full.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add l3mdev rule per address family when the first VRF device is
created. The rules are installed with a default preference of 1000.
Users can replace the default rule as desired.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, VRFs require 1 oif and 1 iif rule per address family per
VRF. As the number of VRF devices increases it brings scalability
issues with the increasing rule list. All of the VRF rules have the
same format with the exception of the specific table id to direct the
lookup. Since the table id is available from the oif or iif in the
loopup, the VRF rules can be consolidated to a single rule that pulls
the table from the VRF device.
This patch introduces a new rule attribute l3mdev. The l3mdev rule
means the table id used for the lookup is pulled from the L3 master
device (e.g., VRF) rather than being statically defined. With the
l3mdev rule all of the basic VRF FIB rules are reduced to 1 l3mdev
rule per address family (IPv4 and IPv6).
If an admin wishes to insert higher priority rules for specific VRFs
those rules will co-exist with the l3mdev rule. This capability means
current VRF scripts will co-exist with this new simpler implementation.
Currently, the rules list for both ipv4 and ipv6 look like this:
$ ip ru ls
1000: from all oif vrf1 lookup 1001
1000: from all iif vrf1 lookup 1001
1000: from all oif vrf2 lookup 1002
1000: from all iif vrf2 lookup 1002
1000: from all oif vrf3 lookup 1003
1000: from all iif vrf3 lookup 1003
1000: from all oif vrf4 lookup 1004
1000: from all iif vrf4 lookup 1004
1000: from all oif vrf5 lookup 1005
1000: from all iif vrf5 lookup 1005
1000: from all oif vrf6 lookup 1006
1000: from all iif vrf6 lookup 1006
1000: from all oif vrf7 lookup 1007
1000: from all iif vrf7 lookup 1007
1000: from all oif vrf8 lookup 1008
1000: from all iif vrf8 lookup 1008
...
32765: from all lookup local
32766: from all lookup main
32767: from all lookup default
With the l3mdev rule the list is just the following regardless of the
number of VRFs:
$ ip ru ls
1000: from all lookup [l3mdev table]
32765: from all lookup local
32766: from all lookup main
32767: from all lookup default
(Note: the above pretty print of the rule is based on an iproute2
prototype. Actual verbage may change)
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy says:
====================
tipc: two small fixes
We fix a couple of rarely seen anomalies discovered during testing.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The node keepalive interval is recalculated at each timer expiration
to catch any changes in the link tolerance, and stored in a field in
struct tipc_node. We use jiffies as unit for the stored value.
This is suboptimal, because it makes the calculation unnecessary
complex, including two unit conversions. The conversions also lead to
a rounding error that causes the link "abort limit" to be 3 in the
normal case, instead of 4, as intended. This again leads to unnecessary
link resets when the network is pushed close to its limit, e.g., in an
environment with hundreds of nodes or namesapces.
In this commit, we do instead let the keepalive value be calculated and
stored in milliseconds, so that there is only one conversion and the
rounding error is eliminated.
We also remove a redundant "keepalive" field in struct tipc_link. This
is remnant from the previous implementation.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit 88e8ac7000 ("tipc: reduce transmission rate of reset messages
when link is down") revealed a flaw in the node FSM, as defined in
the log of commit 66996b6c47 ("tipc: extend node FSM").
We see the following scenario:
1: Node B receives a RESET message from node A before its link endpoint
is fully up, i.e., the node FSM is in state SELF_UP_PEER_COMING. This
event will not change the node FSM state, but the (distinct) link FSM
will move to state RESETTING.
2: As an effect of the previous event, the local endpoint on B will
declare node A lost, and post the event SELF_DOWN to the its node
FSM. This moves the FSM state to SELF_DOWN_PEER_LEAVING, meaning
that no messages will be accepted from A until it receives another
RESET message that confirms that A's endpoint has been reset. This
is wasteful, since we know this as a fact already from the first
received RESET, but worse is that the link instance's FSM has not
wasted this information, but instead moved on to state ESTABLISHING,
meaning that it repeatedly sends out ACTIVATE messages to the reset
peer A.
3: Node A will receive one of the ACTIVATE messages, move its link FSM
to state ESTABLISHED, and start repeatedly sending out STATE messages
to node B.
4: Node B will consistently drop these messages, since it can only accept
accept a RESET according to its node FSM.
5: After four lost STATE messages node A will reset its link and start
repeatedly sending out RESET messages to B.
6: Because of the reduced send rate for RESET messages, it is very
likely that A will receive an ACTIVATE (which is sent out at a much
higher frequency) before it gets the chance to send a RESET, and A
may hence quickly move back to state ESTABLISHED and continue sending
out STATE messages, which will again be dropped by B.
7: GOTO 5.
8: After having repeated the cycle 5-7 a number of times, node A will
by chance get in between with sending a RESET, and the situation is
resolved.
Unfortunately, we have seen that it may take a substantial amount of
time before this vicious loop is broken, sometimes in the order of
minutes.
We correct this by making a small correction to the node FSM: When a
node in state SELF_UP_PEER_COMING receives a SELF_DOWN event, it now
moves directly back to state SELF_DOWN_PEER_DOWN, instead of as now
SELF_DOWN_PEER_LEAVING. This is logically consistent, since we don't
need to wait for RESET confirmation from of an endpoint that we alread
know has been reset. It also means that node B in the scenario above
will not be dropping incoming STATE messages, and the link can come up
immediately.
Finally, a symmetry comparison reveals that the FSM has a similar
error when receiving the event PEER_DOWN in state PEER_UP_SELF_COMING.
Instead of moving to PERR_DOWN_SELF_LEAVING, it should move directly
to SELF_DOWN_PEER_DOWN. Although we have never seen any negative effect
of this logical error, we choose fix this one, too.
The node FSM looks as follows after those changes:
+----------------------------------------+
| PEER_DOWN_EVT|
| |
+------------------------+----------------+ |
|SELF_DOWN_EVT | | |
| | | |
| +-----------+ +-----------+ |
| |NODE_ | |NODE_ | |
| +----------|FAILINGOVER|<---------|SYNCHING |-----------+ |
| |SELF_ +-----------+ FAILOVER_+-----------+ PEER_ | |
| |DOWN_EVT | A BEGIN_EVT A | DOWN_EVT| |
| | | | | | | |
| | | | | | | |
| | |FAILOVER_ |FAILOVER_ |SYNCH_ |SYNCH_ | |
| | |END_EVT |BEGIN_EVT |BEGIN_EVT|END_EVT | |
| | | | | | | |
| | | | | | | |
| | | +--------------+ | | |
| | +-------->| SELF_UP_ |<-------+ | |
| | +-----------------| PEER_UP |----------------+ | |
| | |SELF_DOWN_EVT +--------------+ PEER_DOWN_EVT| | |
| | | A A | | |
| | | | | | | |
| | | PEER_UP_EVT| |SELF_UP_EVT | | |
| | | | | | | |
V V V | | V V V
+------------+ +-----------+ +-----------+ +------------+
|SELF_DOWN_ | |SELF_UP_ | |PEER_UP_ | |PEER_DOWN |
|PEER_LEAVING| |PEER_COMING| |SELF_COMING| |SELF_LEAVING|
+------------+ +-----------+ +-----------+ +------------+
| | A A | |
| | | | | |
| SELF_ | |SELF_ |PEER_ |PEER_ |
| DOWN_EVT| |UP_EVT |UP_EVT |DOWN_EVT |
| | | | | |
| | | | | |
| | +--------------+ | |
|PEER_DOWN_EVT +--->| SELF_DOWN_ |<---+ SELF_DOWN_EVT|
+------------------->| PEER_DOWN |<--------------------+
+--------------+
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli says:
====================
net: dsa: misc improvements
This patch series builds on top of Andrew's "New DSA bind, switches as devices"
patch set and does the following:
- add a few helper functions/goodies for net/dsa/dsa2.c to be as close as possible
from net/dsa/dsa.c in terms of what drivers can expect, in particular the slave
MDIO bus and the enabled_port_mask and phy_mii_mask
- fix the CPU port ethtools ops to work in a multiple tree setup since we can
no longer assume a single tree is supported
- make the bcm_sf2 driver register its own MDIO bus, yet assign it to
ds->slave_mii_bus for everything to work in net/dsa/slave.c wrt. PHY probing,
this is a tad cleaner than what we have now
Changes in v2:
Most of the previous patches have been dropped to just keep the relevant ones
now.
Changes in v3:
- split the addition of the slave MII bus as a separate patch
- properly unwind all operations at the right place and right time (ethtool ops,
slave MDIO bus
- fixed a few typos here and there
Changes in v4:
- removed superfluous dst agrument to dsa_cpu_port_ethtool_{setup,restore}
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Register a slave MDIO bus which allows us to divert problematic
read/writes towards conflicting pseudo-PHY address (30). Do no longer
rely on DSA's slave_mii_bus, but instead provide our own implementation
which offers more flexibility as to what to do, and when to register it.
We need to register it by the time we are able to get access to our
memory mapped registers, which is not until drv->setup() time. In order
to avoid forward declarations, we need to re-order the function bodies a
bit.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we can properly support multiple distinct trees in the system,
using a global variable: dsa_cpu_port_ethtool_ops is getting clobbered
as soon as the second switch tree gets probed, and we don't want that.
We need to move this to be dynamically allocated, and since we can't
really be comparing addresses anymore to determine first time
initialization versus any other times, just move this to dsa.c and
dsa2.c where the remainder of the dst/ds initialization happens.
The operations teardown restores the master netdev's ethtool_ops to its
original ethtool_ops pointer (typically within the Ethernet driver)
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a helper function: dsa_cpu_port_ethtool_init() which initializes a
custom ethtool_ops structure with custom DSA ethtool operations for CPU
ports. This is a preliminary change to move the initialization outside
of net/dsa/slave.c.
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mimic what net/dsa/dsa.c does and provide a slave MII bus by default
which will be created if the driver implements a phy_read method.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some drivers rely on these two bitmasks to contain the correct values
for them to successfully probe and initialize at drv->setup() time,
calculate correct values to put in both masks as early as possible in
dsa_get_ports_dn().
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In case we have multiples trees and switches with the same index, we
need to add another discriminating id: the switch tree.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"make htmldocs" complains otherwise:
.//net/core/gen_stats.c:168: warning: No description found for parameter 'running'
.//include/linux/netdevice.h:1867: warning: No description found for parameter 'qdisc_running_key'
Fixes: f9eb8aea2a ("net_sched: transform qdisc running bit into a seqcount")
Fixes: edb09eb17e ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7000-series SFC NICs connected with an SFP+ module currently fail to
report any supported link speeds.
Reported-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Bert Kenward <bkenward@solarflare.com>
Reviewed-by: Jarod Wilson <jarod@redhat.com>
Tested-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"make htmldocs" complains otherwise:
.//net/core/gen_stats.c:65: warning: No description found for parameter 'padattr'
.//net/core/gen_stats.c:101: warning: No description found for parameter 'padattr'
Fixes: 9854518ea0 ("sched: align nlattr properly when needed")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At present we perform an xfrm_lookup() for each UDPv6 message we
send. The lookup involves querying the flow cache (flow_cache_lookup)
and, in case of a cache miss, creating an XFRM bundle.
If we miss the flow cache, we can end up creating a new bundle and
deriving the path MTU (xfrm_init_pmtu) from on an already transformed
dst_entry, which we pass from the socket cache (sk->sk_dst_cache) down
to xfrm_lookup(). This can happen only if we're caching the dst_entry
in the socket, that is when we're using a connected UDP socket.
To put it another way, the path MTU shrinks each time we miss the flow
cache, which later on leads to incorrectly fragmented payload. It can
be observed with ESPv6 in transport mode:
1) Set up a transformation and lower the MTU to trigger fragmentation
# ip xfrm policy add dir out src ::1 dst ::1 \
tmpl src ::1 dst ::1 proto esp spi 1
# ip xfrm state add src ::1 dst ::1 \
proto esp spi 1 enc 'aes' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b
# ip link set dev lo mtu 1500
2) Monitor the packet flow and set up an UDP sink
# tcpdump -ni lo -ttt &
# socat udp6-listen:12345,fork /dev/null &
3) Send a datagram that needs fragmentation with a connected socket
# perl -e 'print "@" x 1470 | socat - udp6:[::1]:12345
2016/06/07 18:52:52 socat[724] E read(3, 0x555bb3d5ba00, 8192): Protocol error
00:00:00.000000 IP6 ::1 > ::1: frag (0|1448) ESP(spi=0x00000001,seq=0x2), length 1448
00:00:00.000014 IP6 ::1 > ::1: frag (1448|32)
00:00:00.000050 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x3), length 1272
(^ ICMPv6 Parameter Problem)
00:00:00.000022 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x5), length 136
4) Compare it to a non-connected socket
# perl -e 'print "@" x 1500' | socat - udp6-sendto:[::1]:12345
00:00:40.535488 IP6 ::1 > ::1: frag (0|1448) ESP(spi=0x00000001,seq=0x6), length 1448
00:00:00.000010 IP6 ::1 > ::1: frag (1448|64)
What happens in step (3) is:
1) when connecting the socket in __ip6_datagram_connect(), we
perform an XFRM lookup, miss the flow cache, create an XFRM
bundle, and cache the destination,
2) afterwards, when sending the datagram, we perform an XFRM lookup,
again, miss the flow cache (due to mismatch of flowi6_iif and
flowi6_oif, which is an issue of its own), and recreate an XFRM
bundle based on the cached (and already transformed) destination.
To prevent the recreation of an XFRM bundle, avoid an XFRM lookup
altogether whenever we already have a destination entry cached in the
socket. This prevents the path MTU shrinkage and brings us on par with
UDPv4.
The fix also benefits connected PINGv6 sockets, another user of
ip6_sk_dst_lookup_flow(), who also suffer messages being transformed
twice.
Joint work with Hannes Frederic Sowa.
Reported-by: Jan Tluka <jtluka@redhat.com>
Signed-off-by: Jakub Sitnicki <jkbs@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When in kdump kernel, reduce memory usage by only using a single Queue
Set for multiqueue devices. So make netif_get_num_default_rss_queues()
return one, when in kdump kernel.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sudarsana Reddy Kalluru says:
====================
qed/qede support for dcbnl.
This series adds the dcbnl functionality to the driver. Patch (1) adds
the qed infrastucture for querying/configuring the dcbx parameters.
Patch (2) adds the qed infrastructure for dcbnl APIs. And patch (3)
adds the qede support for dcbnl.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the interfaces for ieee/cee dcbnl callbacks and registers
them with the kernel.
Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the implementation for both cee/ieee dcbnl callbacks by
using the qed query/config APIs.
Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Query API reads the dcbx data from the device shared memory and return it
to the caller. The config API configures the user provided dcbx values on
the device, and initiates the dcbx negotiation with the peer.
Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The CONFIG_ prefix should only be used for options which
can be configured through Kconfig and not for guarding headers.
Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The CONFIG_ prefix should only be used for options which
can be configured through Kconfig and not for guarding headers.
Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
When setting up ILA in a router we noticed that the the encapsulation
is invoked twice: once in the route input path and again upon route
output. To resolve this we add a flag set_csum_neutral for the
ila_update_ipv6_locator. If this flag is set and the checksum
neutral bit is also set we assume that checksum-neutral translation
has already been performed and take no further action. The
flag is set only in ila_output path. The flag is not set for ila_input and
ila_xlat.
Tested:
Used 3 netns to set to emulate a router and two hosts. The router
translates SIR addresses between the two destinations in other two netns.
Verified ping and netperf are functional.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The of_find_net_device_by_node() function is defined in
<linux/of_net.h> but not included in the .c file that
implements it. Fix the following warning by including the
header:
net/core/net-sysfs.c:1494:19: warning: symbol 'of_find_net_device_by_node' was not declared. Should it be static?
Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 5961 advises to only accept RST packets containing a seq number
matching the next expected seq number instead of the whole receive
window in order to avoid spoofing attacks.
However, this situation is not optimal in the case SACK is in use at the
time the RST is sent. I recently run into a scenario in which packet
losses were high while uploading data to a server, and userspace was
willing to frequently terminate connections by sending a RST. In
this case, the ACK sent on the receiver side (rcv_nxt) is frozen waiting
for a lost packet retransmission and SACK blocks are used to let the
client continue uploading data. At some point later on, the client sends
the RST (snd_nxt), which matches the next expected seq number of the
right-most SACK block on the receiver side which is going forward
receiving data.
In this scenario, as RFC 5961 defines, the RST SEQ doesn't match the
frozen main ACK at receiver side and thus gets dropped and a challenge
ACK is sent, which gets usually lost due to network conditions. The main
consequence is that the connection stays alive for a while even if it
made sense to accept the RST. This can get really bad if lots of
connections like this one are created in few seconds, allocating all the
resources of the server easily.
For security reasons, not all SACK blocks are checked (there could be a
big amount of SACK blocks => acceptable SEQ numbers). Furthermore, it
wouldn't make sense to check for RST in blocks other than the right-most
received one because the sender is not expected to be sending new data
after the RST. For simplicity, only up to the 4 most recently updated
SACK blocks (selective_acks[4] field) are compared to find the
right-most block, as usually those are the ones with bigger probability
to contain it.
This patch was tested in a 3.18 kernel and probed to improve the
situation in the scenario described above.
Signed-off-by: Pau Espin Pedrol <pau.espin@tessares.net>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the current code "ent_per_page" could be more than "conn_num" making
"conn_num" negative after the subtraction. In the next iteration
through the loop then the negative is treated as a very high positive
meaning we don't put a limit on "ent_num". It could lead to memory
corruption.
Fixes: dbb799c397 ('qed: Initialize hardware for new protocols')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The missing br_vlan_should_use() test caused creation of an unneeded
local fdb entry on changing mac address of a bridge device when there is
a vlan which is configured on a bridge port but not on the bridge
device.
Fixes: 2594e9064a ("bridge: vlan: add per-vlan struct and move to rhashtables")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern says:
====================
net: vrf: Add support for local traffic to local addresses
Add support for locally originated traffic to VRF-local addresses,
be it addresses on enslaved devices or addresses on the VRF device:
$ ip addr show dev red
33: red: <NOARP,MASTER,UP,LOWER_UP> mtu 65536 qdisc pfifo_fast state UP group default qlen 1000
link/ether be:00:53:b5:e4:25 brd ff:ff:ff:ff:ff:ff
inet 1.1.1.1/32 scope global red
valid_lft forever preferred_lft forever
inet6 1111:1::1/128 scope global
valid_lft forever preferred_lft forever
$ ip addr show dev eth1
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
link/ether 02:e0:f9:79:34:bd brd ff:ff:ff:ff:ff:ff
inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1
valid_lft forever preferred_lft forever
inet6 2100:1::1/120 scope global
valid_lft forever preferred_lft forever
inet6 fe80::e0:f9ff:fe79:34bd/64 scope link
valid_lft forever preferred_lft forever
$ ping -c1 -I red 10.100.1.1
ping: Warning: source address might be selected on device other than red.
PING 10.100.1.1 (10.100.1.1) from 10.100.1.1 red: 56(84) bytes of data.
64 bytes from 10.100.1.1: icmp_seq=1 ttl=64 time=0.057 ms
$ ping -c1 -I red 1.1.1.1
PING 1.1.1.1 (1.1.1.1) from 1.1.1.1 red: 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=64 time=0.136 ms
--- 1.1.1.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.136/0.136/0.136/0.000 ms
$ ping6 -c1 -I red 2100:1::1
ping6: Warning: source address might be selected on device other than red.
PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes
64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.167 ms
--- 2100:1::1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.167/0.167/0.167/0.000 ms
$ ping6 -c1 -I red 1111::1
PING 1111::1(1111::1) from 1111:1::1 red: 56 data bytes
64 bytes from 1111::1: icmp_seq=1 ttl=64 time=0.187 ms
--- 1111::1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.187/0.187/0.187/0.000 ms
This change also enables use of loopback address on the VRF device:
$ ip addr add dev red 127.0.0.1/8
$ ping -c1 -I red 127.0.0.1
PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 red: 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.058 ms
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for locally originated traffic to VRF-local IPv6 addresses.
Similar to IPv4 a local dst is set on the skb and the packet is
reinserted with a call to netif_rx. With this patch, ping, tcp and udp
packets to a local IPv6 address are successfully routed:
$ ip addr show dev eth1
4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff
inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1
valid_lft forever preferred_lft forever
inet6 2100:1::1/120 scope global
valid_lft forever preferred_lft forever
inet6 fe80::e0:f9ff:fe1c:b974/64 scope link
valid_lft forever preferred_lft forever
$ ping6 -c1 -I red 2100:1::1
ping6: Warning: source address might be selected on device other than red.
PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes
64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.098 ms
ip6_input is exported so the VRF driver can use it for the dst input
function. The dst_alloc function for IPv4 defaults to setting the input and
output functions; IPv6's does not. VRF does not need to duplicate the Rx path
so just export the ipv6 input function.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>