Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2016-10-25
Just a leftover from the last development cycle.
1) Remove some unused code, from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
No one can see these events, because a network namespace can not be
destroyed, if it has sockets.
Unlike other devices, uevent-s for network devices are generated
only inside their network namespaces. They are filtered in
kobj_bcast_filter()
My experiments shows that net namespaces are destroyed more 30% faster
with this optimization.
Here is a perf output for destroying network namespaces without this
patch.
- 94.76% 0.02% kworker/u48:1 [kernel.kallsyms] [k] cleanup_net
- 94.74% cleanup_net
- 94.64% ops_exit_list.isra.4
- 41.61% default_device_exit_batch
- 41.47% unregister_netdevice_many
- rollback_registered_many
- 40.36% netdev_unregister_kobject
- 14.55% device_del
+ 13.71% kobject_uevent
- 13.04% netdev_queue_update_kobjects
+ 12.96% kobject_put
- 12.72% net_rx_queue_update_kobjects
kobject_put
- kobject_release
+ 12.69% kobject_uevent
+ 0.80% call_netdevice_notifiers_info
+ 19.57% nfsd_exit_net
+ 11.15% tcp_net_metrics_exit
+ 8.25% rpcsec_gss_exit_net
It's very critical to optimize the exit path for network namespaces,
because they are destroyed under net_mutex and many namespaces can be
destroyed for one iteration.
v2: use dev_set_uevent_suppress()
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: d894be57ca92('ethernet: use net core MTU range checking in more drivers')
CC: Jarod Wilson <jarod@redhat.com>
CC: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Acked-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg says:
====================
genetlink improvements
This series contains some generic netlink improvements, making
the API safer to use, and making the function pointers in the
family struct safer by allowing it to be __ro_after_init.
The first patch, introducing genl_family_attrbuf(), just ensures
that the users of family->attrbuf aren't actually racy, but making
them use the indirection function for obtaining a reference and
checking that the context can actually do so.
The second patch removes the more or less broken ability to have
a static family ID, the three IDs that need to be static because
it's simply needed (genl controller), or due to old API misused.
Everything else couldn't be static anyway, or could fail when the
family is registered, if somebody else already got a static ID.
The third patch statically initializes the families, mostly to save
some code. I wrote this initially because I thought I could make
them all const, but that ends up being very inefficient (it would
require always doing some kind of family -> id lookup), so now it's
just here because I had it already and it reduces the code size.
The fourth patch then, finally, lays the groundwork for what I had
really wanted - now with __ro_after_init instead of const; I remove
code there to do the ID->family hash table mapping in genetlink and
use IDR instead to both allocate and map the IDs, which again ends
up saving some code size.
Finally, the fifth patch updates all families, as it turns out, no
families exist that really dynamically register/unregister. This
last patch should perhaps be split up, I could submit it for each
subsystem separately, but it'd depend on the second and third to
go in first, so would take a while. I can do that though, if that
seems better to you.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Now genl_register_family() is the only thing (other than the
users themselves, perhaps, but I didn't find any doing that)
writing to the family struct.
In all families that I found, genl_register_family() is only
called from __init functions (some indirectly, in which case
I've add __init annotations to clarifly things), so all can
actually be marked __ro_after_init.
This protects the data structure from accidental corruption.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since generic netlink family IDs are small integers, allocated
densely, IDR is an ideal match for lookups. Replace the existing
hand-written hash-table with IDR for allocation and lookup.
This lets the families only be written to once, during register,
since the list_head can be removed and removal of a family won't
cause any writes.
It also slightly reduces the code size (by about 1.3k on x86-64).
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of providing macros/inline functions to initialize
the families, make all users initialize them statically and
get rid of the macros.
This reduces the kernel code size by about 1.6k on x86-64
(with allyesconfig).
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Static family IDs have never really been used, the only
use case was the workaround I introduced for those users
that assumed their family ID was also their multicast
group ID.
Additionally, because static family IDs would never be
reserved by the generic netlink code, using a relatively
low ID would only work for built-in families that can be
registered immediately after generic netlink is started,
which is basically only the control family (apart from
the workaround code, which I also had to add code for so
it would reserve those IDs)
Thus, anything other than GENL_ID_GENERATE is flawed and
luckily not used except in the cases I mentioned. Move
those workarounds into a few lines of code, and then get
rid of GENL_ID_GENERATE entirely, making it more robust.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This helper function allows family implementations to access
their family's attrbuf. This gets rid of the attrbuf usage
in families, and also adds locking validation, since it's not
valid to use the attrbuf with parallel_ops or outside of the
dumpit callback.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The user may want to use only some bits of the skb mark in
his skbedit rules because the remaining part might be used by
something else.
Introduce the "mask" parameter to the skbedit actor in order
to implement such functionality.
When the mask is specified, only those bits selected by the
latter are altered really changed by the actor, while the
rest is left untouched.
Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a port_type_set() is been called and the new port type set is the same
as the old one, just return success.
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
firewire-net, like the older eth1394 driver, reduced the initial MTU to
less than 1500 octets if the local link layer controller's asynchronous
packet reception limit was lower.
This is bogus, since this reception limit does not have anything to do
with the transmission limit. Neither did this reduction affect the TX
path positively, nor could it prevent link fragmentation at the RX path.
Many FireWire CardBus cards have a max_rec of 9, causing an initial MTU
of 1024 - 16 = 1008. RFC 2734 and RFC 3146 allow a minimum max_rec = 8,
which would result in an initial MTU of 512 - 16 = 496. On such cards,
IPv6 could only be employed if the MTU was manually increased to 1280 or
more, i.e. IPv6 would not work without intervention from userland.
We now always initialize the MTU to 1500, which is the default according
to RFC 2734 and RFC 3146.
On a VIA VT6316 based CardBus card which was affected by this, changing
the MTU from 1008 to 1500 also increases TX bandwidth by 6 %.
RX remains unaffected.
CC: netdev@vger.kernel.org
CC: linux1394-devel@lists.sourceforge.net
CC: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit b3e3893e12 ("net: use core MTU range checking in misc drivers")
mistakenly introduced an upper limit for firewire-net's MTU based on the
local link layer controller's reception capability. Revert this. Neither
RFC 2734 nor our implementation impose any particular upper limit.
Actually, to be on the safe side and to make the code explicit, set
ETH_MAX_MTU = 65535 as upper limit now.
(I replaced sizeof(struct rfc2734_header) by the equivalent
RFC2374_FRAG_HDR_SIZE in order to avoid distracting long/int conversions.)
Fixes: b3e3893e1253('net: use core MTU range checking in misc drivers')
CC: netdev@vger.kernel.org
CC: linux1394-devel@lists.sourceforge.net
CC: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Acked-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This node pointer is returned by of_get_child_by_name() with refcount
incremented in this function. of_node_put() on it before exitting this
function.
This is detected by Coccinelle semantic patch.
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use setup_timer() instead of init_timer(), being the preferred/standard
way to set a timer up.
Also, quoting the mod_timer() function comment:
-> mod_timer() is a more efficient way to update the expire field of an
active timer (if the timer is inactive it will be activated).
Use setup_timer and mod_timer to setup and arm a timer, to make the code
cleaner and easier to read.
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix to return error code -ENODEV from the DMA is not supported error
handling case instead of 0, as done elsewhere in this function.
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not allowed to call kfree_skb() from hardware interrupt
context or with interrupts being disabled, spin_lock_irqsave()
make sure always in irq disable context. So the kfree_skb()
should be replaced with dev_kfree_skb_irq().
This is detected by Coccinelle semantic patch.
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix to return error code -EINVAL from the error handling
case instead of 0, as done elsewhere in this function.
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use setup_timer function instead of initializing timer with the function
and data fields.
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's not necessary to free memory allocated with devm_kzalloc in the
remove path and using kfree leads to a double free.
Fixes: 84640e27f2 ("net: netcp: Add Keystone NetCP core ethernet
driver")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The maximum MTU is defined via the slave devices of an batman-adv
interface. Thus it is not possible to calculate the max_mtu during the
creation of the batman-adv device when no slave devices are attached. Doing
so would for example break non-fragmentation setups which then
(incorrectly) allow an MTU of 1500 even when underlying device cannot
transport 1500 bytes + batman-adv headers.
Checking the dynamically calculated max_mtu via the minimum of the slave
devices MTU during .ndo_change_mtu is also used by the bridge interface.
Cc: Jarod Wilson <jarod@redhat.com>
Fixes: b3e3893e12 ("net: use core MTU range checking in misc drivers")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Xo Wang says:
====================
Broadcom BCM54612E support
This series is based on tip of torvalds/master.
The first patch adds register definitions from Broadcom docs.
The second patch adds the BCM54612E PHY ID, flags, and device-specific
RGMII internal delay initialization.
I tested on a custom board with an Aspeed AST2500 SOC with its second
MAC connected to this PHY.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This PHY has internal delays enabled after reset. This clears the
internal delay enables unless the interface specifically requests them.
Signed-off-by: Xo Wang <xow@google.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the RXD-to-RXC skew (delay) time bit in the Miscellaneous Control
shadow register and a mask for the shadow selector field.
Remove a re-definition of MII_BCM54XX_AUXCTL_SHDWSEL_AUXCTL.
Signed-off-by: Xo Wang <xow@google.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When I prepared commit d250a5f90e ("pkt_sched: gen_estimator: Dont
report fake rate estimators"), htb still had an implicit rate estimator
for all its classes.
Then later, I made this rate estimator optional in commit 64153ce0a7
("net_sched: htb: do not setup default rate estimators"), but I forgot
to update htb use of gnet_stats_copy_rate_est()
After this patch, "tc -s qdisc ..." no longer report fake rate
estimators for HTB classes.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In criu we are actively using diag interface to collect sockets
present in the system when dumping applications. And while for
unix, tcp, udp[lite], packet, netlink it works as expected,
the raw sockets do not have. Thus add it.
v2:
- add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@)
- implement @destroy for diag requests (by dsa@)
v3:
- add export of raw_abort for IPv6 (by dsa@)
- pass net-admin flag into inet_sk_diag_fill due to
changes in net-next branch (by dsa@)
v4:
- use @pad in struct inet_diag_req_v2 for raw socket
protocol specification: raw module carries sockets
which may have custom protocol passed from socket()
syscall and sole @sdiag_protocol is not enough to
match underlied ones
- start reporting protocol specifed in socket() call
when sockets are raw ones for the same reason: user
space tools like ss may parse this attribute and use
it for socket matching
v5 (by eric.dumazet@):
- use sock_hold in raw_sock_get instead of atomic_inc,
we're holding (raw_v4_hashinfo|raw_v6_hashinfo)->lock
when looking up so counter won't be zero here.
v6:
- use sdiag_raw_protocol() helper which will access @pad
structure used for raw sockets protocol specification:
we can't simply rename this member without breaking uapi
v7:
- sine sdiag_raw_protocol() helper is not suitable for
uapi lets rather make an alias structure with proper
names. __check_inet_diag_req_raw helper will catch
if any of structure unintentionally changed.
CC: David S. Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: David Ahern <dsa@cumulusnetworks.com>
CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
CC: James Morris <jmorris@namei.org>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
CC: Patrick McHardy <kaber@trash.net>
CC: Andrey Vagin <avagin@openvz.org>
CC: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The field is initialized by ILA and MPLS but never used. Remove it.
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
net_mutex can be locked for a long time. It may be because many
namespaces are being destroyed or many processes decide to create
a network namespace.
Both these operations are heavy, so it is better to have an ability to
kill a process which is waiting net_mutex.
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
META_COLLECTOR int_vlan_tag() assumes that if the accel tag (vlan_tci)
is zero, then no vlan accel tag is present.
This is incorrect for zero VID vlan accel packets, making the following
match fail:
tc filter add ... basic match 'meta(vlan mask 0xfff eq 0)' ...
Apparently 'int_vlan_tag' was implemented prior VLAN_TAG_PRESENT was
introduced in 05423b2 "vlan: allow null VLAN ID to be used"
(and at time introduced, the 'vlan_tx_tag_get' call in em_meta was not
adapted).
Fix, testing skb_vlan_tag_present instead of testing skb_vlan_tag_get's
value.
Fixes: 05423b2413 ("vlan: allow null VLAN ID to be used")
Fixes: 1a31f2042e ("netsched: Allow meta match on vlan tag on receive")
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko says:
====================
mlxsw: Driver update
Mostly cosmetics and small resource values management rewrite.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the number of resources is going to get much bigger, ease up the
addition by simly defining IDs. Convert the existing structure members
to a set array, one for validity, one for values. Introduce a set of
getters and setters for easy access.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Push cmd resource query related defines to cmd.h where they belong.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend the MLXSW_REG_DEFINE macro to store register name in string form.
Use this string later on instead of hard coded string values.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Save some code and also prepare to easily carry name in string form.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enforce const for getter buf args.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These should be const, so enforce it.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
Add BPF numa id helper
This patch set adds a helper for retrieving current numa node
id and a test case for SO_REUSEPORT.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Use case is mainly for soreuseport to select sockets for the local
numa node, but since generic, lets also add this for other networking
and tracing program types.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni says:
====================
udp: refactor memory accounting
This patch series refactor the udp memory accounting, replacing the
generic implementation with a custom one, in order to remove the needs for
locking the socket on the enqueue and dequeue operations. The socket backlog
usage is dropped, as well.
The first patch factor out pieces of some queue and memory management
socket helpers, so that they can later be used by the udp memory accounting
functions.
The second patch adds the memory account helpers, without using them.
The third patch replacse the old rx memory accounting path for udp over ipv4 and
udp over ipv6. In kernel UDP users are updated, as well.
The memory accounting schema is described in detail in the individual patch
commit message.
The performance gain depends on the specific scenario; with few flows (and
little contention in the original code) the differences are in the noise range,
while with several flows contending the same socket, the measured speed-up
is relevant (e.g. even over 100% in case of extreme contention)
Many thanks to Eric Dumazet for the reiterated reviews and suggestions.
v5 -> v6:
- do not orphan the skb on enqueue, skb_steal_sock() already did
the work for us
v4 -> v5:
- use the receive queue spin lock to protect the memory accounting
- several minor clean-up
v3 -> v4:
- simplified the locking schema, always use a plain spinlock
v2 -> v3:
- do not set the now unsed backlog_rcv callback
v1 -> v2:
- changed slighly the memory accounting schema, we now perform lazy reclaim
- fixed forward_alloc updating issue
- fixed memory counter integer overflows
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Completely avoid default sock memory accounting and replace it
with udp-specific accounting.
Since the new memory accounting model encapsulates completely
the required locking, remove the socket lock on both enqueue and
dequeue, and avoid using the backlog on enqueue.
Be sure to clean-up rx queue memory on socket destruction, using
udp its own sk_destruct.
Tested using pktgen with random src port, 64 bytes packet,
wire-speed on a 10G link as sender and udp_sink as the receiver,
using an l4 tuple rxhash to stress the contention, and one or more
udp_sink instances with reuseport.
nr readers Kpps (vanilla) Kpps (patched)
1 170 440
3 1250 2150
6 3000 3650
9 4200 4450
12 5700 6250
v4 -> v5:
- avoid unneeded test in first_packet_length
v3 -> v4:
- remove useless sk_rcvqueues_full() call
v2 -> v3:
- do not set the now unsed backlog_rcv callback
v1 -> v2:
- add memory pressure support
- fixed dropwatch accounting for ipv6
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Avoid using the generic helpers.
Use the receive queue spin lock to protect the memory
accounting operation, both on enqueue and on dequeue.
On dequeue perform partial memory reclaiming, trying to
leave a quantum of forward allocated memory.
On enqueue use a custom helper, to allow some optimizations:
- use a plain spin_lock() variant instead of the slightly
costly spin_lock_irqsave(),
- avoid dst_force check, since the calling code has already
dropped the skb dst
- avoid orphaning the skb, since skb_steal_sock() already did
the work for us
The above needs custom memory reclaiming on shutdown, provided
by the udp_destruct_sock().
v5 -> v6:
- don't orphan the skb on enqueue
v4 -> v5:
- replace the mem_lock with the receive queue spin lock
- ensure that the bh is always allowed to enqueue at least
a skb, even if sk_rcvbuf is exceeded
v3 -> v4:
- reworked memory accunting, simplifying the schema
- provide an helper for both memory scheduling and enqueuing
v1 -> v2:
- use a udp specific destrctor to perform memory reclaiming
- remove a couple of helpers, unneeded after the above cleanup
- do not reclaim memory on dequeue if not under memory
pressure
- reworked the fwd accounting schema to avoid potential
integer overflow
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Basic sock operations that udp code can use with its own
memory accounting schema. No functional change is introduced
in the existing APIs.
v4 -> v5:
- avoid whitespace changes
v2 -> v4:
- avoid exporting __sock_enqueue_skb
v1 -> v2:
- avoid export sock_rmem_free
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These few drivers call ether_setup(), but have no ndo_change_mtu, and thus
were overlooked for changes to MTU range checking behavior. They
previously had no range checks, so for feature-parity, set their min_mtu
to 0 and max_mtu to ETH_MAX_MTU (65535), instead of the 68 and 1500
inherited from the ether_setup() changes. Fine-tuning can come after we get
back to full feature-parity here.
CC: netdev@vger.kernel.org
Reported-by: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
CC: Asbjoern Sloth Toennesen <asbjorn@asbjorn.st>
CC: R Parameswaran <parameswaran.r7@gmail.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix in commit 8809883482 ("hv_netvsc: set nvdev link after populating
chn_table") turns out to be incomplete. A crash in
netvsc_get_next_send_section() is observed on mtu change when the device
is under load. The race I identified is: if we get to netvsc_send() after
we set net_device_ctx->nvdev link in netvsc_device_add() but before we
finish netvsc_connect_vsp()->netvsc_init_buf() send_section_map is not
allocated and we crash. Unfortunately we can't set net_device_ctx->nvdev
link after the netvsc_init_buf() call as during the negotiation we need
to receive packets and on the receive path we check for it. It would
probably be possible to split nvdev into a pair of nvdev_in and nvdev_out
links and check them accordingly in get_outbound_net_device()/
get_inbound_net_device() but this looks like an overkill.
Check that send_section_map is allocated in netvsc_send().
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jarod Wilson says:
====================
net: use core MTU range checking everywhere
This stack of patches should get absolutely everything in the kernel
converted from doing their own MTU range checking to the core MTU range
checking. This second spin includes alterations to hopefully fix all
concerns raised with the first, as well as including some additional
changes to drivers and infrastructure where I completely missed necessary
updates.
These have all been built through the 0-day build infrastructure via the
(rebasing) master branch at https://github.com/jarodwilson/linux-muck, which
at the time of the most recent compile across 147 configs, was based on
net-next at commit 7b1536ef0a.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
firewire-net:
- set min/max_mtu
- remove fwnet_change_mtu
nes:
- set max_mtu
- clean up nes_netdev_change_mtu
xpnet:
- set min/max_mtu
- remove xpnet_dev_change_mtu
hippi:
- set min/max_mtu
- remove hippi_change_mtu
batman-adv:
- set max_mtu
- remove batadv_interface_change_mtu
- initialization is a little async, not 100% certain that max_mtu is set
in the optimal place, don't have hardware to test with
rionet:
- set min/max_mtu
- remove rionet_change_mtu
slip:
- set min/max_mtu
- streamline sl_change_mtu
um/net_kern:
- remove pointless ndo_change_mtu
hsi/clients/ssi_protocol:
- use core MTU range checking
- remove now redundant ssip_pn_set_mtu
ipoib:
- set a default max MTU value
- Note: ipoib's actual max MTU can vary, depending on if the device is in
connected mode or not, so we'll just set the max_mtu value to the max
possible, and let the ndo_change_mtu function continue to validate any new
MTU change requests with checks for CM or not. Note that ipoib has no
min_mtu set, and thus, the network core's mtu > 0 check is the only lower
bounds here.
mptlan:
- use net core MTU range checking
- remove now redundant mpt_lan_change_mtu
fddi:
- min_mtu = 21, max_mtu = 4470
- remove now redundant fddi_change_mtu (including export)
fjes:
- min_mtu = 8192, max_mtu = 65536
- The max_mtu value is actually one over IP_MAX_MTU here, but the idea is to
get past the core net MTU range checks so fjes_change_mtu can validate a
new MTU against what it supports (see fjes_support_mtu in fjes_hw.c)
hsr:
- min_mtu = 0 (calls ether_setup, max_mtu is 1500)
f_phonet:
- min_mtu = 6, max_mtu = 65541
u_ether:
- min_mtu = 14, max_mtu = 15412
phonet/pep-gprs:
- min_mtu = 576, max_mtu = 65530
- remove redundant gprs_set_mtu
CC: netdev@vger.kernel.org
CC: linux-rdma@vger.kernel.org
CC: Stefan Richter <stefanr@s5r6.in-berlin.de>
CC: Faisal Latif <faisal.latif@intel.com>
CC: linux-rdma@vger.kernel.org
CC: Cliff Whickman <cpw@sgi.com>
CC: Robin Holt <robinmholt@gmail.com>
CC: Jes Sorensen <jes@trained-monkey.org>
CC: Marek Lindner <mareklindner@neomailbox.ch>
CC: Simon Wunderlich <sw@simonwunderlich.de>
CC: Antonio Quartulli <a@unstable.cc>
CC: Sathya Prakash <sathya.prakash@broadcom.com>
CC: Chaitra P B <chaitra.basappa@broadcom.com>
CC: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
CC: MPT-FusionLinux.pdl@broadcom.com
CC: Sebastian Reichel <sre@kernel.org>
CC: Felipe Balbi <balbi@kernel.org>
CC: Arvid Brodin <arvid.brodin@alten.se>
CC: Remi Denis-Courmont <courmisch@gmail.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
hyperv_net:
- set min/max_mtu, per Haiyang, after rndis_filter_device_add
virtio_net:
- set min/max_mtu
- remove virtnet_change_mtu
vmxnet3:
- set min/max_mtu
xen-netback:
- min_mtu = 0, max_mtu = 65517
xen-netfront:
- min_mtu = 0, max_mtu = 65535
unisys/visor:
- clean up defines a little to not clash with network core or add
redundat definitions
CC: netdev@vger.kernel.org
CC: virtualization@lists.linux-foundation.org
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: "Michael S. Tsirkin" <mst@redhat.com>
CC: Shrikrishna Khare <skhare@vmware.com>
CC: "VMware, Inc." <pv-drivers@vmware.com>
CC: Wei Liu <wei.liu2@citrix.com>
CC: Paul Durrant <paul.durrant@citrix.com>
CC: David Kershner <david.kershner@unisys.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>