This patch added a new helper named lookup_subflow_by_daddr to find
whether the destination address is in the msk's conn_list.
In mptcp_pm_nl_add_addr_received, use lookup_subflow_by_daddr to check
whether the announced address is already connected. If it is, skip
connecting this address and send out the echo.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drop the redundant argument 'port' from mptcp_pm_announce_addr, use the
port field of another argument 'addr' instead.
Fixes: 0f5c9e3f07 ("mptcp: add port parameter for mptcp_pm_announce_addr")
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the previous patch we can easily avoid invoking
the workqueue to perform the retransmission, if the
msk socket lock is held at rtx timer expiration.
This also simplifies the relevant code.
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Many tcp sysctls are either bools or small ints that can fit into u8.
Reducing space taken by sysctls can save few cache line misses
when sending/receiving data while cpu caches are empty,
for example after cpu idle period.
This is hard to measure with typical network performance tests,
but after this patch, struct netns_ipv4 has shrunk
by three cache lines.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For these sysctls, their dedicated helpers have
to use proc_dou8vec_minmax().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This sysctl uses ip_fwd_update_priority() helper,
so the conversion needs to change it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These sysctls that can fit in one byte instead of one int
are converted to save space and thus reduce cache line misses.
- icmp_echo_ignore_all, icmp_echo_ignore_broadcasts,
- icmp_ignore_bogus_error_responses, icmp_errors_use_inbound_ifaddr
- tcp_ecn, tcp_ecn_fallback
- ip_default_ttl, ip_no_pmtu_disc, ip_fwd_use_pmtu
- ip_nonlocal_bind, ip_autobind_reuse
- ip_dynaddr, ip_early_demux, raw_l3mdev_accept
- nexthop_compat_mode, fwmark_reflect
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netdev_unregister_timeout_secs=0 can lead to printing the
"waiting for dev to become free" message every jiffy.
This is too frequent and unnecessary.
Set the min value to 1 second.
Also fix the merge issue introduced by
"net: make unregister netdev warning timeout configurable":
it changed "refcnt != 1" to "refcnt".
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Fixes: 5aa3afe107 ("net: make unregister netdev warning timeout configurable")
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modify "accomodate" to "accommodate" in net/ipv4/esp4.c.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Lu Wei <luwei32@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modify "Apparantly" to "Apparently" in net/dsa/tag_rtl4_a.c..
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Lu Wei <luwei32@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modify "erronous" to "erroneous" in net/decnet/dn_nsp_in.c.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Lu Wei <luwei32@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modify "funciton" to "function" in net/core/dev_addr_lists.c.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Lu Wei <luwei32@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Modify "inital" to "initial" in net/ceph/osdmap.c.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Lu Wei <luwei32@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
s/addres/address
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Lu Wei <luwei32@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add extack error messages for -EINVAL errors when enabling bearer,
getting/setting properties for a media/bearer
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reject NONE on set, this mode means device does not support
FEC so it's a little out of place in the set interface.
This should be safe to do - user space ethtool does not allow
the use of NONE on set. A few drivers treat it the same as OFF,
but none use it instead of OFF.
Similarly reject an empty FEC mask. The common user space tool
will not send such requests and most drivers correctly reject
it already.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct ethtool_fecparam::active_fec is a GET-only field,
all in-tree drivers correctly ignore it on SET. Clear
the field on SET to avoid any confusion. Again, we can't
reject non-zero now since ethtool user space does not
zero-init the param correctly.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct ethtool_fecparam::reserved is never looked at by the core.
Make sure it's actually 0. Unfortunately we can't return an error
because old ethtool doesn't zero-initialize the structure for SET.
On GET we can be more verbose, there are no in tree (ab)users.
Fix up the kdoc on the structure. Remove the mention of FEC
bypass. Seems like a niche thing to configure in the first
place.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-03-24
The following pull-request contains BPF updates for your *net-next* tree.
We've added 37 non-merge commits during the last 15 day(s) which contain
a total of 65 files changed, 3200 insertions(+), 738 deletions(-).
The main changes are:
1) Static linking of multiple BPF ELF files, from Andrii.
2) Move drop error path to devmap for XDP_REDIRECT, from Lorenzo.
3) Spelling fixes from various folks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"Various fixes, all over:
1) Fix overflow in ptp_qoriq_adjfine(), from Yangbo Lu.
2) Always store the rx queue mapping in veth, from Maciej
Fijalkowski.
3) Don't allow vmlinux btf in map_create, from Alexei Starovoitov.
4) Fix memory leak in octeontx2-af from Colin Ian King.
5) Use kvalloc in bpf x86 JIT for storing jit'd addresses, from
Yonghong Song.
6) Fix tx ptp stats in mlx5, from Aya Levin.
7) Check correct ip version in tun decap, fropm Roi Dayan.
8) Fix rate calculation in mlx5 E-Switch code, from arav Pandit.
9) Work item memork leak in mlx5, from Shay Drory.
10) Fix ip6ip6 tunnel crash with bpf, from Daniel Borkmann.
11) Lack of preemptrion awareness in macvlan, from Eric Dumazet.
12) Fix data race in pxa168_eth, from Pavel Andrianov.
13) Range validate stab in red_check_params(), from Eric Dumazet.
14) Inherit vlan filtering setting properly in b53 driver, from
Florian Fainelli.
15) Fix rtnl locking in igc driver, from Sasha Neftin.
16) Pause handling fixes in igc driver, from Muhammad Husaini
Zulkifli.
17) Missing rtnl locking in e1000_reset_task, from Vitaly Lifshits.
18) Use after free in qlcnic, from Lv Yunlong.
19) fix crash in fritzpci mISDN, from Tong Zhang.
20) Premature rx buffer reuse in igb, from Li RongQing.
21) Missing termination of ip[a driver message handler arrays, from
Alex Elder.
22) Fix race between "x25_close" and "x25_xmit"/"x25_rx" in hdlc_x25
driver, from Xie He.
23) Use after free in c_can_pci_remove(), from Tong Zhang.
24) Uninitialized variable use in nl80211, from Jarod Wilson.
25) Off by one size calc in bpf verifier, from Piotr Krysiuk.
26) Use delayed work instead of deferrable for flowtable GC, from
Yinjun Zhang.
27) Fix infinite loop in NPC unmap of octeontx2 driver, from
Hariprasad Kelam.
28) Fix being unable to change MTU of dwmac-sun8i devices due to lack
of fifo sizes, from Corentin Labbe.
29) DMA use after free in r8169 with WoL, fom Heiner Kallweit.
30) Mismatched prototypes in isdn-capi, from Arnd Bergmann.
31) Fix psample UAPI breakage, from Ido Schimmel"
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (171 commits)
psample: Fix user API breakage
math: Export mul_u64_u64_div_u64
ch_ktls: fix enum-conversion warning
octeontx2-af: Fix memory leak of object buf
ptp_qoriq: fix overflow in ptp_qoriq_adjfine() u64 calcalation
net: bridge: don't notify switchdev for local FDB addresses
net/sched: act_ct: clear post_ct if doing ct_clear
net: dsa: don't assign an error value to tag_ops
isdn: capi: fix mismatched prototypes
net/mlx5: SF, do not use ecpu bit for vhca state processing
net/mlx5e: Fix division by 0 in mlx5e_select_queue
net/mlx5e: Fix error path for ethtool set-priv-flag
net/mlx5e: Offload tuple rewrite for non-CT flows
net/mlx5e: Allow to match on MPLS parameters only for MPLS over UDP
net/mlx5: Add back multicast stats for uplink representor
net: ipconfig: ic_dev can be NULL in ic_close_devs
MAINTAINERS: Combine "QLOGIC QLGE 10Gb ETHERNET DRIVER" sections into one
docs: networking: Fix a typo
r8169: fix DMA being used after buffer free if WoL is enabled
net: ipa: fix init header command validation
...
s/Orignal/Original/
s/infered/inferred/
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
s/sequencially/sequentially/
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
s/beggining/beginning/
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit 73f156a6e8 ("inetpeer: get rid of ip_id_count")
I used a very small hash table that could be abused
by patient attackers to reveal sensitive information.
Switch to a dynamic sizing, depending on RAM size.
Typical big hosts will now use 128x more storage (2 MB)
to get a similar increase in security and reduction
of hash collisions.
As a bonus, use of alloc_large_system_hash() spreads
allocated memory among all NUMA nodes.
Fixes: 73f156a6e8 ("inetpeer: get rid of ip_id_count")
Reported-by: Amit Klein <aksecurity@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Made changes to coding style as suggested by checkpatch.pl
changes are of the type:
space required before the open parenthesis '('
space required after that ','
Signed-off-by: Sai Kalyaan Palla <saikalyaan63@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
s/procdure/procedure/
s/maintanance/maintenance/
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The dsa infrastructure provides a well-defined hierarchy of devices,
pass up the call to set up the flow block to the master device. From the
software dataplane, the netfilter infrastructure uses the dsa slave
devices to refer to the input and output device for the given skbuff.
Similarly, the flowtable definition in the ruleset refers to the dsa
slave port devices.
This patch adds the glue code to call ndo_setup_tc with TC_SETUP_FT
with the master device via the dsa slave devices.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a PPPoE push action if layer 2 protocol is ETH_P_PPP_SES to add
PPPoE flowtable hardware offload support.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The switch might have already added the VLAN tag through PVID hardware
offload. Keep this extra VLAN in the flowtable but skip it on egress.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
If there is a forward path to reach an ethernet device and hardware
offload is enabled, then use the direct xmit path.
Moreover, store the real device in the direct xmit path info since
software datapath uses dev_hard_header() to push the layer encapsulation
headers while hardware offload refers to the real device.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the flow tuple xmit_type is set to FLOW_OFFLOAD_XMIT_DIRECT, the
dst_cache pointer is not valid, and the h_source/h_dest/ifidx out fields
need to be used.
This patch also adds the FLOW_ACTION_VLAN_PUSH action to pass the VLAN
tag to the driver.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace the master ethernet device by the dsa slave port. Packets coming
in from the software ingress path use the dsa slave port as input
device.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the PPPoE protocol and session id to the flow tuple using the encap
fields to uniquely identify flows from the receive path. For the
transmit path, dev_hard_header() on the vlan device push the headers.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the vlan tag based when PVID is set on.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the vlan id and protocol to the flow tuple to uniquely identify
flows from the receive path. For the transmit path, dev_hard_header() on
the vlan device push the headers. This patch includes support for two
vlan headers (QinQ) from the ingress path.
Add a generic encap field to the flowtable entry which stores the
protocol and the tag id. This allows to reuse these fields in the PPPoE
support coming in a later patch.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The egress device in the tuple is obtained from route. Use
dev_fill_forward_path() instead to provide the real egress device for
this flow whenever this is available.
The new FLOW_OFFLOAD_XMIT_DIRECT type uses dev_queue_xmit() to transmit
ethernet frames. Cache the source and destination hardware address to
use dev_queue_xmit() to transfer packets.
The FLOW_OFFLOAD_XMIT_DIRECT replaces FLOW_OFFLOAD_XMIT_NEIGH if
dev_fill_forward_path() finds a direct transmit path.
In case of topology updates, if peer is moved to different bridge port,
the connection will time out, reconnect will result in a new entry with
the correct path. Snooping fdb updates would allow for cleaning up stale
flowtable entries.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Obtain the ingress device in the tuple from the route in the reply
direction. Use dev_fill_forward_path() instead to get the real ingress
device for this flow.
Fall back to use the ingress device that the IP forwarding route
provides if:
- dev_fill_forward_path() finds no real ingress device.
- the ingress device that is obtained is not part of the flowtable
devices.
- this route has a xfrm policy.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the xmit_type field that defines the two supported xmit paths in the
flowtable data plane, which are the neighbour and the xfrm xmit paths.
This patch prepares for new flowtable xmit path types to come.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add .ndo_fill_forward_path for dsa slave port devices
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Depending on the VLAN settings of the bridge and the port, the bridge can
either add or remove a tag. When vlan filtering is enabled, the fdb lookup
also needs to know the VLAN tag/proto for the destination address
To provide this, keep track of the stack of VLAN tags for the path in the
lookup context
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add .ndo_fill_forward_path for bridge devices.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add .ndo_fill_forward_path for vlan devices.
For instance, assuming the following topology:
IP forwarding
/ \
eth0.100 eth0
|
eth0
.
.
.
ethX
ab💿ef🆎cd:ef
For packets going through IP forwarding to eth0.100 whose destination
MAC address is ab💿ef🆎cd:ef, dev_fill_forward_path() provides the
following path:
eth0.100 -> eth0
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds dev_fill_forward_path() which resolves the path to reach
the real netdevice from the IP forwarding side. This function takes as
input the netdevice and the destination hardware address and it walks
down the devices calling .ndo_fill_forward_path() for each device until
the real device is found.
For instance, assuming the following topology:
IP forwarding
/ \
br0 eth0
/ \
eth1 eth2
.
.
.
ethX
ab💿ef🆎cd:ef
where eth1 and eth2 are bridge ports and eth0 provides WAN connectivity.
ethX is the interface in another box which is connected to the eth1
bridge port.
For packets going through IP forwarding to br0 whose destination MAC
address is ab💿ef🆎cd:ef, dev_fill_forward_path() provides the
following path:
br0 -> eth1
.ndo_fill_forward_path for br0 looks up at the FDB for the bridge port
from the destination MAC address to get the bridge port eth1.
This information allows to create a fast path that bypasses the classic
bridge and IP forwarding paths, so packets go directly from the bridge
port eth1 to eth0 (wan interface) and vice versa.
fast path
.------------------------.
/ \
| IP forwarding |
| / \ \/
| br0 eth0
. / \
-> eth1 eth2
.
.
.
ethX
ab💿ef🆎cd:ef
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The call to br_vlan_replay_one is returning an error return value but
this is not being assigned to err and the following check on err is
currently always false because err was initialized to zero. Fix this
by assigning err.
Addresses-Coverity: ("'Constant' variable guards dead code")
Fixes: 22f67cdfae ("net: bridge: add helper to replay VLANs installed on port")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When an MRP instance was created, the driver was notified that the
instance is created and then in a different callback about role of the
instance. But when the instance was deleted the driver was notified only
that the MRP instance is deleted and not also that the role is disabled.
This patch make sure that the driver is notified that the role is
changed to disabled before the MRP instance is deleted to have similar
callbacks with the creating of the instance. In this way it would
simplify the logic in the drivers.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
netdev_wait_allrefs() issues a warning if refcount does not drop to 0
after 10 seconds. While 10 second wait generally should not happen
under normal workload in normal environment, it seems to fire falsely
very often during fuzzing and/or in qemu emulation (~10x slower).
At least it's not possible to understand if it's really a false
positive or not. Automated testing generally bumps all timeouts
to very high values to avoid flake failures.
Add net.core.netdev_unregister_timeout_secs sysctl to make
the timeout configurable for automated testing systems.
Lowering the timeout may also be useful for e.g. manual bisection.
The default value matches the current behavior.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
If we join an already-created bridge port, such as a bond master
interface, then we can miss the initial switchdev notifications emitted
by the bridge for this port, while it wasn't offloaded by anybody.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
DSA currently assumes that the bridge port starts off with this
constellation of bridge port flags:
- learning on
- unicast flooding on
- multicast flooding on
- broadcast flooding on
just by virtue of code copy-pasta from the bridge layer (new_nbp).
This was a simple enough strategy thus far, because the 'bridge join'
moment always coincided with the 'bridge port creation' moment.
But with sandwiched interfaces, such as:
br0
|
bond0
|
swp0
it may happen that the user has had time to change the bridge port flags
of bond0 before enslaving swp0 to it. In that case, swp0 will falsely
assume that the bridge port flags are those determined by new_nbp, when
in fact this can happen:
ip link add br0 type bridge
ip link add bond0 type bond
ip link set bond0 master br0
ip link set bond0 type bridge_slave learning off
ip link set swp0 master br0
Now swp0 has learning enabled, bond0 has learning disabled. Not nice.
Fix this by "dumpster diving" through the actual bridge port flags with
br_port_flag_is_set, at bridge join time.
We use this opportunity to split dsa_port_change_brport_flags into two
distinct functions called dsa_port_inherit_brport_flags and
dsa_port_clear_brport_flags, now that the implementation for the two
cases is no longer similar. This patch also creates two functions called
dsa_port_switchdev_sync and dsa_port_switchdev_unsync which collect what
we have so far, even if that's asymmetrical. More is going to be added
in the next patch.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a pretty noisy change that was broken out of the larger change
for replaying switchdev attributes and objects at bridge join time,
which is when these extack objects are actually used.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
DSA can properly detect and offload this sequence of operations:
ip link add br0 type bridge
ip link add bond0 type bond
ip link set swp0 master bond0
ip link set bond0 master br0
But not this one:
ip link add br0 type bridge
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0
Actually the second one is more complicated, due to the elapsed time
between the enslavement of bond0 and the offloading of it via swp0, a
lot of things could have happened to the bond0 bridge port in terms of
switchdev objects (host MDBs, VLANs, altered STP state etc). So this is
a bit of a can of worms, and making sure that the DSA port's state is in
sync with this already existing bridge port is handled in the next
patches.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently this simple setup with DSA:
ip link add br0 type bridge vlan_filtering 1
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0
will not work because the bridge has created the PVID in br_add_if ->
nbp_vlan_init, and it has notified switchdev of the existence of VLAN 1,
but that was too early, since swp0 was not yet a lower of bond0, so it
had no reason to act upon that notification.
We need a helper in the bridge to replay the switchdev VLAN objects that
were notified since the bridge port creation, because some of them may
have been missed.
As opposed to the br_mdb_replay function, the vg->vlan_list write side
protection is offered by the rtnl_mutex which is sleepable, so we don't
need to queue up the objects in atomic context, we can replay them right
away.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a switchdev port starts offloading a LAG that is already in a
bridge and has an FDB entry pointing to it:
ip link set bond0 master br0
bridge fdb add dev bond0 00:01:02:03:04:05 master static
ip link set swp0 master bond0
the switchdev driver will have no idea that this FDB entry is there,
because it missed the switchdev event emitted at its creation.
Ido Schimmel pointed this out during a discussion about challenges with
switchdev offloading of stacked interfaces between the physical port and
the bridge, and recommended to just catch that condition and deny the
CHANGEUPPER event:
https://lore.kernel.org/netdev/20210210105949.GB287766@shredder.lan/
But in fact, we might need to deal with the hard thing anyway, which is
to replay all FDB addresses relevant to this port, because it isn't just
static FDB entries, but also local addresses (ones that are not
forwarded but terminated by the bridge). There, we can't just say 'oh
yeah, there was an upper already so I'm not joining that'.
So, similar to the logic for replaying MDB entries, add a function that
must be called by individual switchdev drivers and replays local FDB
entries as well as ones pointing towards a bridge port. This time, we
use the atomic switchdev notifier block, since that's what FDB entries
expect for some reason.
Reported-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I have a system with DSA ports, and udhcpcd is configured to bring
interfaces up as soon as they are created.
I create a bridge as follows:
ip link add br0 type bridge
As soon as I create the bridge and udhcpcd brings it up, I also have
avahi which automatically starts sending IPv6 packets to advertise some
local services, and because of that, the br0 bridge joins the following
IPv6 groups due to the code path detailed below:
33:33:ff:6d:c1:9c vid 0
33:33:00:00:00:6a vid 0
33:33:00:00:00:fb vid 0
br_dev_xmit
-> br_multicast_rcv
-> br_ip6_multicast_add_group
-> __br_multicast_add_group
-> br_multicast_host_join
-> br_mdb_notify
This is all fine, but inside br_mdb_notify we have br_mdb_switchdev_host
hooked up, and switchdev will attempt to offload the host joined groups
to an empty list of ports. Of course nobody offloads them.
Then when we add a port to br0:
ip link set swp0 master br0
the bridge doesn't replay the host-joined MDB entries from br_add_if,
and eventually the host joined addresses expire, and a switchdev
notification for deleting it is emitted, but surprise, the original
addition was already completely missed.
The strategy to address this problem is to replay the MDB entries (both
the port ones and the host joined ones) when the new port joins the
bridge, similar to what vxlan_fdb_replay does (in that case, its FDB can
be populated and only then attached to a bridge that you offload).
However there are 2 possibilities: the addresses can be 'pushed' by the
bridge into the port, or the port can 'pull' them from the bridge.
Considering that in the general case, the new port can be really late to
the party, and there may have been many other switchdev ports that
already received the initial notification, we would like to avoid
delivering duplicate events to them, since they might misbehave. And
currently, the bridge calls the entire switchdev notifier chain, whereas
for replaying it should just call the notifier block of the new guy.
But the bridge doesn't know what is the new guy's notifier block, it
just knows where the switchdev notifier chain is. So for simplification,
we make this a driver-initiated pull for now, and the notifier block is
passed as an argument.
To emulate the calling context for mdb objects (deferred and put on the
blocking notifier chain), we must iterate under RCU protection through
the bridge's mdb entries, queue them, and only call them once we're out
of the RCU read-side critical section.
There was some opportunity for reuse between br_mdb_switchdev_host_port,
br_mdb_notify and the newly added br_mdb_queue_one in how the switchdev
mdb object is created, so a helper was created.
Suggested-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME attribute is only emitted from:
sysfs/ioctl/netlink
-> br_set_ageing_time
-> __set_ageing_time
therefore not at bridge port creation time, so:
(a) switchdev drivers have to hardcode the initial value for the address
ageing time, because they didn't get any notification
(b) that hardcoded value can be out of sync, if the user changes the
ageing time before enslaving the port to the bridge
We need a helper in the bridge, such that switchdev drivers can query
the current value of the bridge ageing time when they start offloading
it.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It may happen that we have the following topology with DSA or any other
switchdev driver with LAG offload:
ip link add br0 type bridge stp_state 1
ip link add bond0 type bond
ip link set bond0 master br0
ip link set swp0 master bond0
ip link set swp1 master bond0
STP decides that it should put bond0 into the BLOCKING state, and
that's that. The ports that are actively listening for the switchdev
port attributes emitted for the bond0 bridge port (because they are
offloading it) and have the honor of seeing that switchdev port
attribute can react to it, so we can program swp0 and swp1 into the
BLOCKING state.
But if then we do:
ip link set swp2 master bond0
then as far as the bridge is concerned, nothing has changed: it still
has one bridge port. But this new bridge port will not see any STP state
change notification and will remain FORWARDING, which is how the
standalone code leaves it in.
We need a function in the bridge driver which retrieves the current STP
state, such that drivers can synchronize to it when they may have missed
switchdev events.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As explained in this discussion:
https://lore.kernel.org/netdev/20210117193009.io3nungdwuzmo5f7@skbuf/
the switchdev notifiers for FDB entries managed to have a zero-day bug.
The bridge would not say that this entry is local:
ip link add br0 type bridge
ip link set swp0 master br0
bridge fdb add dev swp0 00:01:02:03:04:05 master local
and the switchdev driver would be more than happy to offload it as a
normal static FDB entry. This is despite the fact that 'local' and
non-'local' entries have completely opposite directions: a local entry
is locally terminated and not forwarded, whereas a static entry is
forwarded and not locally terminated. So, for example, DSA would install
this entry on swp0 instead of installing it on the CPU port as it should.
There is an even sadder part, which is that the 'local' flag is implicit
if 'static' is not specified, meaning that this command produces the
same result of adding a 'local' entry:
bridge fdb add dev swp0 00:01:02:03:04:05 master
I've updated the man pages for 'bridge', and after reading it now, it
should be pretty clear to any user that the commands above were broken
and should have never resulted in the 00:01:02:03:04:05 address being
forwarded (this behavior is coherent with non-switchdev interfaces):
https://patchwork.kernel.org/project/netdevbpf/cover/20210211104502.2081443-1-olteanv@gmail.com/
If you're a user reading this and this is what you want, just use:
bridge fdb add dev swp0 00:01:02:03:04:05 master static
Because switchdev should have given drivers the means from day one to
classify FDB entries as local/non-local, but didn't, it means that all
drivers are currently broken. So we can just as well omit the switchdev
notifications for local FDB entries, which is exactly what this patch
does to close the bug in stable trees. For further development work
where drivers might want to trap the local FDB entries to the host, we
can add a 'bool is_local' to br_switchdev_fdb_call_notifiers(), and
selectively make drivers act upon that bit, while all the others ignore
those entries if the 'is_local' bit is set.
Fixes: 6b26b51b1d ("net: bridge: Add support for notifying devices about FDB add/del")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Invalid detection works with two distinct moments: act_ct tries to find
a conntrack entry and set post_ct true, indicating that that was
attempted. Then, when flow dissector tries to dissect CT info and no
entry is there, it knows that it was tried and no entry was found, and
synthesizes/sets
key->ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED |
TCA_FLOWER_KEY_CT_FLAGS_INVALID;
mimicing what OVS does.
OVS has this a bit more streamlined, as it recomputes the key after
trying to find a conntrack entry for it.
Issue here is, when we have 'tc action ct clear', it didn't clear
post_ct, causing a subsequent match on 'ct_state -trk' to fail, due to
the above. The fix, thus, is to clear it.
Reproducer rules:
tc filter add dev enp130s0f0np0_0 ingress prio 1 chain 0 \
protocol ip flower ip_proto tcp ct_state -trk \
action ct zone 1 pipe \
action goto chain 2
tc filter add dev enp130s0f0np0_0 ingress prio 1 chain 2 \
protocol ip flower \
action ct clear pipe \
action goto chain 4
tc filter add dev enp130s0f0np0_0 ingress prio 1 chain 4 \
protocol ip flower ct_state -trk \
action mirred egress redirect dev enp130s0f1np1_0
With the fix, the 3rd rule matches, like it does with OVS kernel
datapath.
Fixes: 7baf2429a1 ("net/sched: cls_flower add CT_FLAGS_INVALID flag support")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
Problem:
The "lapb_t1timer_running" function in "lapb_timer.c" is used in only
one place: in the "lapb_kick" function in "lapb_out.c". "lapb_kick" calls
"lapb_t1timer_running" to check if the timer is already pending, and if
it is not, schedule it to run.
However, if the timer has already fired and is running, and is waiting to
get the "lapb->lock" lock, "lapb_t1timer_running" will not detect this,
and "lapb_kick" will then schedule a new timer. The old timer will then
abort when it sees a new timer pending.
I think this is not right. The purpose of "lapb_kick" should be ensuring
that the actual work of the timer function is scheduled to be done.
If the timer function is already running but waiting for the lock,
"lapb_kick" should not abort and reschedule it.
Changes made:
I added a new field "t1timer_running" in "struct lapb_cb" for
"lapb_t1timer_running" to use. "t1timer_running" will accurately reflect
whether the actual work of the timer is pending. If the timer has fired
but is still waiting for the lock, "t1timer_running" will still correctly
reflect whether the actual work is waiting to be done.
The old "t1timer_stop" field, whose only responsibility is to ask a timer
(that is already running but waiting for the lock) to abort, is no longer
needed, because the new "t1timer_running" field can fully take over its
responsibility. Therefore "t1timer_stop" is deleted.
"t1timer_running" is not simply a negation of the old "t1timer_stop".
At the end of the timer function, if it does not reschedule itself,
"t1timer_running" is set to false to indicate that the timer is stopped.
For consistency of the code, I also added "t2timer_running" and deleted
"t2timer_stop".
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use a temporary variable to hold the return value from
dsa_tag_driver_get() instead of assigning it to dst->tag_ops. Leaving
an error value in dst->tag_ops can result in deferencing an invalid
pointer when a deferred switch configuration happens later.
Fixes: 357f203bb3 ("net: dsa: keep a copy of the tagging protocol in the DSA switch tree")
Signed-off-by: George McCollister <george.mccollister@gmail.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following batch contains Netfilter updates for net-next:
1) Split flowtable workqueues per events, from Oz Shlomo.
2) fall-through warnings for clang, from Gustavo A. R. Silva
3) Remove unused declaration in conntrack, from YueHaibing.
4) Consolidate skb_try_make_writable() in flowtable datapath,
simplify some of the existing codebase.
5) Call dst_check() to fall back to static classic forwarding path.
6) Update table flags from commit phase.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When adding CONFIG_PCPU_DEV_REFCNT, I forgot that the
initial net device refcount was 0.
When CONFIG_PCPU_DEV_REFCNT is not set, this means
the first dev_hold() triggers an illegal refcount
operation (addition on 0)
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 0 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0x128/0x1a4
Fix is to change initial (and final) refcount to be 1.
Also add a missing kerneldoc piece, as reported by
Stephen Rothwell.
Fixes: 919067cc84 ("net: add CONFIG_PCPU_DEV_REFCNT")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Guenter Roeck <groeck@google.com>
Tested-by: Guenter Roeck <groeck@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recently we had an interop issue where RARP packets got suppressed with
bridge neigh suppression enabled, but the check in the code was meant to
suppress GARP. Exclude RARP packets from it which would allow some VMWare
setups to work, to quote the report:
"Those RARP packets usually get generated by vMware to notify physical
switches when vMotion occurs. vMware may use random sip/tip or just use
sip=tip=0. So the RARP packet sometimes get properly flooded by the vtep
and other times get dropped by the logic"
Reported-by: Amer Abdalamer <amer@nvidia.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
xps_queue_show is mostly made of an RCU read-side critical section and
calls bitmap_zalloc with GFP_KERNEL in the middle of it. That is not
allowed as this call may sleep and such behaviours aren't allowed in RCU
read-side critical sections. Fix this by using GFP_NOWAIT instead.
Fixes: 5478fcd0f4 ("net: embed nr_ids in the xps maps")
Reported-by: kernel test robot <oliver.sang@intel.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
s/verifed/verified/
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ptype_all and ptype_base are declared in net/core/dev.c as non-static,
because they are used by net-procfs.c too. However, a "make W=1" build
complains that there was no previous declaration of ptype_all and
ptype_base in a header file, so this way of declaring things constitutes
a violation of coding style.
Let's move the extern declarations of ptype_all and ptype_base to the
linux/netdevice.h file, which is included by net-procfs.c too.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since their introduction in commit 04157469b7 ("net: Use static_key
for XPS maps"), xps_needed and xps_rxqs_needed were never used outside
net/core/dev.c, so I don't really understand why they were exported as
symbols in the first place.
This is needed in order to silence a "make W=1" warning about these
static keys not being declared as static variables, but not having a
previous declaration in a header file nonetheless.
Cc: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The only caller of br_vlan_tunnel_lookup, br_handle_ingress_vlan_tunnel,
extracts the tunnel_id from struct ip_tunnel_info::struct ip_tunnel_key::
tun_id which is a __be64 value.
The exact endianness does not seem to matter, because the tunnel id is
just used as a lookup key for the VLAN group's tunnel hash table, and
the value is not interpreted directly per se. Moreover,
rhashtable_lookup_fast treats the key argument as a const void *.
Therefore, there is no functional change associated with this patch,
just one to silence "make W=1" builds.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
s/subsytem/subsystem/
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ic_close_dev contains a generalization of the logic to not close a
network interface if it's the host port for a DSA switch. This logic is
disguised behind an iteration through the lowers of ic_dev in
ic_close_dev.
When no interface for ipconfig can be found, ic_dev is NULL, and
ic_close_dev:
- dereferences a NULL pointer when assigning selected_dev
- would attempt to search through the lower interfaces of a NULL
net_device pointer
So we should protect against that case.
The "lower_dev" iterator variable was shortened to "lower" in order to
keep the 80 character limit.
Fixes: f68cbaed67 ("net: ipconfig: avoid use-after-free in ic_close_devs")
Fixes: 46acf7bdbc ("Revert "net: ipv4: handle DSA enabled master network devices"")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Heiko Thiery <heiko.thiery@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The existing code is functionally correct: iproute2 parses the ip_flags
argument for tc-flower and really packs it as big endian into the
TCA_FLOWER_KEY_FLAGS netlink attribute. But there is a problem in the
fact that W=1 builds complain:
net/sched/cls_flower.c:1047:15: warning: cast to restricted __be32
This is because we should use the dedicated helper for obtaining a
__be32 pointer to the netlink attribute, not a u32 one. This ensures
type correctness for be32_to_cpu.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A make W=1 build complains that:
net/sched/cls_flower.c:214:20: warning: cast from restricted __be16
net/sched/cls_flower.c:214:20: warning: incorrect type in argument 1 (different base types)
net/sched/cls_flower.c:214:20: expected unsigned short [usertype] val
net/sched/cls_flower.c:214:20: got restricted __be16 [usertype] dst
This is because we use htons on struct flow_dissector_key_ports members
src and dst, which are defined as __be16, so they are already in network
byte order, not host. The byte swap function for the other direction
should have been used.
Because htons and ntohs do the same thing (either both swap, or none
does), this change has no functional effect except to silence the
warnings.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1. Remove CONFIG_HAVE_NET_DSA.
CONFIG_HAVE_NET_DSA is a legacy leftover from the times when drivers
should have selected CONFIG_NET_DSA manually.
Currently, all drivers has explicit 'depends on NET_DSA', so this is
no more needed.
2. CONFIG_HAVE_NET_DSA dependencies became CONFIG_NET_DSA's ones.
- dropped !S390 dependency which was introduced to be sure NET_DSA
can select CONFIG_PHYLIB. DSA migrated to Phylink almost 3 years
ago and the PHY library itself doesn't depend on !S390 since
commit 870a2b5e4f ("phylib: remove !S390 dependeny from Kconfig");
- INET dependency is kept to be sure we can select NET_SWITCHDEV;
- NETDEVICES dependency is kept to be sure we can select PHYLINK.
3. DSA drivers menu now depends on NET_DSA.
Instead on 'depends on NET_DSA' on every single driver, the entire
menu now depends on it. This eliminates a lot of duplicated lines
from Kconfig with no loss (when CONFIG_NET_DSA=m, drivers also can
be only m or n).
This also has a nice side effect that there's no more empty menu on
configurations without DSA.
4. Kbuild will now descend into 'drivers/net/dsa' only when
CONFIG_NET_DSA is y or m.
This is safe since no objects inside this folder can be built without
DSA core, as well as when CONFIG_NET_DSA=m, no objects can be
built-in.
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Made changes to coding style as suggested by checkpatch.pl
changes are of the type:
open brace '{' following struct go on the same line
do not use assignment in if condition
Signed-off-by: Sai Kalyaan Palla <saikalyaan63@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit d4eb538e1f ("can: isotp: TX-path: ensure that CAN frame flags are
initialized") ensured the TX flags to be properly set for outgoing CAN
frames.
In fact the root cause of the issue results from a missing initialization
of outgoing CAN frames created by isotp. This is no problem on the CAN bus
as the CAN driver only picks the correctly defined content from the struct
can(fd)_frame. But when the outgoing frames are monitored (e.g. with
candump) we potentially leak some bytes in the unused content of
struct can(fd)_frame.
Fixes: e057dd3fc2 ("can: add ISO 15765-2:2016 transport protocol")
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/r/20210319100619.10858-1-socketcan@hartkopp.net
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
For AF_VSOCK, accept() currently returns sockets that are unlabelled.
Other socket families derive the child's SID from the SID of the parent
and the SID of the incoming packet. This is typically done as the
connected socket is placed in the queue that accept() removes from.
Reuse the existing 'security_sk_clone' hook to copy the SID from the
parent (server) socket to the child. There is no packet SID in this
case.
Fixes: d021c34405 ("VSOCK: Introduce VM Sockets")
Signed-off-by: David Brazdil <dbrazdil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I was working on a syzbot issue, claiming one device could not be
dismantled because its refcount was -1
unregister_netdevice: waiting for sit0 to become free. Usage count = -1
It would be nice if syzbot could trigger a warning at the time
this reference count became negative.
This patch adds CONFIG_PCPU_DEV_REFCNT options which defaults
to per cpu variables (as before this patch) on SMP builds.
v2: free_dev label in alloc_netdev_mqs() is moved to avoid
a compiler warning (-Wunused-label), as reported
by kernel test robot <lkp@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The sk's sk_route_caps is set in sctp_packet_config, and later it
only needs to change when traversing the transport_list in a loop,
as the dst might be changed in the tx path.
So move sk_route_caps check and set into sctp_outq_flush_transports
from sctp_packet_transmit. This also fixes a dst leak reported by
Chen Yi:
https://bugzilla.kernel.org/show_bug.cgi?id=212227
As calling sk_setup_caps() in sctp_packet_transmit may also set the
sk_route_caps for the ctrl sock in a netns. When the netns is being
deleted, the ctrl sock's releasing is later than dst dev's deleting,
which will cause this dev's deleting to hang and dmesg error occurs:
unregister_netdevice: waiting for xxx to become free. Usage count = 1
Reported-by: Chen Yi <yiche@redhat.com>
Fixes: bcd623d8e9 ("sctp: call sk_setup_caps in sctp_packet_transmit instead")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using short intervals e.g. below one millisecond, large packets won't be
transmitted at all. The software implementations checks whether the packet can
be fit into the remaining interval. Therefore, it takes the packet length and
the transmission speed into account. That is correct.
However, for large packets it may be that the transmission time exceeds the
interval resulting in no packet transmission. The same situation works fine with
hardware offloading applied.
The problem has been observed with the following schedule and iperf3:
|tc qdisc replace dev lan1 parent root handle 100 taprio \
| num_tc 8 \
| map 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 \
| queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
| base-time $base \
| sched-entry S 0x40 500000 \
| sched-entry S 0xbf 500000 \
| clockid CLOCK_TAI \
| flags 0x00
[...]
|root@tsn:~# iperf3 -c 192.168.2.105
|Connecting to host 192.168.2.105, port 5201
|[ 5] local 192.168.2.121 port 52610 connected to 192.168.2.105 port 5201
|[ ID] Interval Transfer Bitrate Retr Cwnd
|[ 5] 0.00-1.00 sec 45.2 KBytes 370 Kbits/sec 0 1.41 KBytes
|[ 5] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 0 1.41 KBytes
After debugging, it seems that the packet length stored in the SKB is about
7000-8000 bytes. Using a 100 Mbit/s link the transmission time is about 600us
which larger than the interval of 500us.
Therefore, segment the SKB into smaller chunks if the packet is too big. This
yields similar results than the hardware offload:
|root@tsn:~# iperf3 -c 192.168.2.105
|Connecting to host 192.168.2.105, port 5201
|- - - - - - - - - - - - - - - - - - - - - - - - -
|[ ID] Interval Transfer Bitrate Retr
|[ 5] 0.00-10.00 sec 48.9 MBytes 41.0 Mbits/sec 0 sender
|[ 5] 0.00-10.02 sec 48.7 MBytes 40.7 Mbits/sec receiver
Furthermore, the segmentation can be skipped for the full offload case, as the
driver or the hardware is expected to handle this.
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
The two most popular headers going after Ethernet are IPv4 and IPv6.
Retpoline overhead for them is addressed only in dev_gro_receive(),
when they lie right after the outermost Ethernet header.
Use the indirect call wrappers in TEB (Transparent Ethernet Bridging,
such as GENEVE, NvGRE, VxLAN etc.) GRO receive code to reduce the
penalty when processing the inner headers.
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
The two most popular headers going after VLAN are IPv4 and IPv6.
Retpoline overhead for them is addressed only in dev_gro_receive(),
when they lie right after the outermost Ethernet header.
Use the indirect call wrappers in VLAN GRO receive code to reduce
the penalty on receiving tagged frames (when hardware stripping is
off or not available).
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Several patches to testore use of memory barriers instead of RCU to
ensure consistent access to ruleset, from Mark Tomlinson.
2) Fix dump of expectation via ctnetlink, from Florian Westphal.
3) GRE helper works for IPv6, from Ludovic Senecaux.
4) Set error on unsupported flowtable flags.
5) Use delayed instead of deferrable workqueue in the flowtable,
from Yinjun Zhang.
6) Fix spurious EEXIST in case of add-after-delete flowtable in
the same batch.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
A typo is found out by codespell tool in 1734th line of drop_monitor.c:
$ codespell ./net/core/
./net/core/drop_monitor.c:1734: guarnateed ==> guaranteed
Fix a typo found by codespell.
Signed-off-by: Xiong Zhenwu <xiong.zhenwu@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
A typo is found out by codespell tool in 111th line of hsr_debugfs.c:
$ codespell ./net/hsr/
net/hsr/hsr_debugfs.c:111: Debufs ==> Debugfs
Fix typos found by codespell.
Signed-off-by: Xiong Zhenwu <xiong.zhenwu@zte.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order for a driver to be able to query a bridge for information
about itself, e.g. reading out port flags, it has to use a netdev that
is known to the bridge. In the simple case, that is just the netdev
representing the port, e.g. swp0 or swp1 in this example:
br0
/ \
swp0 swp1
But in the case of an offloaded lag, this will be the bond or team
interface, e.g. bond0 in this example:
br0
/
bond0
/ \
swp0 swp1
Add a helper that hides some of this complexity from the
drivers. Then, redefine dsa_port_offloads_bridge_port using the helper
to avoid double accounting of the set of possible offloaded uppers.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In __netif_set_xps_queue, old map entries from the old dev_maps are
freed but their corresponding entry in the old dev_maps aren't NULLed.
Fix this.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When setting up an new dev_maps in __netif_set_xps_queue, we remove and
free maps from unused CPUs/rx-queues near the end of the function; by
calling remove_xps_queue. However it's possible those maps are also part
of the old not-freed-yet dev_maps, which might be used concurrently.
When that happens, a map can be freed while its corresponding entry in
the old dev_maps table isn't NULLed, leading to: "BUG: KASAN:
use-after-free" in different places.
This fixes the map freeing logic for unused CPUs/rx-queues, to also NULL
the map entries from the old dev_maps table.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most of the xps_cpus_show and xps_rxqs_show functions share the same
logic. Having it in two different functions does not help maintenance.
This patch moves their common logic into a new function, xps_queue_show,
to improve this.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that nr_ids and num_tc are stored in the xps dev_maps, which are RCU
protected, we do not have the need to protect the maps in the rtnl lock.
Move the rtnl unlock up so we reduce the rtnl locking section.
We also increase the reference count on the subordinate device if any,
as we don't want this device to be freed while we use it (now that the
rtnl lock isn't protecting it in the whole function).
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Improve the readability of the loop removing tx-queue from unused
CPUs/rx-queues in __netif_set_xps_queue. The change should only be
cosmetic.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds an helper, xps_copy_dev_maps, to copy maps from dev_maps
to new_dev_maps at a given index. The logic should be the same, with an
improved code readability and maintenance.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the xps maps (xps_cpus_map and xps_rxqs_map) to an array in
net_device. That will simplify a lot the code removing the need for lots
of if/else conditionals as the correct map will be available using its
offset in the array.
This should not modify the xps maps behaviour in any way.
Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove the xps possible_mask. It was an optimization but we can just
loop from 0 to nr_ids now that it is embedded in the xps dev_maps. That
simplifies the code a bit.
Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Embed nr_ids (the number of cpu for the xps cpus map, and the number of
rxqs for the xps cpus map) in dev_maps. That will help not accessing out
of bound memory if those values change after dev_maps was allocated.
Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The xps cpus/rxqs map is accessed using dev->num_tc, which is used when
allocating the map. But later updates of dev->num_tc can lead to having
a mismatch between the maps and how they're accessed. In such cases the
map values do not make any sense and out of bound accesses can occur
(that can be easily seen using KASAN).
This patch aims at fixing this by embedding num_tc into the maps, using
the value at the time the map is created. This brings two improvements:
- The maps can be accessed using the embedded num_tc, so we know for
sure we won't have out of bound accesses.
- Checks can be made before accessing the maps so we know the values
retrieved will make sense.
We also update __netif_set_xps_queue to conditionally copy old maps from
dev_maps in the new one only if the number of traffic classes from both
maps match.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the implementations of xps_cpus_show and xps_rxqs_show to converge,
as the two share the same logic but diverted over time. This should not
modify their behaviour but will help future changes and improve
maintenance.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In net-sysfs, get_netdev_queue_index returns an unsigned int. Some of
its callers use an unsigned long to store the returned value. Update the
code to be consistent, this should only be cosmetic.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use bitmap_zalloc instead of zalloc_cpumask_var in xps_cpus_show to
align with xps_rxqs_show. This will improve maintenance and allow us to
factorize the two functions. The function should behave the same.
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
__dev_alloc_name(), when supplied with a name containing '%d',
will search for the first available device number to generate a
unique device name.
Since commit ff92741270 ("net:
introduce name_node struct to be used in hashlist") network
devices may have alternate names. __dev_alloc_name() does take
these alternate names into account, possibly generating a name
that is already taken and failing with -ENFILE as a result.
This demonstrates the bug:
# rmmod dummy 2>/dev/null
# ip link property add dev lo altname dummy0
# modprobe dummy numdummies=1
modprobe: ERROR: could not insert 'dummy': Too many open files in system
Instead of creating a device named dummy1, modprobe fails.
Fix this by checking all the names in the d->name_node list, not just d->name.
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Fixes: ff92741270 ("net: introduce name_node struct to be used in hashlist")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 6af1799aaf.
Commit 6af1799aaf ("ipv6: drop incoming packets having a v4mapped
source address") introduced an input check against v4mapped addresses.
Use of such addresses on the wire is indeed questionable and not
allowed on public Internet. As the commit pointed out
https://tools.ietf.org/html/draft-itojun-v6ops-v4mapped-harmful-02
lists potential issues.
Unfortunately there are applications which use v4mapped addresses,
and breaking them is a clear regression. For example v4mapped
addresses (or any semi-valid addresses, really) may be used
for uni-direction event streams or packet export.
Since the issue which sparked the addition of the check was with
TCP and request_socks in particular push the check down to TCPv6
and DCCP. This restores the ability to receive UDPv6 packets with
v4mapped address as the source.
Keep using the IPSTATS_MIB_INHDRERRORS statistic to minimize the
user-visible changes.
Fixes: 6af1799aaf ("ipv6: drop incoming packets having a v4mapped source address")
Reported-by: Sunyi Shao <sunyishao@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>