Another thing making netns dismantles potentially very slow is located
in gro_cells_destroy(),
whenever cleanup_net() has to remove a device using gro_cells framework.
RTNL is not held at this stage, so synchronize_net()
is calling synchronize_rcu():
netdev_run_todo()
ip_tunnel_dev_free()
gro_cells_destroy()
synchronize_net()
synchronize_rcu() // Ouch.
This patch uses call_rcu(), and gave me a 25x performance improvement
in my tests.
cleanup_net() is no longer blocked ~10 ms per synchronize_rcu()
call.
In the case we could not allocate the memory needed to queue the
deferred free, use synchronize_rcu_expedited()
v2: made percpu_free_defer_callback() static
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220220041155.607637-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Russell King says:
====================
net: dsa: b53: convert to phylink_generic_validate() and mark as non-legacy
This series converts b53 to use phylink_generic_validate() and also
marks this driver as non-legacy.
Patch 1 cleans up an if() condition to be more readable before we
proceed with the conversion.
Patch 2 populates the supported_interfaces and mac_capabilities members
of phylink_config.
Patch 3 drops the use of phylink_helper_basex_speed() which is now not
necessary.
Patch 4 switches the driver to use phylink_generic_validate()
Patch 5 marks the driver as non-legacy.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The B53 driver does not make use of the speed, duplex, pause or
advertisement in its phylink_mac_config() implementation, so it can be
marked as a non-legacy driver.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch the Broadcom b53 driver to using the phylink_generic_validate()
implementation by removing its own .phylink_validate method and
associated code.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we have a better method to select SFP interface modes, we
no longer need to use phylink_helper_basex_speed() in a driver's
validation function.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
Populate the supported interfaces and MAC capabilities for the Broadcom
B53 DSA switches in preparation to using these for the generic
validation functionality.
The interface modes are derived from:
- b53_serdes_phylink_validate()
- SRAB mux configuration
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
I've stared at this if() statement for a while trying to work out if
it really does correspond with the comment above, and it does seem to.
However, let's make it more readable and phrase it in the same way as
the comment.
Also add a FIXME into the comment - we appear to deny Gigabit modes for
802.3z interface modes, but 802.3z interface modes only operate at
gigabit and above.
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
In hsr, lockdep_is_held() is needed for rcu_dereference_bh_check().
But if lockdep is not enabled, lockdep_is_held() causes a build error:
ERROR: modpost: "lockdep_is_held" [net/hsr/hsr.ko] undefined!
Thus, this patch solved by adding lockdep_hsr_is_held(). This helper
function calls the lockdep_is_held() when lockdep is enabled, and returns 1
if not defined.
Fixes: e7f2742068 ("net: hsr: fix suspicious RCU usage warning in hsr_node_get_first()")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Juhee Kang <claudiajkang@gmail.com>
Link: https://lore.kernel.org/r/20220220153250.5285-1-claudiajkang@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rakesh Babu Saladi says:
====================
RVU AF and NETDEV drivers' PTP updates.
Patch 1: Add suppot such that RVU drivers support new timestamp format.
Patch 2: This patch adds workaround for PTP errata.
Changes made from v1 to v2
1. CC'd Richard Cochran to review PTP related patches.
2. Removed a patch from the old patch series. Will submit the removed patch
separately.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds workaround for PTP errata given below.
1. At the time of 1 sec rollover of nano-second counter,
the nano-second counter is set to 0. However, it should
be set to (existing counter_value - 10^9). This leads to
an accumulating error in the timestamp value with each sec
rollover.
2. Additionally, the nano-second counter currently is rolling
over at 'h3B9A_C9FF. It should roll over at 'h3B9A_CA00.
The workaround for issue #1 is to speed up the ptp clock by
adjusting PTP_CLOCK_COMP register to the desired value to
compensate for the nanoseconds lost per each second.
The workaround for issue #2 is to slow down the ptp clock
such that the rollover occurs at ~1sec.
Signed-off-by: Naveen Mamindlapalli <naveenm@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: Rakesh Babu Saladi <rsaladi2@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The cn10k hardware ptp timestamp format has been modified primarily
to support 1-step ptp clock. The 64-bit timestamp used by hardware is
split into two 32-bit fields, the upper one holds seconds, the lower
one nanoseconds. A new register (PTP_CLOCK_SEC) has been added that
returns the current seconds value. The nanoseconds register PTP_CLOCK_HI
resets after every second. The cn10k RPM block provides Rx/Tx timestamps
to the NIX block using the new timestamp format. The software can read
the current timestamp in nanoseconds by reading both PTP_CLOCK_SEC &
PTP_CLOCK_HI registers.
This patch provides support for new timestamp format.
Signed-off-by: Naveen Mamindlapalli <naveenm@marvell.com>
Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: Rakesh Babu Saladi <rsaladi2@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hangbin Liu says:
====================
bonding: add IPv6 NS/NA monitor support
This patch add bond IPv6 NS/NA monitor support. A new option
ns_ip6_target is added, which is similar with arp_ip_target.
The IPv6 NS/NA monitor will take effect when there is a valid IPv6
address. Both ARP monitor and NS monitor will working at the same time.
A new extra storage field is added to struct bond_opt_value for IPv6 support.
Function bond_handle_vlan() is split from bond_arp_send() for both
IPv4/IPv6 usage.
To alloc NS message and send out. ndisc_ns_create() and ndisc_send_skb()
are exported.
v1 -> v2:
1. remove sysfs entry[1] and only keep netlink support.
RFC -> v1:
1. define BOND_MAX_ND_TARGETS as BOND_MAX_ARP_TARGETS
2. adjust for reverse xmas tree ordering of local variables
3. remove bond_do_ns_validate()
4. add extra field for bond_opt_value
5. set IS_ENABLED(CONFIG_IPV6) for IPv6 codes
[1] https://lore.kernel.org/netdev/8863.1645071997@famine
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add a new bonding option ns_ip6_target, which correspond
to the arp_ip_target. With this we set IPv6 targets and send IPv6 NS
request to determine the health of the link.
For other related options like the validation, we still use
arp_validate, and will change to ns_validate later.
Note: the sysfs configuration support was removed based on
https://lore.kernel.org/netdev/8863.1645071997@famine
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new bonding parameter ns_targets to store IPv6 address.
Add required bond_ns_send/rcv functions first before adding
IPv6 address option setting.
Add two functions bond_send/rcv_validate so we can send/recv
ARP and NS at the same time.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding an extra storage field for bond_opt_value so we can set large
bytes of data for bonding options in future, e.g. IPv6 address.
Define a new call bond_opt_initextra(). Also change the checking order of
__bond_opt_init() and check values first.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Function bond_handle_vlan() is split from bond_arp_send() for later
IPv6 usage.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch separate NS message allocation steps from ndisc_send_ns(),
so it could be used in other places, like bonding, to allocate and
send IPv6 NS message.
Also export ndisc_send_skb() and ndisc_ns_create() for later bonding usage.
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
'max_rx_len' can be up to GBETH_RX_BUFF_MAX (i.e. 8192) (see
'gbeth_hw_info').
The default value of 'num_rx_ring' can be BE_RX_RING_SIZE (i.e. 1024).
So this loop can allocate 8 Mo of memory.
Previous memory allocations in this function already use GFP_KERNEL, so
use __netdev_alloc_skb() and an explicit GFP_KERNEL instead of a
implicit GFP_ATOMIC.
This gives more opportunities of successful allocation.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use skb_put_zero() instead of hand-writing it. This saves a few lines of
code and is more readable.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel says:
====================
ipv4: Invalidate neighbour for broadcast address upon address addition
Patch #1 solves a recently reported issue [1]. See detailed description
in the changelog.
Patch #2 adds a matching test case.
Targeting at net-next since as far as I can tell this use case never
worked.
There are no regressions in fib_tests.sh with this change:
# ./fib_tests.sh
...
Tests passed: 186
Tests failed: 0
[1] https://lore.kernel.org/netdev/55a04a8f-56f3-f73c-2aea-2195923f09d1@huawei.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In case user space sends a packet destined to a broadcast address when a
matching broadcast route is not configured, the kernel will create a
unicast neighbour entry that will never be resolved [1].
When the broadcast route is configured, the unicast neighbour entry will
not be invalidated and continue to linger, resulting in packets being
dropped.
Solve this by invalidating unresolved neighbour entries for broadcast
addresses after routes for these addresses are internally configured by
the kernel. This allows the kernel to create a broadcast neighbour entry
following the next route lookup.
Another possible solution that is more generic but also more complex is
to have the ARP code register a listener to the FIB notification chain
and invalidate matching neighbour entries upon the addition of broadcast
routes.
It is also possible to wave off the issue as a user space problem, but
it seems a bit excessive to expect user space to be that intimately
familiar with the inner workings of the FIB/neighbour kernel code.
[1] https://lore.kernel.org/netdev/55a04a8f-56f3-f73c-2aea-2195923f09d1@huawei.com/
Reported-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Tested-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Open coded calculation can be avoided and replaced by the
equivalent csum_replace_by_diff() and csum_sub().
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Menglong Dong says:
====================
net: add skb drop reasons to TCP packet receive
In the commit c504e5c2f9 ("net: skb: introduce kfree_skb_reason()"),
we added the support of reporting the reasons of skb drops to kfree_skb
tracepoint. And in this series patches, reasons for skb drops are added
to TCP layer (both TCPv4 and TCPv6 are considered).
Following functions are processed:
tcp_v4_rcv()
tcp_v6_rcv()
tcp_v4_inbound_md5_hash()
tcp_v6_inbound_md5_hash()
tcp_add_backlog()
tcp_v4_do_rcv()
tcp_v6_do_rcv()
tcp_rcv_established()
tcp_data_queue()
tcp_data_queue_ofo()
The functions we handled are mostly for packet ingress, as skb drops
hardly happens in the egress path of TCP layer. However, it's a little
complex for TCP state processing, as I find that it's hard to report skb
drop reasons to where it is freed. For example, when skb is dropped in
tcp_rcv_state_process(), the reason can be caused by the call of
tcp_v4_conn_request(), and it's hard to return a drop reason from
tcp_v4_conn_request(). So such cases are skipped for this moment.
Following new drop reasons are introduced (what they mean can be see
in the document for them):
/* SKB_DROP_REASON_TCP_MD5* corresponding to LINUX_MIB_TCPMD5* */
SKB_DROP_REASON_TCP_MD5NOTFOUND
SKB_DROP_REASON_TCP_MD5UNEXPECTED
SKB_DROP_REASON_TCP_MD5FAILURE
SKB_DROP_REASON_SOCKET_BACKLOG
SKB_DROP_REASON_TCP_FLAGS
SKB_DROP_REASON_TCP_ZEROWINDOW
SKB_DROP_REASON_TCP_OLD_DATA
SKB_DROP_REASON_TCP_OVERWINDOW
/* corresponding to LINUX_MIB_TCPOFOMERGE */
SKB_DROP_REASON_TCP_OFOMERGE
Here is a example to get TCP packet drop reasons from ftrace:
$ echo 1 > /sys/kernel/debug/tracing/events/skb/kfree_skb/enable
$ cat /sys/kernel/debug/tracing/trace
$ <idle>-0 [036] ..s1. 647.428165: kfree_skb: skbaddr=000000004d037db6 protocol=2048 location=0000000074cd1243 reason: NO_SOCKET
$ <idle>-0 [020] ..s2. 639.676674: kfree_skb: skbaddr=00000000bcbfa42d protocol=2048 location=00000000bfe89d35 reason: PROTO_MEM
From the reason 'PROTO_MEM' we can know that the skb is dropped because
the memory configured in net.ipv4.tcp_mem is up to the limition.
Changes since v2:
- remove the 'inline' of tcp_drop() in the 1th patch, as Jakub
suggested
Changes since v1:
- enrich the document for this series patches in the cover letter,
as Eric suggested
- fix compile warning report by Jakub in the 6th patch
- let NO_SOCKET trump the XFRM failure in the 2th and 3th patches
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace tcp_drop() used in tcp_data_queue() with tcp_drop_reason().
Following drop reasons are introduced:
SKB_DROP_REASON_TCP_ZEROWINDOW
SKB_DROP_REASON_TCP_OLD_DATA
SKB_DROP_REASON_TCP_OVERWINDOW
SKB_DROP_REASON_TCP_OLD_DATA is used for the case that end_seq of skb
less than the left edges of receive window. (Maybe there is a better
name?)
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass the address of drop reason to tcp_v4_inbound_md5_hash() and
tcp_v6_inbound_md5_hash() to store the reasons for skb drops when this
function fails. Therefore, the drop reason can be passed to
kfree_skb_reason() when the skb needs to be freed.
Following drop reasons are added:
SKB_DROP_REASON_TCP_MD5NOTFOUND
SKB_DROP_REASON_TCP_MD5UNEXPECTED
SKB_DROP_REASON_TCP_MD5FAILURE
SKB_DROP_REASON_TCP_MD5* above correspond to LINUX_MIB_TCPMD5*
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
For TCP protocol, tcp_drop() is used to free the skb when it needs
to be dropped. To make use of kfree_skb_reason() and pass the drop
reason to it, introduce the function tcp_drop_reason(). Meanwhile,
make tcp_drop() an inline call to tcp_drop_reason().
Reviewed-by: Mengen Sun <mengensun@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The KSZ9477 SPI driver already has support for the KSZ8563. The same switch
chip can also be managed via i2c and we have an KSZ9477 I2C driver, but
that one lacks the relevant compatible entry. Add it.
DT bindings already describe this compatible.
Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Provide access to HW offloaded packets over stats64 interface.
The rx/tx_bytes values needed some fixing since HW is accounting size of
the Ethernet frame together with FCS.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King says:
====================
net: phylink: remove pcs_poll
This small series removes the now unused pcs_poll members from DSA and
phylink. "git grep pcs_poll drivers/net/ net/" on net-next confirms that
the only places that reference this are in DSA core code and phylink
code:
drivers/net/phy/phylink.c: if (pl->config->pcs_poll || pcs->poll)
drivers/net/phy/phylink.c: poll |= pl->config->pcs_poll;
net/dsa/port.c: dp->pl_config.pcs_poll = ds->pcs_poll;
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
phylink_config's pcs_poll is no longer used, let's get rid of it.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
With drivers converted over to using phylink PCS, there is no need for
the struct dsa_switch member "pcs_poll" to exist anymore - there is a
flag in the struct phylink_pcs which indicates whether this PCS needs
to be polled which supersedes this.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>