Instead of (struct rt6_info *)dst casts, we can use :
#define dst_rt6_info(_ptr) \
container_of_const(_ptr, struct rt6_info, dst)
Some places needed missing const qualifiers :
ip6_confirm_neigh(), ipv6_anycast_destination(),
ipv6_unicast_destination(), has_gateway()
v2: added missing parts (David Ahern)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata says:
====================
mlxsw: Improve events processing performance
Amit Cohen writes:
Spectrum ASICs only support a single interrupt, it means that all the
events are handled by one IRQ (interrupt request) handler.
Currently, we schedule a tasklet to handle events in EQ, then we also use
tasklet for CQ, SDQ and RDQ. Tasklet runs in softIRQ (software IRQ)
context, and will be run on the same CPU which scheduled it. It means that
today we have one CPU which handles all the packets (both network packets
and EMADs) from hardware.
The existing implementation is not efficient and can be improved.
Measuring latency of EMADs in the driver (without the time in FW) shows
that latency is increased by factor of 28 (x28) when network traffic is
handled by the driver.
Measuring throughput in CPU shows that CPU can handle ~35% less packets
of specific flow when corrupted packets are also handled by the driver.
There are cases that these values even worse, we measure decrease of ~44%
packet rate.
This can be improved if network packet and EMADs will be handled in
parallel by several CPUs, and more than that, if different types of traffic
will be handled in parallel. We can achieve this using NAPI.
This set converts the driver to process completions from hardware via NAPI.
The idea is to add NAPI instance per CQ (which is mapped 1:1 to SDQ/RDQ),
which means that each DQ can be handled separately. we have DQ for EMADs
and DQs for each trap group (like LLDP, BGP, L3 drops, etc..). See more
details in commit messages.
An additional improvement which is done as part of this set is related to
doorbells' ring. The idea is to handle small chunks of Rx packets (which
is also recommended using NAPI) and ring doorbells once per chunk. This
reduces the access to hardware which is expensive (time wise) and might
take time because of memory barriers.
With this set we can see better performance.
To summerize:
EMADs latency:
+------------------------------------------------------------------------+
| | Before this set | Now |
|------------------|---------------------------|-------------------------|
| Increased factor | x28 | x1.5 |
+------------------------------------------------------------------------+
Note that we can see even measurements that show better latency when
traffic is handled by the driver.
Throughput:
+------------------------------------------------------------------------+
| | Before this set | Now |
|-------------|----------------------------|-----------------------------|
| Reduced | 35% | 6% |
| packet rate | | |
+------------------------------------------------------------------------+
Additional improvements are planned - use page pool for buffer allocations
and avoid cache miss of each SKB using napi_build_skb().
Patch set overview:
Patches #1-#2 improve access to hardware by reducing dorbells' rings
Patch #3-#4 are preaparations for NAPI usage
Patch #5 converts the driver to use NAPI
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Spectrum ASICs only support a single interrupt, that means that all the
events are handled by one IRQ (interrupt request) handler. Once an
interrupt is received, we schedule tasklet to handle events from EQ and
then schedule tasklets to handle completions from CQs. Tasklet runs in
softIRQ (software IRQ) context, and will be run on the same CPU which
scheduled it. That means that today we use only one CPU to handle all the
packets (both network packets and EMADs) from hardware.
This can be improved using NAPI. The idea is to use NAPI instance per
CQ, which is mapped 1:1 to DQ (RDQ or SDQ). NAPI poll method can be run
in kernel thread, so then the driver will be able to handle WQEs in several
CPUs. Convert the existing code to use NAPI APIs.
Add NAPI instance as part of 'struct mlxsw_pci_queue' and initialize it
as part of CQs initialization. Set the appropriate poll method and dummy
net device, according to queue number, similar to tasklet setup. For CQs
which are used for completions of RDQ, use Rx poll method and
'napi_dev_rx', which is set as 'threaded'. It means that Rx poll method
will run in kernel context, so several RDQs will be handled in parallel.
For CQs which are used for completions of SDQ, use Tx poll method and
'napi_dev_tx', this method will run in softIRQ context, as it is
recommended in NAPI documentation, as Tx packets' processing is short task.
Convert mlxsw_pci_cq_{rx,tx}_tasklet() to poll methods. Handle 'budget'
argument - ignore it in Tx poll method, as it is recommended to not limit
Tx processing. For Rx processing, handle up to 'budget' completions.
Return 'work_done' which is the amount of completions that were handled.
Handle the following cases:
1. After processing 'budget' completions, the driver still has work to do:
Return work-done = budget. In that case, the NAPI instance will be
polled again (without the need to be rescheduled). Do not re-arm the
queue, as NAPI will handle the reschedule, so we do not have to involve
hardware to send an additional interrupt for the completions that should
be processed.
2. Event processing has been completed:
Call napi_complete_done() to mark NAPI processing as completed, which
means that the poll method will not be rescheduled. Re-arm the queue,
as all completions were handled.
In case that poll method handled exactly 'budget' completions, return
work-done = budget -1, to distinguish from the case that driver still
has completions to handle. Otherwise, return the amount of completions
that were handled.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The next patch will set the driver to use NAPI for event processing. Then
tasklet mechanism will be used only for EQ. Reorganize 'mlxsw_pci_queue'
to hold EQ and CQ attributes in a union. For now, add tasklet for both EQ
and CQ. This will be changed in the next patch, as 'tasklet_struct' will be
replaced with NAPI instance.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
mlxsw will use NAPI for event processing in a next patch. As preparation,
add two dummy net devices and initialize them.
NAPI instance should be attached to net device. Usually each queue is used
by a single net device in network drivers, so the mapping between net
device to NAPI instance is intuitive. In our case, Rx queues are not per
port, they are per trap-group. Tx queues are mapped to net devices, but we
do not have a separate queue for each local port, several ports share the
same queue.
Use init_dummy_netdev() to initialize dummy net devices for NAPI.
To run NAPI poll method in a kernel thread, the net device which NAPI
instance is attached to should be marked as 'threaded'. It is
recommended to handle Tx packets in softIRQ context, as usually this is
a short task - just free the Tx packet which has been transmitted.
Rx packets handling is more complicated task, so drivers can use a
dedicated kernel thread to process them. It allows processing packets from
different Rx queues in parallel. We would like to handle only Rx packets in
kernel threads, which means that we will use two dummy net devices
(one for Rx and one for Tx). Set only one of them with 'threaded' as it
will be used for Rx processing. Do not fail in case that setting 'threaded'
fails, as it is better to use regular softIRQ NAPI rather than preventing
the driver from loading.
Note that the net devices are initialized with init_dummy_netdev(), so
they are not registered, which means that they will not be visible to user.
It will not be possible to change 'threaded' configuration from user
space, but it is reasonable in our case, as there is no another
configuration which makes sense, considering that user has no influence
on the usage of each queue.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, for each CQE in CQ, we ring CQ doorbell, then handle RDQ and
ring RDQ doorbell. Finally we ring CQ arm doorbell - once per CQ tasklet.
The idea of ringing CQ doorbell before RDQ doorbell, is to be sure that
when we post new WQE (after RDQ is handled), there is an available CQE.
This was done because of a hardware bug as part of
commit c9ebea04cb ("mlxsw: pci: Ring CQ's doorbell before RDQ's").
There is no real reason to ring RDQ and CQ doorbells for each completion,
it is better to handle several completions and reduce number of ringings,
as access to hardware is expensive (time wise) and might take time because
of memory barriers.
A previous patch changed CQ tasklet to handle up to 64 Rx packets. With
this limitation, we can ring CQ and RDQ doorbells once per CQ tasklet.
The counters of the doorbells are increased by the amount of packets
that we handled, then the device will know for which completion to send
an additional event.
To avoid reordering CQ and RDQ doorbells' ring, let the tasklet to ring
also RDQ doorbell, mlxsw_pci_cqe_rdq_handle() handles the counter but
does not ring the doorbell.
Note that with this change there is no need to copy the CQE, as we ring CQ
doorbell only after Rx packet processing (which uses the CQE) is done.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can get many completions in one interrupt. Currently, the CQ tasklet
handles up to half queue size completions, and then arms the hardware to
generate additional events, which means that in case that there were
additional completions that we did not handle, we will get immediately an
additional interrupt to handle the rest.
The decision to handle up to half of the queue size is arbitrary and was
determined in 2015, when mlxsw driver was added to the kernel. One
additional fact that should be taken into account is that while WQEs
from RDQ are handled, the CPU that handles the tasklet is dedicated for
this task, which means that we might hold the CPU for a long time.
Handle WQEs in smaller chucks, then arm CQ doorbell to notify the hardware
to send additional notifications. Set the chunk size to 64 as this number
is recommended using NAPI and the driver will use NAPI in a next patch.
Note that for now we use ARM doorbell to retrigger CQ tasklet, but with
NAPI it will be more efficient as software will reschedule the poll
method and we will not involve hardware for that.
Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a followup of commit c4e86b4363 ("net: add two more
call_rcu_hurry()")
fib6_info_destroy_rcu() is calling nexthop_put() or fib6_nh_release()
We must not delay it too much or risk unregister_netdevice/ref_tracker
traces because references to netdev are not released in time.
This should speedup device/netns dismantles when CONFIG_RCU_LAZY=y
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a followup of commit c4e86b4363 ("net: add two more
call_rcu_hurry()")
Our reference to ifa->ifa_dev must be freed ASAP
to release the reference to the netdev the same way.
inet_rcu_free_ifa()
in_dev_put()
-> in_dev_finish_destroy()
-> netdev_put()
This should speedup device/netns dismantles when CONFIG_RCU_LAZY=y
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This came while reviewing commit c4e86b4363 ("net: add two more
call_rcu_hurry()").
Paolo asked if adding one synchronize_rcu() would help.
While synchronize_rcu() does not help, making sure to call
rcu_barrier() before msleep(wait) is definitely helping
to make sure lazy call_rcu() are completed.
Instead of waiting ~100 seconds in my tests, the ref_tracker
splats occurs one time only, and netdev_wait_allrefs_any()
latency is reduced to the strict minimum.
Ideally we should audit our call_rcu() users to make sure
no refcount (or cascading call_rcu()) is held too long,
because rcu_barrier() is quite expensive.
Fixes: 0e4be9e57e ("net: use exponential backoff in netdev_wait_allrefs")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/all/28bbf698-befb-42f6-b561-851c67f464aa@kernel.org/T/#m76d73ed6b03cd930778ac4d20a777f22a08d6824
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the base-time for taprio is in the past, start the schedule at the time
of the form "base_time + N*cycle_time" where N is the smallest possible
integer such that the above time is in the future.
Signed-off-by: Tanmay Patil <t-patil@ti.com>
Signed-off-by: Chintan Vankar <c-vankar@ti.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
When using YNL in tests appending the doc string to the type
name makes it harder to check that we got the correct error.
Put the doc under a separate key.
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://lore.kernel.org/r/20240426003111.359285-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Throw a slightly more helpful exception when env variables
are partially populated. Prior to this change we'd get
a dictionary key exception somewhere later on.
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20240425222341.309778-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
I forgot to call tcp_skb_collapse_tstamp() in the
case we consume the second skb in write queue.
Neal suggested to create a common helper used by tcp_mtu_probe()
and tcp_grow_skb().
Fixes: 8ee602c635 ("tcp: try to send bigger TSO packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20240425193450.411640-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jason Xing says:
====================
Implement reset reason mechanism to detect
From: Jason Xing <kernelxing@tencent.com>
In production, there are so many cases about why the RST skb is sent but
we don't have a very convenient/fast method to detect the exact underlying
reasons.
RST is implemented in two kinds: passive kind (like tcp_v4_send_reset())
and active kind (like tcp_send_active_reset()). The former can be traced
carefully 1) in TCP, with the help of drop reasons, which is based on
Eric's idea[1], 2) in MPTCP, with the help of reset options defined in
RFC 8684. The latter is relatively independent, which should be
implemented on our own, such as active reset reasons which can not be
replace by skb drop reason or something like this.
In this series, I focus on the fundamental implement mostly about how
the rstreason mechanism works and give the detailed passive part as an
example, not including the active reset part. In future, we can go
further and refine those NOT_SPECIFIED reasons.
Here are some examples when tracing:
<idle>-0 [002] ..s1. 1830.262425: tcp_send_reset: skbaddr=x
skaddr=x src=x dest=x state=x reason=NOT_SPECIFIED
<idle>-0 [002] ..s1. 1830.262425: tcp_send_reset: skbaddr=x
skaddr=x src=x dest=x state=x reason=NO_SOCKET
[1]
Link: https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w@mail.gmail.com/
====================
Link: https://lore.kernel.org/r/20240425031340.46946-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
At last, we should let it work by introducing this reset reason in
trace world.
One of the possible expected outputs is:
... tcp_send_reset: skbaddr=xxx skaddr=xxx src=xxx dest=xxx
state=TCP_ESTABLISHED reason=NOT_SPECIFIED
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Since we have mapped every mptcp reset reason definition in enum
sk_rst_reason, introducing a new helper can cover some missing places
where we have already set the subflow->reset_reason.
Note: using SK_RST_REASON_NOT_SPECIFIED is the same as
SK_RST_REASON_MPTCP_RST_EUNSPEC. They are both unknown. So we can convert
it directly.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
It relies on what reset options in the skb are as rfc8684 says. Reusing
this logic can save us much energy. This patch replaces most of the prior
NOT_SPECIFIED reasons.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reuse the dropreason logic to show the exact reason of tcp reset,
so we can finally display the corresponding item in enum sk_reset_reason
instead of reinventing new reset reasons. This patch replaces all
the prior NOT_SPECIFIED reasons.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Like what we did to passive reset:
only passing possible reset reason in each active reset path.
No functional changes.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Adjust the parameter and support passing reason of reset which
is for now NOT_SPECIFIED. No functional changes.
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add a new standalone file for the easy future extension to support
both active reset and passive reset in the TCP/DCCP/MPTCP protocols.
This patch only does the preparations for reset reason mechanism,
nothing else changes.
The reset reasons are divided into three parts:
1) reuse drop reasons for passive reset in TCP
2) our own independent reasons which aren't relying on other reasons at all
3) reuse MP_TCPRST option for MPTCP
The benefits of a standalone reset reason are listed here:
1) it can cover more than one case, such as reset reasons in MPTCP,
active reset reasons.
2) people can easily/fastly understand and maintain this mechanism.
3) we get unified format of output with prefix stripped.
4) more new reset reasons are on the way
...
I will implement the basic codes of active/passive reset reason in
those three protocols, which are not complete for this moment. For
passive reset part in TCP, I only introduce the NO_SOCKET common case
which could be set as an example.
After this series applied, it will have the ability to open a new
gate to let other people contribute more reasons into it :)
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This patch adds support to per-packet Tx hardware timestamp request to
AF_XDP zero-copy packet via XDP Tx metadata framework. Please note that
user needs to enable Tx HW timestamp capability via igc_ioctl() with
SIOCSHWTSTAMP cmd before sending xsk Tx hardware timestamp request.
Same as implementation in RX timestamp XDP hints kfunc metadata, Timer 0
(adjustable clock) is used in xsk Tx hardware timestamp. i225/i226 have
four sets of timestamping registers. *skb and *xsk_tx_buffer pointers
are used to indicate whether the timestamping register is already occupied.
Furthermore, a boolean variable named xsk_pending_ts is used to hold the
transmit completion until the tx hardware timestamp is ready. This is
because, for i225/i226, the timestamp notification event comes some time
after the transmit completion event. The driver will retrigger hardware irq
to clean the packet after retrieve the tx hardware timestamp.
Besides, xsk_meta is added into struct igc_tx_timestamp_request as a hook
to the metadata location of the transmit packet. When the Tx timestamp
interrupt is fired, the interrupt handler will copy the value of Tx hwts
into metadata location via xsk_tx_metadata_complete().
This patch is tested with tools/testing/selftests/bpf/xdp_hw_metadata
on Intel ADL-S platform. Below are the test steps and results.
Test Step 1: Run xdp_hw_metadata app
./xdp_hw_metadata <iface> > /dev/shm/result.log
Test Step 2: Enable Tx hardware timestamp
hwstamp_ctl -i <iface> -t 1 -r 1
Test Step 3: Run ptp4l and phc2sys for time synchronization
Test Step 4: Generate UDP packets with 1ms interval for 10s
trafgen --dev <iface> '{eth(da=<addr>), udp(dp=9091)}' -t 1ms -n 10000
Test Step 5: Rerun Step 1-3 with 10s iperf3 as background traffic
Test Step 6: Rerun Step 1-4 with 10s iperf3 as background traffic
Based on iperf3 results below, the impact of holding tx completion to
throughput is not observable.
Result of last UDP packet (no. 10000) in Step 4:
poll: 1 (0) skip=99 fail=0 redir=10000
xsk_ring_cons__peek: 1
0x5640a37972d0: rx_desc[9999]->addr=f2110 addr=f2110 comp_addr=f2110 EoP
rx_hash: 0x2049BE1D with RSS type:0x1
HW RX-time: 1679819246792971268 (sec:1679819246.7930) delta to User RX-time sec:0.0000 (14.990 usec)
XDP RX-time: 1679819246792981987 (sec:1679819246.7930) delta to User RX-time sec:0.0000 (4.271 usec)
No rx_vlan_tci or rx_vlan_proto, err=-95
0x5640a37972d0: ping-pong with csum=ab19 (want 315b) csum_start=34 csum_offset=6
0x5640a37972d0: complete tx idx=9999 addr=f010
HW TX-complete-time: 1679819246793036971 (sec:1679819246.7930) delta to User TX-complete-time sec:0.0001 (77.656 usec)
XDP RX-time: 1679819246792981987 (sec:1679819246.7930) delta to User TX-complete-time sec:0.0001 (132.640 usec)
HW RX-time: 1679819246792971268 (sec:1679819246.7930) delta to HW TX-complete-time sec:0.0001 (65.703 usec)
0x5640a37972d0: complete rx idx=10127 addr=f2110
Result of iperf3 without tx hwts request in step 5:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 2.74 GBytes 2.36 Gbits/sec 0 sender
[ 5] 0.00-10.05 sec 2.74 GBytes 2.34 Gbits/sec receiver
Result of iperf3 running parallel with trafgen command in step 6:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 2.74 GBytes 2.36 Gbits/sec 0 sender
[ 5] 0.00-10.04 sec 2.74 GBytes 2.34 Gbits/sec receiver
Co-developed-by: Lai Peter Jun Ann <jun.ann.lai@intel.com>
Signed-off-by: Lai Peter Jun Ann <jun.ann.lai@intel.com>
Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Tested-by: Naama Meir <naamax.meir@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://lore.kernel.org/r/20240424210256.3440903-1-anthony.l.nguyen@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jiri Pirko says:
====================
selftests: virtio_net: introduce initial testing infrastructure
This patchset aims at introducing very basic initial infrastructure
for virtio_net testing, namely it focuses on virtio feature testing.
The first patch adds support for debugfs for virtio devices, allowing
user to filter features to pretend to be driver that is not capable
of the filtered feature.
Example:
$ cat /sys/bus/virtio/devices/virtio0/features
1110010111111111111101010000110010000000100000000000000000000000
$ echo "5" >/sys/kernel/debug/virtio/virtio0/filter_feature_add
$ cat /sys/kernel/debug/virtio/virtio0/filter_features
5
$ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/unbind
$ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/bind
$ cat /sys/bus/virtio/devices/virtio0/features
1110000111111111111101010000110010000000100000000000000000000000
Leverage that in the last patch that lays ground for virtio_net
selftests testing, including very basic F_MAC feature test.
To run this, do:
$ make -C tools/testing/selftests/ TARGETS=drivers/net/virtio_net/ run_tests
It is assumed, as with lot of other selftests in the net group,
that there are netdevices connected back-to-back. In this case,
two virtio_net devices connected back to back. If you use "tap" qemu
netdevice type, to configure this loop on a hypervisor, one may use
this script:
DEV1="$1"
DEV2="$2"
sudo tc qdisc add dev $DEV1 clsact
sudo tc qdisc add dev $DEV2 clsact
sudo tc filter add dev $DEV1 ingress protocol all pref 1 matchall action mirred egress redirect dev $DEV2
sudo tc filter add dev $DEV2 ingress protocol all pref 1 matchall action mirred egress redirect dev $DEV1
sudo ip link set $DEV1 up
sudo ip link set $DEV2 up
Another possibility is to use virtme-ng like this:
$ vng --network=loop
or directly:
$ vng --network=loop -- make -C tools/testing/selftests/ TARGETS=drivers/net/virtio_net/ run_tests
"loop" network type will take care of creating two "hubport" qemu netdevs
putting them into a single hub.
To do it manually with qemu, pass following command line options:
-nic hubport,hubid=1,id=nd0,model=virtio-net-pci
-nic hubport,hubid=1,id=nd1,model=virtio-net-pci
====================
Link: https://lore.kernel.org/r/20240424104049.3935572-1-jiri@resnulli.us
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Introduce initial tests for virtio_net driver. Focus on feature testing
leveraging previously introduced debugfs feature filtering
infrastructure. Add very basic ping and F_MAC feature tests.
To run this, do:
$ make -C tools/testing/selftests/ TARGETS=drivers/net/virtio_net/ run_tests
Run it on a system with 2 virtio_net devices connected back-to-back
on the hypervisor.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Tested-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The existing setup_wait*() helper family check the status of the
interface to be up. Introduce wait_for_dev() to wait for the netdevice
to appear, for example after test script does manual device bind.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Add a helper to be used to check if the netdevice is backed by specified
driver.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Allow driver tests to work without specifying the netdevice names.
Introduce a possibility to search for available netdevices according to
set driver name. Allow test to specify the name by setting
NETIF_FIND_DRIVER variable.
Note that user overrides this either by passing netdevice names on the
command line or by declaring NETIFS array in custom forwarding.config
configuration file.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Benjamin Poirier <bpoirier@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Currently there is no way for user to set what features the driver
should obey or not, it is hard wired in the code.
In order to be able to debug the device behavior in case some feature is
disabled, introduce a debugfs infrastructure with couple of files
allowing user to see what features the device advertises and
to set filter for features used by driver.
Example:
$cat /sys/bus/virtio/devices/virtio0/features
1110010111111111111101010000110010000000100000000000000000000000
$ echo "5" >/sys/kernel/debug/virtio/virtio0/filter_feature_add
$ cat /sys/kernel/debug/virtio/virtio0/filter_features
5
$ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/unbind
$ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/bind
$ cat /sys/bus/virtio/devices/virtio0/features
1110000111111111111101010000110010000000100000000000000000000000
Note that sysfs "features" now already exists, this patch does not
touch it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Lukasz Majewski says:
====================
net: hsr: Add support for HSR-SAN (RedBOX)
This patch set provides v6 of HSR-SAN (RedBOX) as well as hsr_redbox.sh
test script.
The most straightforward way to test those patches is to use buildroot
(2024.02.01) to create rootfs and QEMU based environment to run x86_64
Linux.
Then one shall run hsr_redbox.sh and hsr_ping.sh from
tools/testing/selftests/net/hsr.
====================
Link: https://lore.kernel.org/r/20240423124908.2073400-1-lukma@denx.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This patch adds hsr_redbox.sh script to test if HSR-SAN mode of operation
works correctly.
Signed-off-by: Lukasz Majewski <lukma@denx.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Current code checks if ping command output match hardcoded pattern:
"10 packets transmitted, 10 received, 0% packet loss,".
Such approach will work only from one ping program version (for which
this test has been originally written).
This patch address problem when ping with different summary output
like "10 packets transmitted, 10 packets received, 0% packet" is
used to run this test - for example one from busybox (as the test
system runs in QEMU with rootfs created with buildroot).
The fix is to modify output of ping command to be agnostic to ping
version used on the platform.
Signed-off-by: Lukasz Majewski <lukma@denx.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Some of the code already present in the hsr_ping.sh test program can be
moved to a separate script file, so it can be reused by other HSR
functionality (like HSR-SAN) tests.
Signed-off-by: Lukasz Majewski <lukma@denx.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Some parts (like netns creation and cleanup) of hsr_ping.sh script are
already implemented in ../lib.sh common script, so can be replaced by it.
Signed-off-by: Lukasz Majewski <lukma@denx.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Introduce RedBox support (HSR-SAN to be more precise) for HSR networks.
Following traffic reduction optimizations have been implemented:
- Do not send HSR supervisory frames to Port C (interlink)
- Do not forward to HSR ring frames addressed to Port C
- Do not forward to Port C frames from HSR ring
- Do not send duplicate HSR frame to HSR ring when destination is Port C
The corresponding patch to modify iptable2 sources has already been sent:
https://lore.kernel.org/netdev/20240308145729.490863-1-lukma@denx.de/T/
Testing procedure (veth and netns):
-----------------------------------
One shall run:
linux-vanila/tools/testing/selftests/net/hsr/hsr_redbox.sh
(Detailed description of the setup one can find in the test
script file).
Testing procedure (real hardware):
----------------------------------
The EVB-KSZ9477 has been used for testing on net-next branch
(SHA1: 5fc68320c1).
Ports 4/5 were used for SW managed HSR (hsr1) as first hsr0 for ports 1/2
(with HW offloading for ksz9477) was created. Port 3 has been used as
interlink port (single USB-ETH dongle).
Configuration - RedBox (EVB-KSZ9477):
if link set lan1 down;ip link set lan2 down
ip link add name hsr0 type hsr slave1 lan1 slave2 lan2 supervision 45 version 1
ip link add name hsr1 type hsr slave1 lan4 slave2 lan5 interlink lan3 supervision 45 version 1
ip link set lan4 up;ip link set lan5 up
ip link set lan3 up
ip addr add 192.168.0.11/24 dev hsr1
ip link set hsr1 up
Configuration - DAN-H (EVB-KSZ9477):
ip link set lan1 down;ip link set lan2 down
ip link add name hsr0 type hsr slave1 lan1 slave2 lan2 supervision 45 version 1
ip link add name hsr1 type hsr slave1 lan4 slave2 lan5 supervision 45 version 1
ip link set lan4 up;ip link set lan5 up
ip addr add 192.168.0.12/24 dev hsr1
ip link set hsr1 up
This approach uses only SW based HSR devices (hsr1).
-------------- ----------------- ------------
DAN-H Port5 | <------> | Port5 | |
Port4 | <------> | Port4 Port3 | <---> | PC
| | (RedBox) | | (USB-ETH)
EVB-KSZ9477 | | EVB-KSZ9477 | |
-------------- ----------------- ------------
Signed-off-by: Lukasz Majewski <lukma@denx.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Xiumei and Christoph reported the following lockdep splat, complaining of
the qdisc root lock being taken twice:
============================================
WARNING: possible recursive locking detected
6.7.0-rc3+ #598 Not tainted
--------------------------------------------
swapper/2/0 is trying to acquire lock:
ffff888177190110 (&sch->q.lock){+.-.}-{2:2}, at: __dev_queue_xmit+0x1560/0x2e70
but task is already holding lock:
ffff88811995a110 (&sch->q.lock){+.-.}-{2:2}, at: __dev_queue_xmit+0x1560/0x2e70
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&sch->q.lock);
lock(&sch->q.lock);
*** DEADLOCK ***
May be due to missing lock nesting notation
5 locks held by swapper/2/0:
#0: ffff888135a09d98 ((&in_dev->mr_ifc_timer)){+.-.}-{0:0}, at: call_timer_fn+0x11a/0x510
#1: ffffffffaaee5260 (rcu_read_lock){....}-{1:2}, at: ip_finish_output2+0x2c0/0x1ed0
#2: ffffffffaaee5200 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x209/0x2e70
#3: ffff88811995a110 (&sch->q.lock){+.-.}-{2:2}, at: __dev_queue_xmit+0x1560/0x2e70
#4: ffffffffaaee5200 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x209/0x2e70
stack backtrace:
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 6.7.0-rc3+ #598
Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7353+9de0a3cc 04/01/2014
Call Trace:
<IRQ>
dump_stack_lvl+0x4a/0x80
__lock_acquire+0xfdd/0x3150
lock_acquire+0x1ca/0x540
_raw_spin_lock+0x34/0x80
__dev_queue_xmit+0x1560/0x2e70
tcf_mirred_act+0x82e/0x1260 [act_mirred]
tcf_action_exec+0x161/0x480
tcf_classify+0x689/0x1170
prio_enqueue+0x316/0x660 [sch_prio]
dev_qdisc_enqueue+0x46/0x220
__dev_queue_xmit+0x1615/0x2e70
ip_finish_output2+0x1218/0x1ed0
__ip_finish_output+0x8b3/0x1350
ip_output+0x163/0x4e0
igmp_ifc_timer_expire+0x44b/0x930
call_timer_fn+0x1a2/0x510
run_timer_softirq+0x54d/0x11a0
__do_softirq+0x1b3/0x88f
irq_exit_rcu+0x18f/0x1e0
sysvec_apic_timer_interrupt+0x6f/0x90
</IRQ>
This happens when TC does a mirred egress redirect from the root qdisc of
device A to the root qdisc of device B. As long as these two locks aren't
protecting the same qdisc, they can be acquired in chain: add a per-qdisc
lockdep key to silence false warnings.
This dynamic key should safely replace the static key we have in sch_htb:
it was added to allow enqueueing to the device "direct qdisc" while still
holding the qdisc root lock.
v2: don't use static keys anymore in HTB direct qdiscs (thanks Eric Dumazet)
CC: Maxim Mikityanskiy <maxim@isovalent.com>
CC: Xiumei Mu <xmu@redhat.com>
Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/451
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Link: https://lore.kernel.org/r/7dc06d6158f72053cf877a82e2a7a5bd23692faa.1713448007.git.dcaratti@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tony Nguyen says:
====================
net: intel: start The Great Code Dedup + Page Pool for iavf
Alexander Lobakin says:
Here's a two-shot: introduce {,Intel} Ethernet common library (libeth and
libie) and switch iavf to Page Pool. Details are in the commit messages;
here's a summary:
Not a secret there's a ton of code duplication between two and more Intel
ethernet modules. Before introducing new changes, which would need to be
copied over again, start decoupling the already existing duplicate
functionality into a new module, which will be shared between several
Intel Ethernet drivers. The first name that came to my mind was
"libie" -- "Intel Ethernet common library". Also this sounds like
"lovelie" (-> one word, no "lib I E" pls) and can be expanded as
"lib Internet Explorer" :P
The "generic", pure-software part is placed separately, so that it can be
easily reused in any driver by any vendor without linking to the Intel
pre-200G guts. In a few words, it's something any modern driver does the
same way, but nobody moved it level up (yet).
The series is only the beginning. From now on, adding every new feature
or doing any good driver refactoring will remove much more lines than add
for quite some time. There's a basic roadmap with some deduplications
planned already, not speaking of that touching every line now asks:
"can I share this?". The final destination is very ambitious: have only
one unified driver for at least i40e, ice, iavf, and idpf with a struct
ops for each generation. That's never gonna happen, right? But you still
can at least try.
PP conversion for iavf lands within the same series as these two are tied
closely. libie will support Page Pool model only, so that a driver can't
use much of the lib until it's converted. iavf is only the example, the
rest will eventually be converted soon on a per-driver basis. That is
when it gets really interesting. Stay tech.
* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
MAINTAINERS: add entry for libeth and libie
iavf: switch to Page Pool
iavf: pack iavf_ring more efficiently
libeth: add Rx buffer management
page_pool: add DMA-sync-for-CPU inline helper
page_pool: constify some read-only function arguments
slab: introduce kvmalloc_array_node() and kvcalloc_node()
iavf: drop page splitting and recycling
iavf: kill "legacy-rx" for good
net: intel: introduce {, Intel} Ethernet common library
====================
Link: https://lore.kernel.org/r/20240424203559.3420468-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use flow_rule_is_supp_control_flags() to reject filters with
unsupported control flags.
In case any unsupported control flags are masked,
flow_rule_is_supp_control_flags() sets a NL extended
error message, and we return -EOPNOTSUPP.
Only compile-tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20240424125347.461995-4-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rename goto label, as the error message is specific to the fragment flags.
Only compile-tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Link: https://lore.kernel.org/r/20240424125347.461995-3-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Asbjørn Sloth Tønnesen says:
====================
net: sparx5: flower: validate control flags
This series adds flower control flags validation to the
sparx5 driver, and changes it from assuming that it handles
all control flags, to instead reject rules if they have
masked any unknown/unsupported control flags.
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Tested-by: Daniel Machon <daniel.machon@microchip.com>
v1: https://lore.kernel.org/netdev/20240423102728.228765-1-ast@fiberby.net/
====================
Link: https://lore.kernel.org/r/20240424121632.459022-1-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use flow_rule_is_supp_control_flags() to reject filters with
unsupported control flags.
In case any unsupported control flags are masked,
flow_rule_is_supp_control_flags() sets a NL extended
error message, and we return -EOPNOTSUPP.
Only compile-tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Tested-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/20240424121632.459022-5-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Remove goto, as it's only used once, and the error message is
specific to that context.
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Tested-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/20240424121632.459022-4-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Define extack locally, to reduce line lengths and aid future users.
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Tested-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/20240424121632.459022-3-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The fragment lookup should only be performed, when
at least one of the fragment flags are set.
This change was deliberately not included in commit
68aba00483 ("net: sparx5: flower: fix fragment flags handling")
as it's only needed for future proffing the code, since
"mask" is currently only set in conjunction with the
fragment flags.
(The 3rd flag FLOW_DIS_ENCAPSULATION is only used with "key")
Only compile tested.
Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Reviewed-by: Daniel Machon <daniel.machon@microchip.com>
Tested-by: Daniel Machon <daniel.machon@microchip.com>
Link: https://lore.kernel.org/r/20240424121632.459022-2-ast@fiberby.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>