Introduce a new netdev feature, NETIF_F_GRO_UDP_FWD, to allow user
to turn UDP GRO on and off for forwarding.
Defaults to off to not change current datapath.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Richard Cochran says:
====================
Remove unneeded PHY time stamping option.
The NETWORK_PHY_TIMESTAMPING configuration option adds additional
checks into the networking hot path, and it is only needed by two
rather esoteric devices, namely the TI DP83640 PHYTER and the ZHAW
InES 1588 IP core. Very few end users have these devices, and those
that do have them are building specialized embedded systems.
Unfortunately two unrelated drivers depend on this option, and two
defconfigs enable it. It is probably my fault for not paying enough
attention in reviews.
This series corrects the gratuitous use of NETWORK_PHY_TIMESTAMPING.
====================
Link: https://lore.kernel.org/r/cover.1611198584.git.richardcochran@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The mvpp2 is an Ethernet driver, and it implements MAC style time
stamping of PTP frames. It has no need of the expensive option to
enable PHY time stamping. Remove the incorrect dependency.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The mv88e6xxx is a DSA driver, and it implements DSA style time
stamping of PTP frames. It has no need of the expensive option to
enable PHY time stamping. Remove the bogus dependency.
Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Acked-by: Brandon Streiff <brandon.streiff@ni.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alex Elder says:
====================
net: ipa: NAPI poll updates
While reviewing the IPA NAPI polling code in detail I found two
problems. This series fixes those, and implements a few other
improvements to this part of the code.
The first two patches are minor bug fixes that avoid extra passes
through the poll function. The third simplifies code inside the
polling loop a bit.
The last two update how interrupts are disabled; previously it was
possible for another I/O completion condition to be recorded before
NAPI got scheduled.
====================
Link: https://lore.kernel.org/r/20210121114821.26495-1-elder@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently in gsi_isr_ieob(), event ring IEOB interrupts are disabled
one at a time. The loop disables the IEOB interrupt for all event
rings represented in the event mask. Instead, just disable them all
at once.
Disable them all *before* clearing the interrupt condition. This
guarantees we'll schedule NAPI for each event once, before another
IEOB interrupt could be signaled.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rename gsi_irq_ieob_disable() to be gsi_irq_ieob_disable_one().
Introduce a new function gsi_irq_ieob_disable() that takes a mask of
events to disable rather than a single event id. This will be used
in the next patch.
Rename gsi_irq_ieob_enable() to be gsi_irq_ieob_enable_one() to be
consistent.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Have gsi_channel_update() return the first transaction in the
updated completed transaction list, or NULL if no new transactions
have been added.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pay attention to the return value of napi_complete(), completing
polling only if it returns true.
Just use napi rather than &channel->napi as the argument passed to
napi_complete().
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is an off-by-one problem in gsi_channel_poll(). The count of
transactions completed is incremented each time through the loop
*before* determining whether there is any more work to do. As a
result, if we exit the loop early the counter its value is one more
than the number of transactions actually processed.
Instead, increment the count after processing, to ensure it reflects
the number of processed transactions. The result is more naturally
described as a for loop rather than a while loop, so change that.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ido Schimmel says:
====================
mlxsw: Expose number of physical ports
The switch ASIC has a limited capacity of physical ports that it can
support. While each system is brought up with a different number of
ports, this number can be increased via splitting up to the ASIC's
limit.
Expose physical ports as a devlink resource so that user space will have
visibility into the maximum number of ports that can be supported and
the current occupancy. With this resource it is possible, for example,
to write generic (i.e., not platform dependent) tests for port
splitting.
Patch #1 adds the new resource and patch #2 adds a selftest.
v2:
* Add the physical ports resource as a generic devlink resource so that
it could be re-used by other device drivers
====================
Link: https://lore.kernel.org/r/20210121131024.2656154-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Query the maximum number of supported physical ports using devlink-resource
and test that this number can be reached by splitting each of the
splittable ports to its width. Test that an error is returned in case
the maximum number is exceeded.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The switch ASIC has a limited capacity of physical ('flavour physical'
in devlink terminology) ports that it can support. While each system is
brought up with a different number of ports, this number can be
increased via splitting up to the ASIC's limit.
Expose physical ports as a devlink resource so that user space will have
visibility to the maximum number of ports that can be supported and the
current occupancy.
In addition, add a "Generic Resources" section in devlink-resource
documentation so the different drivers will be aligned by the same resource
name when exposing to user space.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Maxim Mikityanskiy says:
====================
HTB offload
This series adds support for HTB offload to the HTB qdisc, and adds
usage to mlx5 driver.
The previous RFCs are available at [1], [2].
The feature is intended to solve the performance bottleneck caused by
the single lock of the HTB qdisc, which prevents it from scaling well.
The HTB algorithm itself is offloaded to the device, eliminating the
need to take the root lock of HTB on every packet. Classification part
is done in clsact (still in software) to avoid acquiring the lock, which
imposes a limitation that filters can target only leaf classes.
The speedup on Mellanox ConnectX-6 Dx was 14.2 times in the UDP
multi-stream test, compared to software HTB implementation (more details
in the mlx5 patch).
[1]: https://www.spinics.net/lists/netdev/msg628422.html
[2]: https://www.spinics.net/lists/netdev/msg663548.html
v2 changes:
Fixed sparse and smatch warnings. Formatted HTB patches to 80 chars per
line.
v3 changes:
Fixed the CI failure on parisc with 16-bit xchg by replacing it with
WRITE_ONCE. Fixed the capability bits in mlx5_ifc.h and the value of
MLX5E_QOS_MAX_LEAF_NODES.
v4 changes:
Check if HTB is root when offloading. Add extack for hardware errors.
Rephrase explanations of how it works in the commit message. Remove %hu
from format strings. Add resiliency when leaf_del_last fails to create a
new leaf node.
====================
Link: https://lore.kernel.org/r/20210119120815.463334-1-maximmi@mellanox.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit adds support for HTB offload in the mlx5e driver.
Performance:
NIC: Mellanox ConnectX-6 Dx
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (24 cores with HT)
100 Gbit/s line rate, 500 UDP streams @ ~200 Mbit/s each
48 traffic classes, flower used for steering
No shaping (rate limits set to 4 Gbit/s per TC) - checking for max
throughput.
Baseline: 98.7 Gbps, 8.25 Mpps
HTB: 6.7 Gbps, 0.56 Mpps
HTB offload: 95.6 Gbps, 8.00 Mpps
Limitations:
1. 256 leaf nodes, 3 levels of depth.
2. Granularity for ceil is 1 Mbit/s. Rates are converted to weights, and
the bandwidth is split among the siblings according to these weights.
Other parameters for classes are not supported.
Ethtool statistics support for QoS SQs are also added. The counters are
called qos_txN_*, where N is the QoS queue number (starting from 0, the
numeration is separate from the normal SQs), and * is the counter name
(the counters are the same as for the normal SQs).
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit adds support for statistics of offloaded HTB. Bytes and
packets counters for leaf and inner nodes are supported, the values are
taken from per-queue qdiscs, and the numbers that the user sees should
have the same behavior as the software (non-offloaded) HTB.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
HTB doesn't scale well because of contention on a single lock, and it
also consumes CPU. This patch adds support for offloading HTB to
hardware that supports hierarchical rate limiting.
In the offload mode, HTB passes control commands to the driver using
ndo_setup_tc. The driver has to replicate the whole hierarchy of classes
and their settings (rate, ceil) in the NIC. Every modification of the
HTB tree caused by the admin results in ndo_setup_tc being called.
After this setup, the HTB algorithm is done completely in the NIC. An SQ
(send queue) is created for every leaf class and attached to the
hierarchy, so that the NIC can calculate and obey aggregated rate
limits, too. In the future, it can be changed, so that multiple SQs will
back a single leaf class.
ndo_select_queue is responsible for selecting the right queue that
serves the traffic class of each packet.
The data path works as follows: a packet is classified by clsact, the
driver selects a hardware queue according to its class, and the packet
is enqueued into this queue's qdisc.
This solution addresses two main problems of scaling HTB:
1. Contention by flow classification. Currently the filters are attached
to the HTB instance as follows:
# tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80
classid 1:10
It's possible to move classification to clsact egress hook, which is
thread-safe and lock-free:
# tc filter add dev eth0 egress protocol ip flower dst_port 80
action skbedit priority 1:10
This way classification still happens in software, but the lock
contention is eliminated, and it happens before selecting the TX queue,
allowing the driver to translate the class to the corresponding hardware
queue in ndo_select_queue.
Note that this is already compatible with non-offloaded HTB and doesn't
require changes to the kernel nor iproute2.
2. Contention by handling packets. HTB is not multi-queue, it attaches
to a whole net device, and handling of all packets takes the same lock.
When HTB is offloaded, it registers itself as a multi-queue qdisc,
similarly to mq: HTB is attached to the netdev, and each queue has its
own qdisc.
Some features of HTB may be not supported by some particular hardware,
for example, the maximum number of classes may be limited, the
granularity of rate and ceil parameters may be different, etc. - so, the
offload is not enabled by default, a new parameter is used to enable it:
# tc qdisc replace dev eth0 root handle 1: htb offload
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In a following commit, sch_htb will start using extack in the delete
class operation to pass hardware errors in offload mode. This commit
prepares for that by adding the extack parameter to this callback and
converting usage of the existing qdiscs.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The existing qdiscs that set TCQ_F_MQROOT don't use sch_tree_lock.
However, hardware-offloaded HTB will start setting this flag while also
using sch_tree_lock.
The current implementation of sch_tree_lock basically locks on
qdisc->dev_queue->qdisc, and it works fine when the tree is attached to
some queue. However, it's not the case for MQROOT qdiscs: such a qdisc
is the root itself, and its dev_queue just points to queue 0, while not
actually being used, because there are real per-queue qdiscs.
This patch changes the logic of sch_tree_lock and sch_tree_unlock to
lock the qdisc itself if it's the MQROOT.
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Arjun Roy says:
====================
tcp: add CMSG+rx timestamps to rx. zerocopy
Provide CMSG and receive timestamp support to TCP
receive zerocopy. Patch 1 refactors CMSG pending state for
tcp_recvmsg() to avoid the use of magic numbers; patch 2 implements
receive timestamp via CMSG support for receive zerocopy, and uses the
constants added in patch 1.
====================
Link: https://lore.kernel.org/r/20210121004148.2340206-1-arjunroy.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_recvmsg() uses the CMSG mechanism to receive control information
like packet receive timestamps. This patch adds CMSG fields to
struct tcp_zerocopy_receive, and provides receive timestamps
if available to the user.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
At present, tcp_recvmsg() uses flags to track if any CMSGs are pending
and what those CMSGs are. These flags are currently magic numbers,
used only within tcp_recvmsg().
To prepare for receive timestamp support in tcp receive zerocopy,
gently refactor these magic numbers into enums.
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Nikolay Aleksandrov says:
====================
net: bridge: multicast: add initial EHT support
This set adds explicit host tracking support for IGMPv3/MLDv2. The
already present per-port fast leave flag is used to enable it since that
is the primary goal of EHT, to track a group and its S,Gs usage per-host
and when left without any interested hosts delete them before the standard
timers. The EHT code is pretty self-contained and not enabled by default.
There is no new uAPI added, all of the functionality is currently hidden
behind the fast leave flag. In the future that will change (more below).
The host tracking uses two new sets per port group: one having an entry for
each host which contains that host's view of the group (source list and
filter mode), and one set which contains an entry for each source having
an internal set which contains an entry for each host that has reported
an interest for that source. RB trees are used for all sets so they're
compact when not used and fast when we need to do lookups.
To illustrate it:
[ bridge port group ]
` [ host set (rb) ]
` [ host entry with a list of sources and filter mode ]
` [ source set (rb) ]
` [ source entry ]
` [ source host set (rb) ]
` [ source host entry with a timer ]
The number of tracked sources per host is limited to the maximum total
number of S,G entries per port group - PG_SRC_ENT_LIMIT (currently 32).
The number of hosts is unlimited, I think the argument that a local
attacker can exhaust the memory/cause high CPU usage can be applied to
fdb entries as well which are unlimited. In the future if needed we can
add an option to limit these, but I don't think it's necessary for a
start. All of the new sets are protected by the bridge's multicast lock.
I'm pretty sure we'll be changing the cases and improving the
convergence time in the future, but this seems like a good start.
Patch breakdown:
patch 1 - 4: minor cleanups and preparations for EHT
patch 5: adds the new structures which will be used in the
following patches
patch 6: adds support to create, destroy and lookup host entries
patch 7: adds support to create, delete and lokup source set entries
patch 8: adds a host "delete" function which is just a host's
source list flush since that would automatically delete
the host
patch 9 - 10: add support for handling all IGMPv3/MLDv2 report types
more information can be found in the individual patches
patch 11: optmizes a specific TO_INCLUDE use-case with host timeouts
patch 12: handles per-host filter mode changing (include <-> exclude)
patch 13: pulls out block group deletion since now it can be
deleted in both filter modes
patch 14: marks deletions done due to fast leave
Future plans:
- export host information
- add an option to reduce queries
- add an option to limit the number of host entries
- tune more fast leave cases for quicker convergence
By the way I think this is the first open-source EHT implementation, I
couldn't find any while researching it. :)
====================
Link: https://lore.kernel.org/r/20210120145203.1109140-1-razor@blackwall.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mark groups which were deleted due to fast leave/EHT.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A block report can result in empty source and host sets for both include
and exclude groups so if there are no hosts left we can safely remove
the group. Pull the block group handling so it can cover both cases and
add a check if EHT requires the delete.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We should be able to handle host filter mode changing. For exclude mode
we must create a zero-src entry so the group will be kept even without
any S,G entries (non-zero source sets). That entry doesn't count to the
entry limit and can always be created, its timer is refreshed on new
exclude reports and if we change the host filter mode to include then it
gets removed and we rely only on the non-zero source sets.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This is an optimization specifically for TO_INCLUDE which sends queries
for the older entries and thus lowers the S,G timers to LMQT. If we have
the following situation for a group in either include or exclude mode:
- host A was interested in srcs X and Y, but is timing out
- host B sends TO_INCLUDE src Z, the bridge lowers X and Y's timeouts
to LMQT
- host B sends BLOCK src Z after LMQT time has passed
=> since host B is the last host we can delete the group, but if we
still have host A's EHT entries for X and Y (i.e. if they weren't
lowered to LMQT previously) then we'll have to wait another LMQT
time before deleting the group, with this optimization we can
directly remove it regardless of the group mode as there are no more
interested hosts
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for IGMPv3/MLDv2 include and exclude EHT handling. Similar to
how the reports are processed we have 2 cases when the group is in include
or exclude mode, these are processed as follows:
- group include
- is_include: create missing entries
- to_include: flush existing entries and create a new set from the
report, obviously if the src set is empty then we delete the group
- group exclude
- is_exclude: create missing entries
- to_exclude: flush existing entries and create a new set from the
report, any empty source set entries are removed
If the group is in a different mode then we just flush all entries reported
by the host and we create a new set with the new mode entries created from
the report. If the report is include type, the source list is empty and
the group has empty sources' set then we remove it. Any source set entries
which are empty are removed as well. If the group is in exclude mode it
can exist without any S,G entries (allowing for all traffic to pass).
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for IGMPv3/MLDv2 allow/block EHT handling. Similar to how
the reports are processed we have 2 cases when the group is in include
or exclude mode, these are processed as follows:
- group include
- allow: create missing entries
- block: remove existing matching entries and remove the corresponding
S,G entries if there are no more set host entries, then possibly
delete the whole group if there are no more S,G entries
- group exclude
- allow
- host include: create missing entries
- host exclude: remove existing matching entries and remove the
corresponding S,G entries if there are no more set host entries
- block
- host include: remove existing matching entries and remove the
corresponding S,G entries if there are no more set host entries,
then possibly delete the whole group if there are no more S,G entries
- host exclude: create missing entries
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that we can delete set entries, we can use that to remove EHT hosts.
Since the group's host set entries exist only when there are related
source set entries we just have to flush all source set entries
joined by the host set entry and it will be automatically removed.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add EHT source set and set-entry create, delete and lookup functions.
These allow to manipulate source sets which contain their own host sets
with entries which joined that S,G. We're limiting the maximum number of
tracked S,G entries per host to PG_SRC_ENT_LIMIT (currently 32) which is
the current maximum of S,G entries for a group. There's a per-set timer
which will be used to destroy the whole set later.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add functions to create, destroy and lookup an EHT host. These are
per-host entries contained in the eht_host_tree in net_bridge_port_group
which are used to store a list of all sources (S,G) entries joined for that
group by each host, the host's current filter mode and total number of
joined entries.
No functional changes yet, these would be used in later patches.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add EHT structures for tracking hosts and sources per group. We keep one
set for each host which has all of the host's S,G entries, and one set for
each multicast source which has all hosts that have joined that S,G. For
each host, source entry we record the filter_mode and we keep an expiry
timer. There is also one global expiry timer per source set, it is
updated with each set entry update, it will be later used to lower the
set's timer instead of lowering each entry's timer separately.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We need to preserve the srcs pointer since we'll be passing it for EHT
handling later.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Prepare __grp_src_block_incl() for being able to cause a notification
due to changes. Currently it cannot happen, but EHT would change that
since we'll be deleting sources immediately. Make sure that if the pg is
deleted we don't return true as that would cause the caller to access
freed pg. This patch shouldn't cause any functional change.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We need to pass the host address so later it can be used for explicit
host tracking. No functional change.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rename src_size argument to addr_size in preparation for passing host
address as an argument to IGMPv3/MLDv2 functions.
No functional change.
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni says:
====================
mptcp: re-enable sndbuf autotune
The sendbuffer autotuning was unintentionally disabled as a
side effect of the recent workqueue removal refactor. These
patches re-enable id, with some extra care: with autotuning
enable/large send buffer we need a more accurate packet
scheduler to be able to use efficiently the available
subflow bandwidth, especially when the subflows have
different capacities.
The first patch cleans-up subflow socket handling, making
the actual re-enable (patch 2) simpler.
Patches 3 and 4 improve the packet scheduler, to better cope
with non trivial scenarios and large send buffer.
Finally patch 5 adds and uses some infrastructure to avoid
the workqueue usage for the packet scheduler operations introduced
by the previous patches.
====================
Link: https://lore.kernel.org/r/cover.1611153172.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
On MPTCP-level ack reception, the packet scheduler
may select a subflow other then the current one.
Prior to this commit we rely on the workqueue to trigger
action on such subflow.
This changeset introduces an infrastructure that allows
any MPTCP subflow to schedule actions (MPTCP xmit) on
others subflows without resorting to (multiple) process
reschedule.
A dummy NAPI instance is used instead. When MPTCP needs to
trigger action an a different subflow, it enqueues the target
subflow on the NAPI backlog and schedule such instance as needed.
The dummy NAPI poll method walks the sockets backlog and tries
to acquire the (BH) socket lock on each of them. If the socket
is owned by the user space, the action will be completed by
the sock release cb, otherwise push is started.
This change leverages the delegated action infrastructure
to avoid invoking the MPTCP worker to spool the pending data,
when the packet scheduler picks a subflow other then the one
currently processing the incoming MPTCP-level ack.
Additionally we further refine the subflow selection
invoking the packet scheduler for each chunk of data
even inside __mptcp_subflow_push_pending().
v1 -> v2:
- fix possible UaF at shutdown time, resetting sock ops
after removing the ulp context
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Otherwise the packet scheduler policy will not be
enforced when pushing pending data at MPTCP-level
ack reception time.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The current packet scheduler can enqueue up to sndbuf
data on each subflow. If the send buffer is large and
the subflows are not symmetric, this could lead to
suboptimal aggregate bandwidth utilization.
Limit the amount of queued data to the maximum send
window.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After commit 6e628cd3a8 ("mptcp: use mptcp release_cb for
delayed tasks"), MPTCP never sets the flag bit SOCK_NOSPACE
on its subflow. As a side effect, autotune never takes place,
as it happens inside tcp_new_space(), which in turn is called
only when the mentioned bit is set.
Let's sendmsg() set the subflows NOSPACE bit when looking for
more memory and use the subflow write_space callback to propagate
the snd buf update and wake-up the user-space.
Additionally, this allows dropping a bunch of duplicate code and
makes the SNDBUF_LIMITED chrono relevant again for MPTCP subflows.
Fixes: 6e628cd3a8 ("mptcp: use mptcp release_cb for delayed tasks")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, incoming subflows link to the parent socket,
while outgoing ones link to a per subflow socket. The latter
is not really needed, except at the initial connect() time and
for the first subflow.
Always graft the outgoing subflow to the parent socket and
free the unneeded ones early.
This allows some code cleanup, reduces the amount of memory
used and will simplify the next patch
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Without this change the driver tries to allocate too many queues,
breaching the number of available msi-x interrupts on machines
with many logical cpus and default adapter settings:
Insufficient resources for 12 XDP event queues (24 other channels, max 32)
Which in turn triggers EINVAL on XDP processing:
sfc 0000:86:00.0 ext0: XDP TX failed (-22)
Signed-off-by: Ivan Babrou <ivan@cloudflare.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://lore.kernel.org/r/20210120212759.81548-1-ivan@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch adds TCP_NLA_TTL to SCM_TIMESTAMPING_OPT_STATS that exports
the time-to-live or hop limit of the latest incoming packet with
SCM_TSTAMP_ACK. The value exported may not be from the packet that acks
the sequence when incoming packets are aggregated. Exporting the
time-to-live or hop limit value of incoming packets helps to estimate
the hop count of the path of the flow that may change over time.
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20210120204155.552275-1-ysseung@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Since the early retransmit has been removed by
commit bec41a11dd ("tcp: remove early retransmit"),
we also remove the unused ICSK_TIME_EARLY_RETRANS macro.
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/1611239473-27304-1-git-send-email-yangpc@wangsu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Document the compatible value for the RAVB block in the Renesas R-Car
V3U (R8A779A0) SoC. This variant has no stream buffer, so we only need
to add the new compatible and add it to the TX delay block.
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Link: https://lore.kernel.org/r/20210121100619.5653-2-wsa+renesas@sang-engineering.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alex Elder says:
====================
net: ipa: remove a build dependency
Unlike the original (temporary) IPA notification mechanism, the
generic remoteproc SSR notification code does not require the IPA
driver to maintain a pointer to the modem subsystem remoteproc
structure.
The IPA driver was converted to use the newer SSR notifiers, but the
specification and use of a phandle for the modem subsystem was never
removed.
This series removes the lookup of the remoteproc pointer, and that
removes the need for the modem DT property. It also removes the
reference to the "modem-remoteproc" property from the DT binding,
and from the DT files that specified them.
====================
Link: https://lore.kernel.org/r/20210120212606.12556-1-elder@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>