To avoid confusions like when working on the previous patch, better to
declare and assign this variable only where it is needed.
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
On one side, we can allow the creation of subflows between v4 mapped in
v6 and v4 addresses. For that we look for v4mapped addresses between the
local address we want to select and the remote one.
On the other side, we also properly deal with received v4mapped
addresses, either announced ones or set via Netlink.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/122
Suggested-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Co-developed-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
With an IPv4 mapped in v6 socket, we were trying to call inet6_bind()
with an IPv4 address resulting in a -EINVAL error because the given
addr_len -- size of the address structure -- was too short.
We now make sure to use address structures for the same family as the
MPTCP socket for both the bind() and the connect(). It means we convert
v4 addresses to v4 mapped in v6 or the opposite if needed.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/122
Co-developed-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
On MPTCP-level ack reception, the packet scheduler
may select a subflow other then the current one.
Prior to this commit we rely on the workqueue to trigger
action on such subflow.
This changeset introduces an infrastructure that allows
any MPTCP subflow to schedule actions (MPTCP xmit) on
others subflows without resorting to (multiple) process
reschedule.
A dummy NAPI instance is used instead. When MPTCP needs to
trigger action an a different subflow, it enqueues the target
subflow on the NAPI backlog and schedule such instance as needed.
The dummy NAPI poll method walks the sockets backlog and tries
to acquire the (BH) socket lock on each of them. If the socket
is owned by the user space, the action will be completed by
the sock release cb, otherwise push is started.
This change leverages the delegated action infrastructure
to avoid invoking the MPTCP worker to spool the pending data,
when the packet scheduler picks a subflow other then the one
currently processing the incoming MPTCP-level ack.
Additionally we further refine the subflow selection
invoking the packet scheduler for each chunk of data
even inside __mptcp_subflow_push_pending().
v1 -> v2:
- fix possible UaF at shutdown time, resetting sock ops
after removing the ulp context
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Otherwise the packet scheduler policy will not be
enforced when pushing pending data at MPTCP-level
ack reception time.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The current packet scheduler can enqueue up to sndbuf
data on each subflow. If the send buffer is large and
the subflows are not symmetric, this could lead to
suboptimal aggregate bandwidth utilization.
Limit the amount of queued data to the maximum send
window.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
After commit 6e628cd3a8 ("mptcp: use mptcp release_cb for
delayed tasks"), MPTCP never sets the flag bit SOCK_NOSPACE
on its subflow. As a side effect, autotune never takes place,
as it happens inside tcp_new_space(), which in turn is called
only when the mentioned bit is set.
Let's sendmsg() set the subflows NOSPACE bit when looking for
more memory and use the subflow write_space callback to propagate
the snd buf update and wake-up the user-space.
Additionally, this allows dropping a bunch of duplicate code and
makes the SNDBUF_LIMITED chrono relevant again for MPTCP subflows.
Fixes: 6e628cd3a8 ("mptcp: use mptcp release_cb for delayed tasks")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, incoming subflows link to the parent socket,
while outgoing ones link to a per subflow socket. The latter
is not really needed, except at the initial connect() time and
for the first subflow.
Always graft the outgoing subflow to the parent socket and
free the unneeded ones early.
This allows some code cleanup, reduces the amount of memory
used and will simplify the next patch
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_disconnect() expects the caller acquires the sock lock,
but mptcp_disconnect() is not doing that. Add the missing
required lock.
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: 76e2a55d16 ("mptcp: better msk-level shutdown.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/f818e82b58a556feeb71dcccc8bf1c87aafc6175.1610638176.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Instead of re-implementing most of inet_shutdown, re-use
such helper, and implement the MPTCP-specific bits at the
'proto' level.
The msk-level disconnect() can now be invoked, lets provide a
suitable implementation.
As a side effect, this fixes bad state management for listener
sockets. The latter could lead to division by 0 oops since
commit ea4ca586b1 ("mptcp: refine MPTCP-level ack scheduling").
Fixes: 43b54c6ee3 ("mptcp: Use full MPTCP-level disconnect state machine")
Fixes: ea4ca586b1 ("mptcp: refine MPTCP-level ack scheduling")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Syzkaller found a way to trigger division by zero
in mptcp_subflow_cleanup_rbuf().
The current checks implemented into tcp_can_send_ack()
are too week, let's be more accurate.
Reported-by: Christoph Paasch <cpaasch@apple.com>
Fixes: ea4ca586b1 ("mptcp: refine MPTCP-level ack scheduling")
Fixes: fd8976790a ("mptcp: be careful on MPTCP-level ack.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch added the mibs for MP_PRIO, MPTCP_MIB_MPPRIOTX for transmitting
of the MP_PRIO suboption, and MPTCP_MIB_MPPRIORX for receiving of it.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch added a new command MPTCP_PM_CMD_SET_FLAGS in PM netlink:
In mptcp_nl_cmd_set_flags, parse the input address, get the backup value
according to whether the address's FLAG_BACKUP flag is set from the
user-space. Then check whether this address had been added in the local
address list. If it had been, then call mptcp_nl_addr_backup to deal with
this address.
In mptcp_nl_addr_backup, traverse all the existing msk sockets to find
the relevant sockets, and call mptcp_pm_nl_mp_prio_send_ack to send out
a MP_PRIO ACK packet.
Finally in mptcp_nl_cmd_set_flags, set or clear the address's FLAG_BACKUP
flag.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch added the incoming MP_PRIO logic:
Added a flag named mp_prio in struct mptcp_options_received, to mark the
MP_PRIO is received, and save the priority value to struct
mptcp_options_received's backup member. Then invoke
mptcp_pm_mp_prio_received with the receiving subsocket and the backup
value.
In mptcp_pm_mp_prio_received, get the subflow context according the input
subsocket, and change the subflow's backup as the incoming priority value.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch added the outgoing MP_PRIO logic:
In mptcp_pm_nl_mp_prio_send_ack, find the related subflow and subsocket
according to the input parameter addr. Save the input priority value to
suflow's backup, then set subflow's send_mp_prio flag to true, and save
the input priority value to suflow's request_bkup. Finally, send out a
pure ACK on the related subsocket.
In mptcp_established_options_mp_prio, check whether the subflow's
send_mp_prio is set. If it is, this is the packet for sending MP_PRIO.
So save subflow->request_bkup value to mptcp_out_options's backup, and
change the option type to OPTION_MPTCP_PRIO.
In mptcp_write_options, clear the send_mp_prio flag and send out the
MP_PRIO suboption with mptcp_out_options's backup value.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently the address ID set by the netlink PM from user-space is
overridden by the kernel. This patch added the address ID assignment
bitmap to allow user-space to set the address ID.
Use a per netns bitmask id_bitmap (256 bits) to keep track of in-use IDs.
And use next_id to keep track of the highest ID currently in use. If the
user-space provides an ID at endpoint creation time, try to use it. If
already in use, endpoint creation fails. Otherwise pick the first ID
available after the highest currently in use, with wrap-around.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When sendmsg() needs to wait for memory, the pending data
is not updated. That causes a drift in forward memory allocation,
leading to stall and/or warnings at socket close time.
This change addresses the above issue moving the pending data
counter update inside the sendmsg() main loop.
Fixes: 6e628cd3a8 ("mptcp: use mptcp release_cb for delayed tasks")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When multiple subflows are active, we can receive a
window update on subflow with no write space available.
MPTCP will try to push frames on such subflow and will
fail. Pending frames will be pushed only after receiving
a window update on a subflow with some wspace available.
Overall the above could lead to suboptimal aggregate
bandwidth usage.
Instead, we should try to push pending frames as soon as
the subflow reaches both conditions mentioned above.
We can finally enable self-tests with asymmetric links,
as the above makes them finally pass.
Fixes: 6f8a612a33 ("mptcp: keep track of advertised windows right edge")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
MPTCP closes the subflows while holding the msk-level lock.
While acquiring the subflow socket lock we need to use the
correct nested annotation, or we can hit a lockdep splat
at runtime.
Reported-and-tested-by: Geliang Tang <geliangtang@gmail.com>
Fixes: e16163b6e2 ("mptcp: refactor shutdown and close")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently MPTCP is not propagating the security context
from the ingress request socket to newly created msk
at clone time.
Address the issue invoking the missing security helper.
Fixes: cf7da0d66c ("mptcp: Create SUBFLOW socket for incoming connections")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Core:
- support "prefer busy polling" NAPI operation mode, where we defer softirq
for some time expecting applications to periodically busy poll
- AF_XDP: improve efficiency by more batching and hindering
the adjacency cache prefetcher
- af_packet: make packet_fanout.arr size configurable up to 64K
- tcp: optimize TCP zero copy receive in presence of partial or unaligned
reads making zero copy a performance win for much smaller messages
- XDP: add bulk APIs for returning / freeing frames
- sched: support fragmenting IP packets as they come out of conntrack
- net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs
BPF:
- BPF switch from crude rlimit-based to memcg-based memory accounting
- BPF type format information for kernel modules and related tracing
enhancements
- BPF implement task local storage for BPF LSM
- allow the FENTRY/FEXIT/RAW_TP tracing programs to use bpf_sk_storage
Protocols:
- mptcp: improve multiple xmit streams support, memory accounting and
many smaller improvements
- TLS: support CHACHA20-POLY1305 cipher
- seg6: add support for SRv6 End.DT4/DT6 behavior
- sctp: Implement RFC 6951: UDP Encapsulation of SCTP
- ppp_generic: add ability to bridge channels directly
- bridge: Connectivity Fault Management (CFM) support as is defined in
IEEE 802.1Q section 12.14.
Drivers:
- mlx5: make use of the new auxiliary bus to organize the driver internals
- mlx5: more accurate port TX timestamping support
- mlxsw:
- improve the efficiency of offloaded next hop updates by using
the new nexthop object API
- support blackhole nexthops
- support IEEE 802.1ad (Q-in-Q) bridging
- rtw88: major bluetooth co-existance improvements
- iwlwifi: support new 6 GHz frequency band
- ath11k: Fast Initial Link Setup (FILS)
- mt7915: dual band concurrent (DBDC) support
- net: ipa: add basic support for IPA v4.5
Refactor:
- a few pieces of in_interrupt() cleanup work from Sebastian Andrzej Siewior
- phy: add support for shared interrupts; get rid of multiple driver
APIs and have the drivers write a full IRQ handler, slight growth
of driver code should be compensated by the simpler API which
also allows shared IRQs
- add common code for handling netdev per-cpu counters
- move TX packet re-allocation from Ethernet switch tag drivers to
a central place
- improve efficiency and rename nla_strlcpy
- number of W=1 warning cleanups as we now catch those in a patchwork
build bot
Old code removal:
- wan: delete the DLCI / SDLA drivers
- wimax: move to staging
- wifi: remove old WDS wifi bridging support
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl/YXmUACgkQMUZtbf5S
IrvSQBAAgOrt4EFopEvVqlTHZbqI45IEqgtXS+YWmlgnjZCgshyMj8q1yK1zzane
qYxr/NNJ9kV3FdtaynmmHPgEEEfR5kJ/D3B2BsxYDkaDDrD0vbNsBGw+L+/Gbhxl
N/5l/9FjLyLY1D+EErknuwR5XGuQ6BSDVaKQMhYOiK2hgdnAAI4hszo8Chf6wdD0
XDBslQ7vpD/05r+eMj0IkS5dSAoGOIFXUxhJ5dqrDbRHiKsIyWqA3PLbYemfAhxI
s2XckjfmSgGE3FKL8PSFu+EcfHbJQQjLcULJUnqgVcdwEEtRuE9ggEi52nZRXMWM
4e8sQJAR9Fx7pZy0G1xfS149j6iPU5LjRlU9TNSpVABz14Vvvo3gEL6gyIdsz+xh
hMN7UBdp0FEaP028CXoIYpaBesvQqj0BSndmee8qsYAtN6j+QKcM2AOSr7JN1uMH
C/86EDoGAATiEQIVWJvnX5MPmlAoblyLA+RuVhmxkIBx2InGXkFmWqRkXT5l4jtk
LVl8/TArR4alSQqLXictXCjYlCm9j5N4zFFtEVasSYi7/ZoPfgRNWT+lJ2R8Y+Zv
+htzGaFuyj6RJTVeFQMrkl3whAtBamo2a0kwg45NnxmmXcspN6kJX1WOIy82+MhD
Yht7uplSs7MGKA78q/CDU0XBeGjpABUvmplUQBIfrR/jKLW2730=
=GXs1
-----END PGP SIGNATURE-----
Merge tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- support "prefer busy polling" NAPI operation mode, where we defer
softirq for some time expecting applications to periodically busy
poll
- AF_XDP: improve efficiency by more batching and hindering the
adjacency cache prefetcher
- af_packet: make packet_fanout.arr size configurable up to 64K
- tcp: optimize TCP zero copy receive in presence of partial or
unaligned reads making zero copy a performance win for much smaller
messages
- XDP: add bulk APIs for returning / freeing frames
- sched: support fragmenting IP packets as they come out of conntrack
- net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs
BPF:
- BPF switch from crude rlimit-based to memcg-based memory accounting
- BPF type format information for kernel modules and related tracing
enhancements
- BPF implement task local storage for BPF LSM
- allow the FENTRY/FEXIT/RAW_TP tracing programs to use
bpf_sk_storage
Protocols:
- mptcp: improve multiple xmit streams support, memory accounting and
many smaller improvements
- TLS: support CHACHA20-POLY1305 cipher
- seg6: add support for SRv6 End.DT4/DT6 behavior
- sctp: Implement RFC 6951: UDP Encapsulation of SCTP
- ppp_generic: add ability to bridge channels directly
- bridge: Connectivity Fault Management (CFM) support as is defined
in IEEE 802.1Q section 12.14.
Drivers:
- mlx5: make use of the new auxiliary bus to organize the driver
internals
- mlx5: more accurate port TX timestamping support
- mlxsw:
- improve the efficiency of offloaded next hop updates by using
the new nexthop object API
- support blackhole nexthops
- support IEEE 802.1ad (Q-in-Q) bridging
- rtw88: major bluetooth co-existance improvements
- iwlwifi: support new 6 GHz frequency band
- ath11k: Fast Initial Link Setup (FILS)
- mt7915: dual band concurrent (DBDC) support
- net: ipa: add basic support for IPA v4.5
Refactor:
- a few pieces of in_interrupt() cleanup work from Sebastian Andrzej
Siewior
- phy: add support for shared interrupts; get rid of multiple driver
APIs and have the drivers write a full IRQ handler, slight growth
of driver code should be compensated by the simpler API which also
allows shared IRQs
- add common code for handling netdev per-cpu counters
- move TX packet re-allocation from Ethernet switch tag drivers to a
central place
- improve efficiency and rename nla_strlcpy
- number of W=1 warning cleanups as we now catch those in a patchwork
build bot
Old code removal:
- wan: delete the DLCI / SDLA drivers
- wimax: move to staging
- wifi: remove old WDS wifi bridging support"
* tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits)
net: hns3: fix expression that is currently always true
net: fix proc_fs init handling in af_packet and tls
nfc: pn533: convert comma to semicolon
af_vsock: Assign the vsock transport considering the vsock address flags
af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path
vsock_addr: Check for supported flag values
vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag
vm_sockets: Add flags field in the vsock address data structure
net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context
nfc: s3fwrn5: Release the nfc firmware
net: vxget: clean up sparse warnings
mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router
mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3
mlxsw: spectrum_router_xm: Introduce basic XM cache flushing
mlxsw: reg: Add Router LPM Cache Enable Register
mlxsw: reg: Add Router LPM Cache ML Delete Register
mlxsw: spectrum_router_xm: Implement L-value tracking for M-index
mlxsw: reg: Add XM Router M Table Register
...
Currently the xmit path of the MPTCP protocol creates smaller-
than-max-size skbs, which is suboptimal for the performances.
There are a few things to improve:
- when coalescing to an existing skb, must clear the PUSH flag
- tcp_build_frag() expect the available space as an argument.
When coalescing is enable MPTCP already subtracted the
to-be-coalesced skb len. We must increment said argument
accordingly.
Before:
./use_mptcp.sh netperf -H 127.0.0.1 -t TCP_STREAM
[...]
131072 16384 16384 30.00 24414.86
After:
./use_mptcp.sh netperf -H 127.0.0.1 -t TCP_STREAM
[...]
131072 16384 16384 30.05 28357.69
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There is no need to unconditionally acquire the join list
lock, we can simply splice the join list into the subflow
list and traverse only the latter.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
parse the MPTCP FASTCLOSE subtype.
If provided key matches the local one, schedule the work queue to close
(with tcp reset) all subflows.
The MPTCP socket moves to closed state immediately.
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When processing options from tcp reset path its possible that
tcp_done(ssk) drops the last reference on the mptcp socket which
results in use-after-free.
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use the macro MPTCPOPT_HMAC_LEN instead of a constant in struct
mptcp_options_received.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the PM netlink flushes the addresses, invoke the remove address
function mptcp_nl_remove_subflow_and_signal_addr to remove the addresses
and the subflows. Since this function should not be invoked under lock,
move __flush_addrs out of the pernet->lock.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
It has been observed that the kernel sockets created for the subflows
(except the first one) are not in the same cgroup as their parents.
That's because the additional subflows are created by kernel workers.
This is a problem with eBPF programs attached to the parent's
cgroup won't be executed for the children. But also with any other features
of CGroup linked to a sk.
This patch fixes this behaviour.
As the subflow sockets are created by the kernel, we can't use
'mem_cgroup_sk_alloc' because of the current context being the one of the
kworker. This is why we have to do low level memcg manipulation, if
required.
Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Nicolas Rybowski <nicolas.rybowski@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pull crypto updates from Herbert Xu:
"API:
- Add speed testing on 1420-byte blocks for networking
Algorithms:
- Improve performance of chacha on ARM for network packets
- Improve performance of aegis128 on ARM for network packets
Drivers:
- Add support for Keem Bay OCS AES/SM4
- Add support for QAT 4xxx devices
- Enable crypto-engine retry mechanism in caam
- Enable support for crypto engine on sdm845 in qce
- Add HiSilicon PRNG driver support"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (161 commits)
crypto: qat - add capability detection logic in qat_4xxx
crypto: qat - add AES-XTS support for QAT GEN4 devices
crypto: qat - add AES-CTR support for QAT GEN4 devices
crypto: atmel-i2c - select CONFIG_BITREVERSE
crypto: hisilicon/trng - replace atomic_add_return()
crypto: keembay - Add support for Keem Bay OCS AES/SM4
dt-bindings: Add Keem Bay OCS AES bindings
crypto: aegis128 - avoid spurious references crypto_aegis128_update_simd
crypto: seed - remove trailing semicolon in macro definition
crypto: x86/poly1305 - Use TEST %reg,%reg instead of CMP $0,%reg
crypto: x86/sha512 - Use TEST %reg,%reg instead of CMP $0,%reg
crypto: aesni - Use TEST %reg,%reg instead of CMP $0,%reg
crypto: cpt - Fix sparse warnings in cptpf
hwrng: ks-sa - Add dependency on IOMEM and OF
crypto: lib/blake2s - Move selftest prototype into header file
crypto: arm/aes-ce - work around Cortex-A57/A72 silion errata
crypto: ecdh - avoid unaligned accesses in ecdh_set_secret()
crypto: ccree - rework cache parameters handling
crypto: cavium - Use dma_set_mask_and_coherent to simplify code
crypto: marvell/octeontx - Use dma_set_mask_and_coherent to simplify code
...
xdp_return_frame_bulk() needs to pass a xdp_buff
to __xdp_return().
strlcpy got converted to strscpy but here it makes no
functional difference, so just keep the right code.
Conflicts:
net/netfilter/nf_tables_api.c
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the workqueue disposes of the msk, the subflows can still
receive some data from the peer after __mptcp_close_ssk()
completes.
The above could trigger a race between the msk receive path and the
msk destruction. Acquiring the mptcp_data_lock() in __mptcp_destroy_sock()
will not save the day: the rx path could be reached even after msk
destruction completes.
Instead use the subflow 'disposable' flag to prevent entering
the msk receive path after __mptcp_close_ssk().
Fixes: e16163b6e2 ("mptcp: refactor shutdown and close")
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a MPTCP listener socket is closed with unaccepted
children pending, the ULP release callback will be invoked,
but nobody will call into __mptcp_close_ssk() on the
corresponding subflow.
As a consequence, at ULP release time, the 'disposable' flag
will be cleared and the subflow context memory will be leaked.
This change addresses the issue always freeing the context if
the subflow is still in the accept queue at ULP release time.
Additionally, this fixes an incorrect code reference in the
related comment.
Note: this fix leverages the changes introduced by the previous
commit.
Fixes: e16163b6e2 ("mptcp: refactor shutdown and close")
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christoph reported the following splat:
WARNING: CPU: 0 PID: 4615 at net/ipv4/inet_connection_sock.c:1031 inet_csk_listen_stop+0x8e8/0xad0 net/ipv4/inet_connection_sock.c:1031
Modules linked in:
CPU: 0 PID: 4615 Comm: syz-executor.4 Not tainted 5.9.0 #37
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:inet_csk_listen_stop+0x8e8/0xad0 net/ipv4/inet_connection_sock.c:1031
Code: 03 00 00 00 e8 79 b2 3d ff e9 ad f9 ff ff e8 1f 76 ba fe be 02 00 00 00 4c 89 f7 e8 62 b2 3d ff e9 14 f9 ff ff e8 08 76 ba fe <0f> 0b e9 97 f8 ff ff e8 fc 75 ba fe be 03 00 00 00 4c 89 f7 e8 3f
RSP: 0018:ffffc900037f7948 EFLAGS: 00010293
RAX: ffff88810a349c80 RBX: ffff888114ee1b00 RCX: ffffffff827b14cd
RDX: 0000000000000000 RSI: ffffffff827b1c38 RDI: 0000000000000005
RBP: ffff88810a2a8000 R08: ffff88810a349c80 R09: fffff520006fef1f
R10: 0000000000000003 R11: fffff520006fef1e R12: ffff888114ee2d00
R13: dffffc0000000000 R14: 0000000000000001 R15: ffff888114ee1d68
FS: 00007f2ac1945700(0000) GS:ffff88811b400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffd44798bc0 CR3: 0000000109810002 CR4: 0000000000170ef0
Call Trace:
__tcp_close+0xd86/0x1110 net/ipv4/tcp.c:2433
__mptcp_close_ssk+0x256/0x430 net/mptcp/protocol.c:1761
__mptcp_destroy_sock+0x49b/0x770 net/mptcp/protocol.c:2127
mptcp_close+0x62d/0x910 net/mptcp/protocol.c:2184
inet_release+0xe9/0x1f0 net/ipv4/af_inet.c:434
__sock_release+0xd2/0x280 net/socket.c:596
sock_close+0x15/0x20 net/socket.c:1277
__fput+0x276/0x960 fs/file_table.c:281
task_work_run+0x109/0x1d0 kernel/task_work.c:151
get_signal+0xe8f/0x1d40 kernel/signal.c:2561
arch_do_signal+0x88/0x1b60 arch/x86/kernel/signal.c:811
exit_to_user_mode_loop kernel/entry/common.c:161 [inline]
exit_to_user_mode_prepare+0x9b/0xf0 kernel/entry/common.c:191
syscall_exit_to_user_mode+0x22/0x150 kernel/entry/common.c:266
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f2ac1254469
Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ff 49 2b 00 f7 d8 64 89 01 48
RSP: 002b:00007f2ac1944dc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffbf RBX: 000000000069bf00 RCX: 00007f2ac1254469
RDX: 0000000000000000 RSI: 0000000000008982 RDI: 0000000000000003
RBP: 000000000069bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000069bf0c
R13: 00007ffeb53f178f R14: 00000000004668b0 R15: 0000000000000003
After commit 0397c6d85f ("mptcp: keep unaccepted MPC subflow into
join list"), the msk's workqueue and/or PM can touch the MPC
subflow - and acquire its socket lock - even if it's still unaccepted.
If the above event races with the relevant listener socket close, we
can end-up with the above splat.
This change addresses the issue delaying the MPC socket insertion
in conn_list at accept time - that is, partially reverting the
blamed commit.
We must additionally ensure that mptcp_pm_fully_established()
happens after accept() time, or the PM will not be able to
handle properly such event - conn_list could be empty otherwise.
In the receive path, we check the subflow list node to ensure
it is out of the listener queue. Be sure client subflows do
not match transiently such condition moving them into the join
list earlier at creation time.
Since we now have multiple mptcp_pm_fully_established() call sites
from different code-paths, said helper can now race with itself.
Use an additional PM status bit to avoid multiple notifications.
Reported-by: Christoph Paasch <cpaasch@apple.com>
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/103
Fixes: 0397c6d85f ("mptcp: keep unaccepted MPC subflow into join list"),
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the local variable sk has been defined, use it instead of
open-coding.
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the RM_ADDR signal had been reused with add_addr_signal, it's not
suitable to call it add_addr_signal or mptcp_add_addr_status. So this
patch renamed add_addr_signal to addr_signal, and renamed
mptcp_add_addr_status to mptcp_addr_signal_status.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch reused add_addr_signal for the RM_ADDR announcing signal, by
defining a new ADD_ADDR status named MPTCP_RM_ADDR_SIGNAL. Then the flag
rm_addr_signal in PM could be dropped.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch printed out more debugging information for the ADD_ADDR
suboption parsing on the incoming path.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a new parameter 'port' for mptcp_pm_announce_addr. If
this parameter is true, we set the MPTCP_ADD_ADDR_PORT bit of the
add_addr_signal. That means the announced address is added with a port
number.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The process is similar to that of the ADD_ADDR IPv6, this patch also sent
out a pure ack for the ADD_ADDR using port.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added a new add_addr_signal type named MPTCP_ADD_ADDR_PORT,
to identify it is an address with port to be added.
It also added a new parameter 'port' for both mptcp_add_addr_len and
mptcp_pm_add_addr_signal.
In mptcp_established_options_add_addr, we check whether the announced
address is added with port. If it is, we put this port number to
mptcp_out_options's port field.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch uses adding up size to get the ADD_ADDR suboption length rather
than returning the ADD_ADDR size constants.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In rfc8684, the length of ADD_ADDR suboption with IPv4 address and port
is 18 octets, but mptcp_write_options is 32-bit aligned, so we need to
pad it to 20 octets. All the other port related option lengths need to
be added up 2 octets similarly.
This patch added a new field 'port' in mptcp_out_options. When this
field is set with a port number, we need to add up 4 octets for the
ADD_ADDR suboption, and put the port number into the suboption.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The length of ADD_ADDR6 is 12 octets longer than ADD_ADDR. That's the
only difference between them.
This patch dropped the duplicate code between ADD_ADDR and ADD_ADDR6
suboptions writing, and unify them into one.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are two differences between ADD_ADDR suboption and ADD_ADDR echo
suboption: The length of the former is 8 octets longer than the length
of the latter. The former's echo-flag is 0, and latter's echo-flag is 1.
This patch added two local variables, len and echo, to unify ADD_ADDR
and ADD_ADDR echo suboptions writing.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When do cat /proc/net/netstat, the output isn't append with a new line, it looks like this:
[root@localhost ~]# cat /proc/net/netstat
...
MPTcpExt: 0 0 0 0 0 0 0 0 0 0 0 0 0[root@localhost ~]#
This is because in mptcp_seq_show(), if mptcp isn't in use, net->mib.mptcp_statistics is NULL,
so it just puts all 0 after "MPTcpExt:", and return, forgot the '\n'.
After this patch:
[root@localhost ~]# cat /proc/net/netstat
...
MPTcpExt: 0 0 0 0 0 0 0 0 0 0 0 0 0
[root@localhost ~]#
Fixes: fc518953bc ("mptcp: add and use MIB counter infrastructure")
Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
Acked-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/142e2fd9-58d9-bb13-fb75-951cccc2331e@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
RFC 8684 says:
If the token is unknown or the host wants to refuse subflow establishment
(for example, due to a limit on the number of subflows it will permit),
the receiver will send back a reset (RST) signal, analogous to an unknown
port in TCP, containing an MP_TCPRST option (Section 3.6) with an
"MPTCP specific error" reason code.
mptcp-next doesn't support MP_TCPRST yet, this can be added in another
change.
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The Multipath-TCP standard (RFC 8684) says that an MPTCP host should send
a TCP reset if the token in a MP_JOIN request is unknown.
At this time we don't do this, the 3whs completes and the 'new subflow'
is reset afterwards. There are two ways to allow MPTCP to send the
reset.
1. override 'send_synack' callback and emit the rst from there.
The drawback is that the request socket gets inserted into the
listeners queue just to get removed again right away.
2. Send the reset from the 'route_req' function instead.
This avoids the 'add&remove request socket', but route_req lacks the
skb that is required to send the TCP reset.
Instead of just adding the skb to that function for MPTCP sake alone,
Paolo suggested to merge init_req and route_req functions.
This saves one indirection from syn processing path and provides the skb
to the merged function at the same time.
'send reset on unknown mptcp join token' is added in next patch.
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>