Currently, we delete the key expiry timer (in sk->close) before
unhashing the sk. This means that another thread may find the sk through
its presence on the key list, and re-queue the timer.
This change moves the timer deletion to the unhash, after we have made
the key no longer observable, so the timer cannot be re-queued.
Fixes: 7b14e15ae6 ("mctp: Implement a timeout for tags")
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we correlate the mctp_sk_key lifetime to the sock lifetime
through the sock hash/unhash operations, but this is pretty tenuous, and
there are cases where we may have a temporary reference to an unhashed
sk.
This change makes the reference more explicit, by adding a hold on the
sock when it's associated with a mctp_sk_key, released on final key
unref.
Fixes: 73c618456d ("mctp: locking, lifetime and validity changes for sk_keys")
Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When listen() and accept() are called on an x25 socket
that connect() succeeds, accept() succeeds immediately.
This is because x25_connect() queues the skb to
sk->sk_receive_queue, and x25_accept() dequeues it.
This creates a child socket with the sk of the parent
x25 socket, which can cause confusion.
Fix x25_listen() to return -EINVAL if the socket has
already been successfully connect()ed to avoid this issue.
Signed-off-by: Hyunwoo Kim <v4bel@theori.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
A listening socket linked to a sockmap has its sk_prot overridden. It
points to one of the struct proto variants in tcp_bpf_prots. The variant
depends on the socket's family and which sockmap programs are attached.
A child socket cloned from a TCP listener initially inherits their sk_prot.
But before cloning is finished, we restore the child's proto to the
listener's original non-tcp_bpf_prots one. This happens in
tcp_create_openreq_child -> tcp_bpf_clone.
Today, in tcp_bpf_clone we detect if the child's proto should be restored
by checking only for the TCP_BPF_BASE proto variant. This is not
correct. The sk_prot of listening socket linked to a sockmap can point to
to any variant in tcp_bpf_prots.
If the listeners sk_prot happens to be not the TCP_BPF_BASE variant, then
the child socket unintentionally is left if the inherited sk_prot by
tcp_bpf_clone.
This leads to issues like infinite recursion on close [1], because the
child state is otherwise not set up for use with tcp_bpf_prot operations.
Adjust the check in tcp_bpf_clone to detect all of tcp_bpf_prots variants.
Note that it wouldn't be sufficient to check the socket state when
overriding the sk_prot in tcp_bpf_update_proto in order to always use the
TCP_BPF_BASE variant for listening sockets. Since commit
b8b8315e39 ("bpf, sockmap: Remove unhash handler for BPF sockmap usage")
it is possible for a socket to transition to TCP_LISTEN state while already
linked to a sockmap, e.g. connect() -> insert into map ->
connect(AF_UNSPEC) -> listen().
[1]: https://lore.kernel.org/all/00000000000073b14905ef2e7401@google.com/
Fixes: e80251555f ("tcp_bpf: Don't let child socket inherit parent protocol ops on copy")
Reported-by: syzbot+04c21ed96d861dccc5cd@syzkaller.appspotmail.com
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20230113-sockmap-fix-v2-2-1e0ee7ac2f90@cloudflare.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Perform SCTP vtag verification for ABORT/SHUTDOWN_COMPLETE according
to RFC 9260, Sect 8.5.1.
2) Fix infinite loop if SCTP chunk size is zero in for_each_sctp_chunk().
And remove useless check in this macro too.
3) Revert DATA_SENT state in the SCTP tracker, this was applied in the
previous merge window. Next patch in this series provides a more
simple approach to multihoming support.
4) Unify HEARTBEAT_ACKED and ESTABLISHED states for SCTP multihoming
support, use default ESTABLISHED of 210 seconds based on
heartbeat timeout * maximum number of retransmission + round-trip timeout.
Otherwise, SCTP conntrack entry that represents secondary paths
remain stale in the table for up to 5 days.
This is a slightly large batch with fixes for the SCTP connection
tracking helper, all patches from Sriram Yagnaraman.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: conntrack: unify established states for SCTP paths
Revert "netfilter: conntrack: add sctp DATA_SENT state"
netfilter: conntrack: fix bug in for_each_sctp_chunk
netfilter: conntrack: fix vtag checks for ABORT/SHUTDOWN_COMPLETE
====================
Link: https://lore.kernel.org/r/20230124183933.4752-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, if you bind the socket to something like:
servaddr.sin6_family = AF_INET6;
servaddr.sin6_port = htons(0);
servaddr.sin6_scope_id = 0;
inet_pton(AF_INET6, "::1", &servaddr.sin6_addr);
And then request a connect to:
connaddr.sin6_family = AF_INET6;
connaddr.sin6_port = htons(20000);
connaddr.sin6_scope_id = if_nametoindex("lo");
inet_pton(AF_INET6, "fe88::1", &connaddr.sin6_addr);
What the stack does is:
- bind the socket
- create a new asoc
- to handle the connect
- copy the addresses that can be used for the given scope
- try to connect
But the copy returns 0 addresses, and the effect is that it ends up
trying to connect as if the socket wasn't bound, which is not the
desired behavior. This unexpected behavior also allows KASLR leaks
through SCTP diag interface.
The fix here then is, if when trying to copy the addresses that can
be used for the scope used in connect() it returns 0 addresses, bail
out. This is what TCP does with a similar reproducer.
Reported-by: Pietro Borrello <borrello@diag.uniroma1.it>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/9fcd182f1099f86c6661f3717f63712ddd1c676c.1674496737.git.marcelo.leitner@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
An SCTP endpoint can start an association through a path and tear it
down over another one. That means the initial path will not see the
shutdown sequence, and the conntrack entry will remain in ESTABLISHED
state for 5 days.
By merging the HEARTBEAT_ACKED and ESTABLISHED states into one
ESTABLISHED state, there remains no difference between a primary or
secondary path. The timeout for the merged ESTABLISHED state is set to
210 seconds (hb_interval * max_path_retrans + rto_max). So, even if a
path doesn't see the shutdown sequence, it will expire in a reasonable
amount of time.
With this change in place, there is now more than one state from which
we can transition to ESTABLISHED, COOKIE_ECHOED and HEARTBEAT_SENT, so
handle the setting of ASSURED bit whenever a state change has happened
and the new state is ESTABLISHED. Removed the check for dir==REPLY since
the transition to ESTABLISHED can happen only in the reply direction.
Fixes: 9fb9cbb108 ("[NETFILTER]: Add nf_conntrack subsystem.")
Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This reverts commit (bff3d05348: "netfilter: conntrack: add sctp
DATA_SENT state")
Using DATA/SACK to detect a new connection on secondary/alternate paths
works only on new connections, while a HEARTBEAT is required on
connection re-use. It is probably consistent to wait for HEARTBEAT to
create a secondary connection in conntrack.
Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
skb_header_pointer() will return NULL if offset + sizeof(_sch) exceeds
skb->len, so this offset < skb->len test is redundant.
if sch->length == 0, this will end up in an infinite loop, add a check
for sch->length > 0
Fixes: 9fb9cbb108 ("[NETFILTER]: Add nf_conntrack subsystem.")
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
RFC 9260, Sec 8.5.1 states that for ABORT/SHUTDOWN_COMPLETE, the chunk
MUST be accepted if the vtag of the packet matches its own tag and the
T bit is not set OR if it is set to its peer's vtag and the T bit is set
in chunk flags. Otherwise the packet MUST be silently dropped.
Update vtag verification for ABORT/SHUTDOWN_COMPLETE based on the above
description.
Fixes: 9fb9cbb108 ("[NETFILTER]: Add nf_conntrack subsystem.")
Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Fix overlap detection in rbtree set backend: Detect overlap by going
through the ordered list of valid tree nodes. To shorten the number of
visited nodes in the list, this algorithm descends the tree to search
for an existing element greater than the key value to insert that is
greater than the new element.
2) Fix for the rbtree set garbage collector: Skip inactive and busy
elements when checking for expired elements to avoid interference
with an ongoing transaction from control plane.
This is a rather large fix coming at this stage of the 6.2-rc. Since
33c7aba0b4 ("netfilter: nf_tables: do not set up extensions for end
interval"), bogus overlap errors in the rbtree set occur more frequently.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nft_set_rbtree: skip elements in transaction from garbage collection
netfilter: nft_set_rbtree: Switch to node list walk for overlap detection
====================
Link: https://lore.kernel.org/r/20230123211601.292930-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
if (!type)
continue;
if (type > RTAX_MAX)
return false;
...
fi_val = fi->fib_metrics->metrics[type - 1];
@type being used as an array index, we need to prevent
cpu speculation or risk leaking kernel memory content.
Fixes: 5f9ae3d9e7 ("ipv4: do metrics match when looking up and deleting a route")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230120133140.3624204-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
if (!type)
continue;
if (type > RTAX_MAX)
return -EINVAL;
...
metrics[type - 1] = val;
@type being used as an array index, we need to prevent
cpu speculation or risk leaking kernel memory content.
Fixes: 6cf9dfd3bd ("net: fib: move metrics parsing to a helper")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20230120133040.3623463-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
netlink_getsockbyportid() reads sk_state while a concurrent
netlink_connect() can change its value.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
netlink_getname(), netlink_sendmsg() and netlink_getsockbyportid()
can read nlk->dst_portid and nlk->dst_group while another
thread is changing them.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Skip interference with an ongoing transaction, do not perform garbage
collection on inactive elements. Reset annotated previous end interval
if the expired element is marked as busy (control plane removed the
element right before expiration).
Fixes: 8d8540c4f5 ("netfilter: nft_set_rbtree: add timeout support")
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
...instead of a tree descent, which became overly complicated in an
attempt to cover cases where expired or inactive elements would affect
comparisons with the new element being inserted.
Further, it turned out that it's probably impossible to cover all those
cases, as inactive nodes might entirely hide subtrees consisting of a
complete interval plus a node that makes the current insertion not
overlap.
To speed up the overlap check, descent the tree to find a greater
element that is closer to the key value to insert. Then walk down the
node list for overlap detection. Starting the overlap check from
rb_first() unconditionally is slow, it takes 10 times longer due to the
full linear traversal of the list.
Moreover, perform garbage collection of expired elements when walking
down the node list to avoid bogus overlap reports.
For the insertion operation itself, this essentially reverts back to the
implementation before commit 7c84d41416 ("netfilter: nft_set_rbtree:
Detect partial overlaps on insertion"), except that cases of complete
overlap are already handled in the overlap detection phase itself, which
slightly simplifies the loop to find the insertion point.
Based on initial patch from Stefano Brivio, including text from the
original patch description too.
Fixes: 7c84d41416 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion")
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When proxying IPv6 NDP requests, the adverts to the initial multicast
solicits are correct and working. On the other hand, when later a
reachability confirmation is requested (on unicast), no reply is sent.
This causes the neighbor entry expiring on the sending node, which is
mostly a non-issue, as a new multicast request is sent. There are
routers, where the multicast requests are intentionally delayed, and in
these environments the current implementation causes periodic packet
loss for the proxied endpoints.
The root cause is the erroneous decrease of the hop limit, as this
is checked in ndisc.c and no answer is generated when it's 254 instead
of the correct 255.
Cc: stable@vger.kernel.org
Fixes: 46c7655f0b ("ipv6: decrease hop limit counter in ip6_forward()")
Signed-off-by: Gergely Risko <gergely.risko@gmail.com>
Tested-by: Gergely Risko <gergely.risko@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to the fact that the kernel-side data structures have been carried
over from the ioctl-based ethtool, we are now in the situation where we
have an ethnl_update_bool32() function, but the plain function that
operates on a boolean value kept in an actual u8 netlink attribute
doesn't exist.
With new ethtool features that are exposed solely over netlink, the
kernel data structures will use the "bool" type, so we will need this
kind of helper. Introduce it now; it's needed for things like
verify-disabled for the MAC merge configuration.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
int type = nla_type(nla);
if (type > XFRMA_MAX) {
return -EOPNOTSUPP;
}
@type is then used as an array index and can be used
as a Spectre v1 gadget.
if (nla_len(nla) < compat_policy[type].len) {
array_index_nospec() can be used to prevent leaking
content of kernel memory to malicious users.
Fixes: 5106f4a8ac ("xfrm/compat: Add 32=>64-bit messages translator")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dmitry Safonov <dima@arista.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Reviewed-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
If net_assign_generic() fails, the current error path in ops_init() tries
to clear the gen pointer slot. Anyway, in such error path, the gen pointer
itself has not been modified yet, and the existing and accessed one is
smaller than the accessed index, causing an out-of-bounds error:
BUG: KASAN: slab-out-of-bounds in ops_init+0x2de/0x320
Write of size 8 at addr ffff888109124978 by task modprobe/1018
CPU: 2 PID: 1018 Comm: modprobe Not tainted 6.2.0-rc2.mptcp_ae5ac65fbed5+ #1641
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.1-2.fc37 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x6a/0x9f
print_address_description.constprop.0+0x86/0x2b5
print_report+0x11b/0x1fb
kasan_report+0x87/0xc0
ops_init+0x2de/0x320
register_pernet_operations+0x2e4/0x750
register_pernet_subsys+0x24/0x40
tcf_register_action+0x9f/0x560
do_one_initcall+0xf9/0x570
do_init_module+0x190/0x650
load_module+0x1fa5/0x23c0
__do_sys_finit_module+0x10d/0x1b0
do_syscall_64+0x58/0x80
entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7f42518f778d
Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 73 01 c3 48 8b 0d cb 56 2c 00 f7 d8 64 89 01 48
RSP: 002b:00007fff96869688 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 00005568ef7f7c90 RCX: 00007f42518f778d
RDX: 0000000000000000 RSI: 00005568ef41d796 RDI: 0000000000000003
RBP: 00005568ef41d796 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
R13: 00005568ef7f7d30 R14: 0000000000040000 R15: 0000000000000000
</TASK>
This change addresses the issue by skipping the gen pointer
de-reference in the mentioned error-path.
Found by code inspection and verified with explicit error injection
on a kasan-enabled kernel.
Fixes: d266935ac4 ("net: fix UAF issue in nfqnl_nf_hook_drop() when ops_init() failed")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/cec4e0f3bb2c77ac03a6154a8508d3930beb5f0f.1674154348.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The initial default value of 0 for tp->rate_app_limited was incorrect,
since a flow is indeed application-limited until it first sends
data. Fixing the default to be 1 is generally correct but also
specifically will help user-space applications avoid using the initial
tcpi_delivery_rate value of 0 that persists until the connection has
some non-zero bandwidth sample.
Fixes: eb8329e0a0 ("tcp: export data delivery rate")
Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David Morley <morleyd@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Tested-by: David Morley <morleyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The referenced commit changed the error code returned by the kernel
when preventing a non-established socket from attaching the ktls
ULP. Before to such a commit, the user-space got ENOTCONN instead
of EINVAL.
The existing self-tests depend on such error code, and the change
caused a failure:
RUN global.non_established ...
tls.c:1673:non_established:Expected errno (22) == ENOTCONN (107)
non_established: Test failed at step #3
FAIL global.non_established
In the unlikely event existing applications do the same, address
the issue by restoring the prior error code in the above scenario.
Note that the only other ULP performing similar checks at init
time - smc_ulp_ops - also fails with ENOTCONN when trying to attach
the ULP to a non-established socket.
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Fixes: 2c02d41d71 ("net/ulp: prevent ULP without clone op from entering the LISTEN status")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://lore.kernel.org/r/7bb199e7a93317fb6f8bf8b9b2dc71c18f337cde.1674042685.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While one cpu is working on looking up the right socket from ehash
table, another cpu is done deleting the request socket and is about
to add (or is adding) the big socket from the table. It means that
we could miss both of them, even though it has little chance.
Let me draw a call trace map of the server side.
CPU 0 CPU 1
----- -----
tcp_v4_rcv() syn_recv_sock()
inet_ehash_insert()
-> sk_nulls_del_node_init_rcu(osk)
__inet_lookup_established()
-> __sk_nulls_add_node_rcu(sk, list)
Notice that the CPU 0 is receiving the data after the final ack
during 3-way shakehands and CPU 1 is still handling the final ack.
Why could this be a real problem?
This case is happening only when the final ack and the first data
receiving by different CPUs. Then the server receiving data with
ACK flag tries to search one proper established socket from ehash
table, but apparently it fails as my map shows above. After that,
the server fetches a listener socket and then sends a RST because
it finds a ACK flag in the skb (data), which obeys RST definition
in RFC 793.
Besides, Eric pointed out there's one more race condition where it
handles tw socket hashdance. Only by adding to the tail of the list
before deleting the old one can we avoid the race if the reader has
already begun the bucket traversal and it would possibly miss the head.
Many thanks to Eric for great help from beginning to end.
Fixes: 5e0724d027 ("tcp/dccp: fix hashdance race for passive sessions")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/lkml/20230112065336.41034-1-kerneljasonxing@gmail.com/
Link: https://lore.kernel.org/r/20230118015941.1313-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Third set of fixes for v6.2. This time most of them are for drivers,
only one revert for mac80211. For an important mt76 fix we had to
cherry pick two commits from wireless-next.
-----BEGIN PGP SIGNATURE-----
iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmPHoUcRHGt2YWxvQGtl
cm5lbC5vcmcACgkQbhckVSbrbZtaCgf9G16rPowxI6AD+EliFbArdiwrV+mJHDyN
6eVniHDKgiFnvbyqvh+sGpbuYMwqqfERdxU3qi4+YhVGWyNNQYdJlntggKsTVRKK
gtE6h4zAo2DC6F+/zYt/FkQ6mCK6UQsaHDktGEqRP0vxH8Kdk85+yXEuwklI+L1L
w5ZTZ3HRxdtMhF9AmjCVOUrEEFXosanYTwSZ+1nlMEZ8vc5Wg5TH9wgue3Eg+9vx
vUjfRrrjOAlvGCcb9lVvPseH7n0m/U2JbugQkebuEUvzo4Fxcl2mR9pFXLGGbtAM
gseuNUfJftKVyYlTLc8brI7XpSSx6pV75h1EmvrHPkjiw1oSGeK8ig==
=13z+
-----END PGP SIGNATURE-----
Merge tag 'wireless-2023-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Kalle Valo says:
====================
wireless fixes for v6.2
Third set of fixes for v6.2. This time most of them are for drivers,
only one revert for mac80211. For an important mt76 fix we had to
cherry pick two commits from wireless-next.
* tag 'wireless-2023-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
Revert "wifi: mac80211: fix memory leak in ieee80211_if_add()"
wifi: mt76: dma: fix a regression in adding rx buffers
wifi: mt76: handle possible mt76_rx_token_consume failures
wifi: mt76: dma: do not increment queue head if mt76_dma_add_buf fails
wifi: rndis_wlan: Prevent buffer overflow in rndis_query_oid
wifi: brcmfmac: fix regression for Broadcom PCIe wifi devices
wifi: brcmfmac: avoid NULL-deref in survey dump for 2G only device
wifi: brcmfmac: avoid handling disabled channels for survey dump
====================
Link: https://lore.kernel.org/r/20230118073749.AF061C433EF@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Niera Ayuso says:
====================
The following patchset contains Netfilter fixes for net:
1) Fix syn-retransmits until initiator gives up when connection is re-used
due to rst marked as invalid, from Florian Westphal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot reports a possible deadlock in rfcomm_sk_state_change [1].
While rfcomm_sock_connect acquires the sk lock and waits for
the rfcomm lock, rfcomm_sock_release could have the rfcomm
lock and hit a deadlock for acquiring the sk lock.
Here's a simplified flow:
rfcomm_sock_connect:
lock_sock(sk)
rfcomm_dlc_open:
rfcomm_lock()
rfcomm_sock_release:
rfcomm_sock_shutdown:
rfcomm_lock()
__rfcomm_dlc_close:
rfcomm_k_state_change:
lock_sock(sk)
This patch drops the sk lock before calling rfcomm_dlc_open to
avoid the possible deadlock and holds sk's reference count to
prevent use-after-free after rfcomm_dlc_open completes.
Reported-by: syzbot+d7ce59...@syzkaller.appspotmail.com
Fixes: 1804fdf6e4 ("Bluetooth: btintel: Combine setting up MSFT extension")
Link: https://syzkaller.appspot.com/bug?extid=d7ce59b06b3eb14fd218 [1]
Signed-off-by: Ying Hsu <yinghsu@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This fixes the following trace caused by attempting to lock
cmd_sync_work_lock while holding the rcu_read_lock:
kworker/u3:2/212 is trying to lock:
ffff888002600910 (&hdev->cmd_sync_work_lock){+.+.}-{3:3}, at:
hci_cmd_sync_queue+0xad/0x140
other info that might help us debug this:
context-{4:4}
4 locks held by kworker/u3:2/212:
#0: ffff8880028c6530 ((wq_completion)hci0#2){+.+.}-{0:0}, at:
process_one_work+0x4dc/0x9a0
#1: ffff888001aafde0 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0},
at: process_one_work+0x4dc/0x9a0
#2: ffff888002600070 (&hdev->lock){+.+.}-{3:3}, at:
hci_cc_le_set_cig_params+0x64/0x4f0
#3: ffffffffa5994b00 (rcu_read_lock){....}-{1:2}, at:
hci_cc_le_set_cig_params+0x2f9/0x4f0
Fixes: 26afbd826e ("Bluetooth: Add initial implementation of CIS connections")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When hci_cmd_sync_queue() failed in hci_update_adv_data(), inst_ptr is
not freed, which will cause memory leak, convert to use ERR_PTR/PTR_ERR
to pass the instance to callback so no memory needs to be allocated.
Fixes: 651cd3d65b ("Bluetooth: convert hci_update_adv_data to hci_sync")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When hci_cmd_sync_queue() failed in hci_le_terminate_big() or
hci_le_big_terminate(), the memory pointed by variable d is not freed,
which will cause memory leak. Add release process to error path.
Fixes: eca0ae4aea ("Bluetooth: Add initial implementation of BIS connections")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Don't try to use HCI_OP_LE_READ_BUFFER_SIZE_V2 if controller don't
support ISO channels, but in order to check if ISO channels are
supported HCI_OP_LE_READ_LOCAL_FEATURES needs to be done earlier so the
features bits can be checked on hci_le_read_buffer_size_sync.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216817
Fixes: c1631dbc00 ("Bluetooth: hci_sync: Fix hci_read_buffer_size_sync")
Cc: stable@vger.kernel.org # 6.1
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Smatch Warning:
net/bluetooth/mgmt_util.c:375 mgmt_mesh_add() error: __memcpy()
'mesh_tx->param' too small (48 vs 50)
Analysis:
'mesh_tx->param' is array of size 48. This is the destination.
u8 param[sizeof(struct mgmt_cp_mesh_send) + 29]; // 19 + 29 = 48.
But in the caller 'mesh_send' we reject only when len > 50.
len > (MGMT_MESH_SEND_SIZE + 31) // 19 + 31 = 50.
Fixes: b338d91703 ("Bluetooth: Implement support for Mesh")
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Signed-off-by: Brian Gix <brian.gix@intel.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When a connection is re-used, following can happen:
[ connection starts to close, fin sent in either direction ]
> syn # initator quickly reuses connection
< ack # peer sends a challenge ack
> rst # rst, sequence number == ack_seq of previous challenge ack
> syn # this syn is expected to pass
Problem is that the rst will fail window validation, so it gets
tagged as invalid.
If ruleset drops such packets, we get repeated syn-retransmits until
initator gives up or peer starts responding with syn/ack.
Before the commit indicated in the "Fixes" tag below this used to work:
The challenge-ack made conntrack re-init state based on the challenge
ack itself, so the following rst would pass window validation.
Add challenge-ack support: If we get ack for syn, record the ack_seq,
and then check if the rst sequence number matches the last ack number
seen in reverse direction.
Fixes: c7aab4f170 ("netfilter: nf_conntrack_tcp: re-init for syn packets only")
Reported-by: Michal Tesar <mtesar@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The code in l2tp_tunnel_register() is racy in several ways:
1. It modifies the tunnel socket _after_ publishing it.
2. It calls setup_udp_tunnel_sock() on an existing socket without
locking.
3. It changes sock lock class on fly, which triggers many syzbot
reports.
This patch amends all of them by moving socket initialization code
before publishing and under sock lock. As suggested by Jakub, the
l2tp lockdep class is not necessary as we can just switch to
bh_lock_sock_nested().
Fixes: 37159ef2c1 ("l2tp: fix a lockdep splat")
Fixes: 6b9f34239b ("l2tp: fix races in tunnel creation")
Reported-by: syzbot+52866e24647f9a23403f@syzkaller.appspotmail.com
Reported-by: syzbot+94cc2a66fc228b23f360@syzkaller.appspotmail.com
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Guillaume Nault <gnault@redhat.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tom Parkin <tparkin@katalix.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
l2tp uses l2tp_tunnel_list to track all registered tunnels and
to allocate tunnel ID's. IDR can do the same job.
More importantly, with IDR we can hold the ID before a successful
registration so that we don't need to worry about late error
handling, it is not easy to rollback socket changes.
This is a preparation for the following fix.
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Guillaume Nault <gnault@redhat.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Tom Parkin <tparkin@katalix.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot reported a nasty crash [1] in net_tx_action() which
made little sense until we got a repro.
This repro installs a taprio qdisc, but providing an
invalid TCA_RATE attribute.
qdisc_create() has to destroy the just initialized
taprio qdisc, and taprio_destroy() is called.
However, the hrtimer used by taprio had already fired,
therefore advance_sched() called __netif_schedule().
Then net_tx_action was trying to use a destroyed qdisc.
We can not undo the __netif_schedule(), so we must wait
until one cpu serviced the qdisc before we can proceed.
Many thanks to Alexander Potapenko for his help.
[1]
BUG: KMSAN: uninit-value in queued_spin_trylock include/asm-generic/qspinlock.h:94 [inline]
BUG: KMSAN: uninit-value in do_raw_spin_trylock include/linux/spinlock.h:191 [inline]
BUG: KMSAN: uninit-value in __raw_spin_trylock include/linux/spinlock_api_smp.h:89 [inline]
BUG: KMSAN: uninit-value in _raw_spin_trylock+0x92/0xa0 kernel/locking/spinlock.c:138
queued_spin_trylock include/asm-generic/qspinlock.h:94 [inline]
do_raw_spin_trylock include/linux/spinlock.h:191 [inline]
__raw_spin_trylock include/linux/spinlock_api_smp.h:89 [inline]
_raw_spin_trylock+0x92/0xa0 kernel/locking/spinlock.c:138
spin_trylock include/linux/spinlock.h:359 [inline]
qdisc_run_begin include/net/sch_generic.h:187 [inline]
qdisc_run+0xee/0x540 include/net/pkt_sched.h:125
net_tx_action+0x77c/0x9a0 net/core/dev.c:5086
__do_softirq+0x1cc/0x7fb kernel/softirq.c:571
run_ksoftirqd+0x2c/0x50 kernel/softirq.c:934
smpboot_thread_fn+0x554/0x9f0 kernel/smpboot.c:164
kthread+0x31b/0x430 kernel/kthread.c:376
ret_from_fork+0x1f/0x30
Uninit was created at:
slab_post_alloc_hook mm/slab.h:732 [inline]
slab_alloc_node mm/slub.c:3258 [inline]
__kmalloc_node_track_caller+0x814/0x1250 mm/slub.c:4970
kmalloc_reserve net/core/skbuff.c:358 [inline]
__alloc_skb+0x346/0xcf0 net/core/skbuff.c:430
alloc_skb include/linux/skbuff.h:1257 [inline]
nlmsg_new include/net/netlink.h:953 [inline]
netlink_ack+0x5f3/0x12b0 net/netlink/af_netlink.c:2436
netlink_rcv_skb+0x55d/0x6c0 net/netlink/af_netlink.c:2507
rtnetlink_rcv+0x30/0x40 net/core/rtnetlink.c:6108
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0xf3b/0x1270 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x1288/0x1440 net/netlink/af_netlink.c:1921
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg net/socket.c:734 [inline]
____sys_sendmsg+0xabc/0xe90 net/socket.c:2482
___sys_sendmsg+0x2a1/0x3f0 net/socket.c:2536
__sys_sendmsg net/socket.c:2565 [inline]
__do_sys_sendmsg net/socket.c:2574 [inline]
__se_sys_sendmsg net/socket.c:2572 [inline]
__x64_sys_sendmsg+0x367/0x540 net/socket.c:2572
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
CPU: 0 PID: 13 Comm: ksoftirqd/0 Not tainted 6.0.0-rc2-syzkaller-47461-gac3859c02d7f #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/22/2022
Fixes: 5a781ccbd1 ("tc: Add support for configuring the taprio scheduler")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pable Neira Ayuso says:
====================
The following patchset contains Netfilter fixes for net:
1) Increase timeout to 120 seconds for netfilter selftests to fix
nftables transaction tests, from Florian Westphal.
2) Fix overflow in bitmap_ip_create() due to integer arithmetics
in a 64-bit bitmask, from Gavrilov Ilia.
3) Fix incorrect arithmetics in nft_payload with double-tagged
vlan matching.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Peek at old qdisc and graft only when deleting a leaf class in the htb,
rather than when deleting the htb itself. Do not peek at the qdisc of the
netdev queue when destroying the htb. The caller may already have grafted a
new qdisc that is not part of the htb structure being destroyed.
This fix resolves two use cases.
1. Using tc to destroy the htb.
- Netdev was being prematurely activated before the htb was fully
destroyed.
2. Using tc to replace the htb with another qdisc (which also leads to
the htb being destroyed).
- Premature netdev activation like previous case. Newly grafted qdisc
was also getting accidentally overwritten when destroying the htb.
Fixes: d03b195b5a ("sch_htb: Hierarchical QoS hardware offload")
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230113005528.302625-1-rrameshbabu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If an MPTCP socket has been created with AF_INET6 and the IPV6_V6ONLY
option has been set, the userspace PM would allow creating subflows
using IPv4 addresses, e.g. mapped in v6.
The kernel side of userspace PM will also accept creating subflows with
local and remote addresses having different families. Depending on the
subflow socket's family, different behaviours are expected:
- If AF_INET is forced with a v6 address, the kernel will take the last
byte of the IP and try to connect to that: a new subflow is created
but to a non expected address.
- If AF_INET6 is forced with a v4 address, the kernel will try to
connect to a v4 address (v4-mapped-v6). A -EBADF error from the
connect() part is then expected.
It is then required to check the given families can be accepted. This is
done by using a new helper for addresses family matching, taking care of
IPv4 vs IPv4-mapped-IPv6 addresses. This helper will be re-used later by
the in-kernel path-manager to use mixed IPv4 and IPv6 addresses.
While at it, a clear error message is now reported if there are some
conflicts with the families that have been passed by the userspace.
Fixes: 702c2f646d ("mptcp: netlink: allow userspace-driven subflow establishment")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Let the caller specify the to-be-created subflow family.
For a given MPTCP socket created with the AF_INET6 family, the current
userspace PM can already ask the kernel to create subflows in v4 and v6.
If "plain" IPv4 addresses are passed to the kernel, they are
automatically mapped in v6 addresses "by accident". This can be
problematic because the userspace will need to pass different addresses,
now the v4-mapped-v6 addresses to destroy this new subflow.
On the other hand, if the MPTCP socket has been created with the AF_INET
family, the command to create a subflow in v6 will be accepted but the
result will not be the one as expected as new subflow will be created in
IPv4 using part of the v6 addresses passed to the kernel: not creating
the expected subflow then.
No functional change intended for the in-kernel PM where an explicit
enforcement is currently in place. This arbitrary enforcement will be
leveraged by other patches in a future version.
Fixes: 702c2f646d ("mptcp: netlink: allow userspace-driven subflow establishment")
Cc: stable@vger.kernel.org
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix a use-after-free that occurs in kfree_skb() called from
local_cleanup(). This could happen when killing nfc daemon (e.g. neard)
after detaching an nfc device.
When detaching an nfc device, local_cleanup() called from
nfc_llcp_unregister_device() frees local->rx_pending and decreases
local->ref by kref_put() in nfc_llcp_local_put().
In the terminating process, nfc daemon releases all sockets and it leads
to decreasing local->ref. After the last release of local->ref,
local_cleanup() called from local_release() frees local->rx_pending
again, which leads to the bug.
Setting local->rx_pending to NULL in local_cleanup() could prevent
use-after-free when local_cleanup() is called twice.
Found by a modified version of syzkaller.
BUG: KASAN: use-after-free in kfree_skb()
Call Trace:
dump_stack_lvl (lib/dump_stack.c:106)
print_address_description.constprop.0.cold (mm/kasan/report.c:306)
kasan_check_range (mm/kasan/generic.c:189)
kfree_skb (net/core/skbuff.c:955)
local_cleanup (net/nfc/llcp_core.c:159)
nfc_llcp_local_put.part.0 (net/nfc/llcp_core.c:172)
nfc_llcp_local_put (net/nfc/llcp_core.c:181)
llcp_sock_destruct (net/nfc/llcp_sock.c:959)
__sk_destruct (net/core/sock.c:2133)
sk_destruct (net/core/sock.c:2181)
__sk_free (net/core/sock.c:2192)
sk_free (net/core/sock.c:2203)
llcp_sock_release (net/nfc/llcp_sock.c:646)
__sock_release (net/socket.c:650)
sock_close (net/socket.c:1365)
__fput (fs/file_table.c:306)
task_work_run (kernel/task_work.c:179)
ptrace_notify (kernel/signal.c:2354)
syscall_exit_to_user_mode_prepare (kernel/entry/common.c:278)
syscall_exit_to_user_mode (kernel/entry/common.c:296)
do_syscall_64 (arch/x86/entry/common.c:86)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:106)
Allocated by task 4719:
kasan_save_stack (mm/kasan/common.c:45)
__kasan_slab_alloc (mm/kasan/common.c:325)
slab_post_alloc_hook (mm/slab.h:766)
kmem_cache_alloc_node (mm/slub.c:3497)
__alloc_skb (net/core/skbuff.c:552)
pn533_recv_response (drivers/nfc/pn533/usb.c:65)
__usb_hcd_giveback_urb (drivers/usb/core/hcd.c:1671)
usb_giveback_urb_bh (drivers/usb/core/hcd.c:1704)
tasklet_action_common.isra.0 (kernel/softirq.c:797)
__do_softirq (kernel/softirq.c:571)
Freed by task 1901:
kasan_save_stack (mm/kasan/common.c:45)
kasan_set_track (mm/kasan/common.c:52)
kasan_save_free_info (mm/kasan/genericdd.c:518)
__kasan_slab_free (mm/kasan/common.c:236)
kmem_cache_free (mm/slub.c:3809)
kfree_skbmem (net/core/skbuff.c:874)
kfree_skb (net/core/skbuff.c:931)
local_cleanup (net/nfc/llcp_core.c:159)
nfc_llcp_unregister_device (net/nfc/llcp_core.c:1617)
nfc_unregister_device (net/nfc/core.c:1179)
pn53x_unregister_nfc (drivers/nfc/pn533/pn533.c:2846)
pn533_usb_disconnect (drivers/nfc/pn533/usb.c:579)
usb_unbind_interface (drivers/usb/core/driver.c:458)
device_release_driver_internal (drivers/base/dd.c:1279)
bus_remove_device (drivers/base/bus.c:529)
device_del (drivers/base/core.c:3665)
usb_disable_device (drivers/usb/core/message.c:1420)
usb_disconnect (drivers/usb/core.c:2261)
hub_event (drivers/usb/core/hub.c:5833)
process_one_work (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:212 include/trace/events/workqueue.h:108 kernel/workqueue.c:2281)
worker_thread (include/linux/list.h:282 kernel/workqueue.c:2423)
kthread (kernel/kthread.c:319)
ret_from_fork (arch/x86/entry/entry_64.S:301)
Fixes: 3536da06db ("NFC: llcp: Clean local timers and works when removing a device")
Signed-off-by: Jisoo Jang <jisoo.jang@yonsei.ac.kr>
Link: https://lore.kernel.org/r/20230111131914.3338838-1-jisoo.jang@yonsei.ac.kr
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current code for RSS_GET ethtool command includes netlink attributes
in reply message to user space even if they are null. Added checks
to include netlink attribute in reply message only if a value is
received from driver. Drivers might return null for RSS indirection
table or hash key. Instead of including attributes with empty value
in the reply message, add netlink attribute only if there is content.
Fixes: 7112a04664 ("ethtool: add netlink based get rss support")
Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Link: https://lore.kernel.org/r/20230111235607.85509-1-sudheer.mogilappagari@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix rxrpc_connect_call() to return -ENOMEM rather than 0 if it fails to
look up a peer.
This generated a smatch warning:
net/rxrpc/call_object.c:303 rxrpc_connect_call() warn: missing error code 'ret'
I think this also fixes a syzbot-found bug:
rxrpc: Assertion failed - 1(0x1) == 11(0xb) is false
------------[ cut here ]------------
kernel BUG at net/rxrpc/call_object.c:645!
where the call being put is in the wrong state - as would be the case if we
failed to clear up correctly after the error in rxrpc_connect_call().
Fixes: 9d35d880e0 ("rxrpc: Move client call connection to the I/O thread")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Reported-and-tested-by: syzbot+4bb6356bb29d6299360e@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/202301111153.9eZRYLf1-lkp@intel.com/
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/2438405.1673460435@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Current release - regressions:
- rxrpc:
- only disconnect calls in the I/O thread
- move client call connection to the I/O thread
- fix incoming call setup race
- eth: mlx5:
- restore pkt rate policing support
- fix memory leak on updating vport counters
Previous releases - regressions:
- gro: take care of DODGY packets
- ipv6: deduct extension header length in rawv6_push_pending_frames
- tipc: fix unexpected link reset due to discovery messages
Previous releases - always broken:
- sched: disallow noqueue for qdisc classes
- eth: ice: fix potential memory leak in ice_gnss_tty_write()
- eth: ixgbe: fix pci device refcount leak
- eth: mlx5:
- fix command stats access after free
- fix macsec possible null dereference when updating MAC security entity (SecY)
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmPAGskSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOk9aQQAInUtOfTi0EwR5oveTMWOcDc8P1rGFru
Yfid6d4gVRKDm9tosW8HSlnMVCDIGrvhmwVfMevkLjgQtgRXYecXM7MYMVH+6f6e
yIF0azu5z2PEQvfTLTuTN++bQ3lgyfYXOB3mScCOtBE9BFXwjtL111Qby1QlvHTg
sPIH5kCxpDfg3i2rge1BiyoQ4BWc4c6Us86CriKDX1vl7lilJccpWYxKFY8hyRzl
PF3OVJlMph7jny4zKOa2chWUnDj5ycK289/x2rOla4EOX7R8IHDyL+sAAAvdm7/q
FDuuetC3M+eo8/NTLiZkjTipw1nO+G0c1VtzAZ/wX1QkomwmN0yyPx47EllVH+ez
YQ80UrXOF3f7xYXHZIhwCrIVSaHpLyZHSfDBW1r+vTokIRSJ+5TOIH/YAUUKSR0U
kE4r+eHU3AdcBsDV0pZXtE0mUxROwRatOt5u+XQ3WYdORDyKo0HYu8QskIurgqUv
Cnr554zogmC4Bt/uY7j5u9NvhUH/Xyp5RVXtaQnwz+hcncgVFASDDpblejHE2Lcu
8fb1NrwB7AsPnMDGUSjnG0BbQaTo6ccacBrIrhWxRbEBAQmZEbO507yoYz21EBM5
5XKWd1bTq1YG5oPYl9WR3FI9hSQN7vKsUoW4SXsuh5j65ENhCAwBK8i3liy1j+dS
sf5xUgCg6KyH
=4ANJ
-----END PGP SIGNATURE-----
Merge tag 'net-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from rxrpc.
The rxrpc changes are noticeable large: to address a recent regression
has been necessary completing the threaded refactor.
Current release - regressions:
- rxrpc:
- only disconnect calls in the I/O thread
- move client call connection to the I/O thread
- fix incoming call setup race
- eth: mlx5:
- restore pkt rate policing support
- fix memory leak on updating vport counters
Previous releases - regressions:
- gro: take care of DODGY packets
- ipv6: deduct extension header length in rawv6_push_pending_frames
- tipc: fix unexpected link reset due to discovery messages
Previous releases - always broken:
- sched: disallow noqueue for qdisc classes
- eth: ice: fix potential memory leak in ice_gnss_tty_write()
- eth: ixgbe: fix pci device refcount leak
- eth: mlx5:
- fix command stats access after free
- fix macsec possible null dereference when updating MAC security
entity (SecY)"
* tag 'net-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (64 commits)
r8152: add vendor/device ID pair for Microsoft Devkit
net: stmmac: add aux timestamps fifo clearance wait
bnxt: make sure we return pages to the pool
net: hns3: fix wrong use of rss size during VF rss config
ipv6: raw: Deduct extension header length in rawv6_push_pending_frames
net: lan966x: check for ptp to be enabled in lan966x_ptp_deinit()
net: sched: disallow noqueue for qdisc classes
iavf/iavf_main: actually log ->src mask when talking about it
igc: Fix PPS delta between two synchronized end-points
ixgbe: fix pci device refcount leak
octeontx2-pf: Fix resource leakage in VF driver unbind
selftests/net: l2_tos_ttl_inherit.sh: Ensure environment cleanup on failure.
selftests/net: l2_tos_ttl_inherit.sh: Run tests in their own netns.
selftests/net: l2_tos_ttl_inherit.sh: Set IPv6 addresses with "nodad".
net/mlx5e: Fix macsec possible null dereference when updating MAC security entity (SecY)
net/mlx5e: Fix macsec ssci attribute handling in offload path
net/mlx5: E-switch, Coverity: overlapping copy
net/mlx5e: Don't support encap rules with gbp option
net/mlx5: Fix ptp max frequency adjustment range
net/mlx5e: Fix memory leak on updating vport counters
...
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCY76ohgAKCRCAXGG7T9hj
vo8fAP0XJ94B7asqcN4W3EyeyfqxUf1eZvmWRhrbKqpLnmHLaQEA/uJBkXL49Zj7
TTcbxR1coJ/hPwhtmONU4TNtCZ+RXw0=
=2Ib5
-----END PGP SIGNATURE-----
Merge tag 'for-linus-6.2-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- two cleanup patches
- a fix of a memory leak in the Xen pvfront driver
- a fix of a locking issue in the Xen hypervisor console driver
* tag 'for-linus-6.2-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/pvcalls: free active map buffer on pvcalls_front_free_map
hvc/xen: lock console list traversal
x86/xen: Remove the unused function p2m_index()
xen: make remove callback of xen driver void returned
If the offset + length goes over the ethernet + vlan header, then the
length is adjusted to copy the bytes that are within the boundaries of
the vlan_ethhdr scratchpad area. The remaining bytes beyond ethernet +
vlan header are copied directly from the skbuff data area.
Fix incorrect arithmetic operator: subtract, not add, the size of the
vlan header in case of double-tagged packets to adjust the length
accordingly to address CVE-2023-0179.
Reported-by: Davide Ornaghi <d.ornaghi97@gmail.com>
Fixes: f6ae9f120d ("netfilter: nft_payload: add C-VLAN support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When first_ip is 0, last_ip is 0xFFFFFFFF, and netmask is 31, the value of
an arithmetic expression 2 << (netmask - mask_bits - 1) is subject
to overflow due to a failure casting operands to a larger data type
before performing the arithmetic.
Note that it's harmless since the value will be checked at the next step.
Found by InfoTeCS on behalf of Linux Verification Center
(linuxtesting.org) with SVACE.
Fixes: b9fed74818 ("netfilter: ipset: Check and reject crazy /0 input parameters")
Signed-off-by: Ilia.Gavrilov <Ilia.Gavrilov@infotecs.ru>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The total cork length created by ip6_append_data includes extension
headers, so we must exclude them when comparing them against the
IPV6_CHECKSUM offset which does not include extension headers.
Reported-by: Kyle Zeng <zengyhkyle@gmail.com>
Fixes: 357b40a18b ("[IPV6]: IPV6_CHECKSUM socket option can corrupt kernel memory")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
While experimenting with applying noqueue to a classful queue discipline,
we discovered a NULL pointer dereference in the __dev_queue_xmit()
path that generates a kernel OOPS:
# dev=enp0s5
# tc qdisc replace dev $dev root handle 1: htb default 1
# tc class add dev $dev parent 1: classid 1:1 htb rate 10mbit
# tc qdisc add dev $dev parent 1:1 handle 10: noqueue
# ping -I $dev -w 1 -c 1 1.1.1.1
[ 2.172856] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 2.173217] #PF: supervisor instruction fetch in kernel mode
...
[ 2.178451] Call Trace:
[ 2.178577] <TASK>
[ 2.178686] htb_enqueue+0x1c8/0x370
[ 2.178880] dev_qdisc_enqueue+0x15/0x90
[ 2.179093] __dev_queue_xmit+0x798/0xd00
[ 2.179305] ? _raw_write_lock_bh+0xe/0x30
[ 2.179522] ? __local_bh_enable_ip+0x32/0x70
[ 2.179759] ? ___neigh_create+0x610/0x840
[ 2.179968] ? eth_header+0x21/0xc0
[ 2.180144] ip_finish_output2+0x15e/0x4f0
[ 2.180348] ? dst_output+0x30/0x30
[ 2.180525] ip_push_pending_frames+0x9d/0xb0
[ 2.180739] raw_sendmsg+0x601/0xcb0
[ 2.180916] ? _raw_spin_trylock+0xe/0x50
[ 2.181112] ? _raw_spin_unlock_irqrestore+0x16/0x30
[ 2.181354] ? get_page_from_freelist+0xcd6/0xdf0
[ 2.181594] ? sock_sendmsg+0x56/0x60
[ 2.181781] sock_sendmsg+0x56/0x60
[ 2.181958] __sys_sendto+0xf7/0x160
[ 2.182139] ? handle_mm_fault+0x6e/0x1d0
[ 2.182366] ? do_user_addr_fault+0x1e1/0x660
[ 2.182627] __x64_sys_sendto+0x1b/0x30
[ 2.182881] do_syscall_64+0x38/0x90
[ 2.183085] entry_SYSCALL_64_after_hwframe+0x63/0xcd
...
[ 2.187402] </TASK>
Previously in commit d66d6c3152 ("net: sched: register noqueue
qdisc"), NULL was set for the noqueue discipline on noqueue init
so that __dev_queue_xmit() falls through for the noqueue case. This
also sets a bypass of the enqueue NULL check in the
register_qdisc() function for the struct noqueue_disc_ops.
Classful queue disciplines make it past the NULL check in
__dev_queue_xmit() because the discipline is set to htb (in this case),
and then in the call to __dev_xmit_skb(), it calls into htb_enqueue()
which grabs a leaf node for a class and then calls qdisc_enqueue() by
passing in a queue discipline which assumes ->enqueue() is not set to NULL.
Fix this by not allowing classes to be assigned to the noqueue
discipline. Linux TC Notes states that classes cannot be set to
the noqueue discipline. [1] Let's enforce that here.
Links:
1. https://linux-tc-notes.sourceforge.net/tc/doc/sch_noqueue.txt
Fixes: d66d6c3152 ("net: sched: register noqueue qdisc")
Cc: stable@vger.kernel.org
Signed-off-by: Frederick Lawler <fred@cloudflare.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20230109163906.706000-1-fred@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- Fix a race when creating NFSv4 files
- Revert the use of relaxed bitops
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmO9twgACgkQM2qzM29m
f5cQ/A//SSv/eZl2cnAMZtN1zd7wIMfI6E9y8Ccv49aebUXGmGDKSwz/CUns2sgO
avWentInUYg2cexIjaQnQeGkiQt0Do+3u/cdT86h2e8q3UhvctYWO5uRCqbP+36H
JRLfNUUbic4P8Yp/LZ5DvwOWae4PLdZq71mxJkaTXGHt8zLn/yEntCY8jb6V7D2L
SxMXAoO05bdzfPc8lXKmaGi4JMsANEOMh5ZMRpKxKTEFQG352db17MqwOAW/Qe+t
mMXY2jRfeufFwimmwLK06EzItgcs6D9g7dM3oIwDUNiPL4l3lOYeynbYOref7fD3
4u11LwZdzZ5LYIZ0HoTpRu3ZxAbrTtmd1FiT7SwN9jjq1vu0Zx0sfqk0R9VixY3c
jP+wYKEDTQUkIVdbG6g/u6yQZvwM281+GiAXoD3FJWKJDwAaqwxd6cphCn314RKY
hlgG4DGhAi0BYbiLVu5ObQwRb1yPgCP2pXqguAdAKbTM2DVC2+hAW3NDUcIKrR1U
JoXmGBaWeuJU9/0JbfVzddXUCs227hnovj1nmGW7E8JUegW4m+3oscEP8tsC5H5S
J3Jr9ovxyYGQE1qxM5909hjPjrZxI3NszKIpgWoo9/jJLUWfGtnS2BclrXUxQrdl
rvbKHvmSLyOsFYnZ5Nt7uj1l7LtWMljrjOjPqe02iU6pRDNHa9Y=
=/7AX
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Fix a race when creating NFSv4 files
- Revert the use of relaxed bitops
* tag 'nfsd-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
NFSD: Use set_bit(RQ_DROPME)
Revert "SUNRPC: Use RMW bitops in single-threaded hot paths"
nfsd: fix handling of cached open files in nfsd4_open codepath
Instead of preventing adding AP_VLAN to MLO enabled APs, this check was
preventing adding more than one 4-addr AP_VLAN regardless of the MLO status.
Fix this by adding missing extra checks.
Fixes: ae960ee90b ("wifi: mac80211: prevent VLANs on MLDs")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20221214130326.37756-1-nbd@nbd.name
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When multiple interfaces are present in the local interface
list, new skb copy is taken before rx processing except for
the first interface. The address translation happens each
time only on the original skb since the hdr pointer is not
updated properly to the newly created skb.
As a result frames start to drop in userspace when address
based checks or search fails.
Signed-off-by: Sriram R <quic_srirrama@quicinc.com>
Link: https://lore.kernel.org/r/20221208040050.25922-1-quic_srirrama@quicinc.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reset multiple BSSID options when all AP related configurations are
reset in ieee80211_stop_ap().
Stale values result in HWSIM test failures (e.g. p2p_group_cli_invalid),
if run after 'he_ap_ema'.
Reported-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Aloka Dixit <quic_alokad@quicinc.com>
Link: https://lore.kernel.org/r/20221221185616.11514-1-quic_alokad@quicinc.com
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
When a running wake_tx_queue() call is aborted due to a hw queue stop
the corresponding iTXQ is not always correctly marked for resumption:
wake_tx_push_queue() can stops the queue run without setting
@IEEE80211_TXQ_STOP_NETIF_TX.
Without the @IEEE80211_TXQ_STOP_NETIF_TX flag __ieee80211_wake_txqs()
will not schedule a new queue run and remaining frames in the queue get
stuck till another frame is queued to it.
Fix the issue for all drivers - also the ones with custom wake_tx_queue
callbacks - by moving the logic into ieee80211_tx_dequeue() and drop the
redundant @txqs_stopped.
@IEEE80211_TXQ_STOP_NETIF_TX is also renamed to @IEEE80211_TXQ_DIRTY to
better describe the flag.
Fixes: c850e31f79 ("wifi: mac80211: add internal handler for wake_tx_queue")
Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
Link: https://lore.kernel.org/r/20221230121850.218810-1-alexander@wetzel-home.de
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The 'TCA_MPLS_LABEL' attribute is of 'NLA_U32' type, but has a
validation type of 'NLA_VALIDATE_FUNCTION'. This is an invalid
combination according to the comment above 'struct nla_policy':
"
Meaning of `validate' field, use via NLA_POLICY_VALIDATE_FN:
NLA_BINARY Validation function called for the attribute.
All other Unused - but note that it's a union
"
This can trigger the warning [1] in nla_get_range_unsigned() when
validation of the attribute fails. Despite being of 'NLA_U32' type, the
associated 'min'/'max' fields in the policy are negative as they are
aliased by the 'validate' field.
Fix by changing the attribute type to 'NLA_BINARY' which is consistent
with the above comment and all other users of NLA_POLICY_VALIDATE_FN().
As a result, move the length validation to the validation function.
No regressions in MPLS tests:
# ./tdc.py -f tc-tests/actions/mpls.json
[...]
# echo $?
0
[1]
WARNING: CPU: 0 PID: 17743 at lib/nlattr.c:118
nla_get_range_unsigned+0x1d8/0x1e0 lib/nlattr.c:117
Modules linked in:
CPU: 0 PID: 17743 Comm: syz-executor.0 Not tainted 6.1.0-rc8 #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014
RIP: 0010:nla_get_range_unsigned+0x1d8/0x1e0 lib/nlattr.c:117
[...]
Call Trace:
<TASK>
__netlink_policy_dump_write_attr+0x23d/0x990 net/netlink/policy.c:310
netlink_policy_dump_write_attr+0x22/0x30 net/netlink/policy.c:411
netlink_ack_tlv_fill net/netlink/af_netlink.c:2454 [inline]
netlink_ack+0x546/0x760 net/netlink/af_netlink.c:2506
netlink_rcv_skb+0x1b7/0x240 net/netlink/af_netlink.c:2546
rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:6109
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0x5e9/0x6b0 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x739/0x860 net/netlink/af_netlink.c:1921
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg net/socket.c:734 [inline]
____sys_sendmsg+0x38f/0x500 net/socket.c:2482
___sys_sendmsg net/socket.c:2536 [inline]
__sys_sendmsg+0x197/0x230 net/socket.c:2565
__do_sys_sendmsg net/socket.c:2574 [inline]
__se_sys_sendmsg net/socket.c:2572 [inline]
__x64_sys_sendmsg+0x42/0x50 net/socket.c:2572
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Link: https://lore.kernel.org/netdev/CAO4mrfdmjvRUNbDyP0R03_DrD_eFCLCguz6OxZ2TYRSv0K9gxA@mail.gmail.com/
Fixes: 2a2ea50870 ("net: sched: add mpls manipulation actions to TC")
Reported-by: Wei Chen <harperchen1110@gmail.com>
Tested-by: Wei Chen <harperchen1110@gmail.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20230107171004.608436-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jaroslav reported a recent throughput regression with virtio_net
caused by blamed commit.
It is unclear if DODGY GSO packets coming from user space
can be accepted by GRO engine in the future with minimal
changes, and if there is any expected gain from it.
In the meantime, make sure to detect and flush DODGY packets.
Fixes: 5eddb24901 ("gro: add support of (hw)gro packets to gro stack")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-and-bisected-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Cc: Coco Li <lixiaoyan@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This change adds support for nested IPsec tunnels by ensuring that
XFRM-I verifies existing policies before decapsulating a subsequent
policies. Addtionally, this clears the secpath entries after policies
are verified, ensuring that previous tunnels with no-longer-valid
do not pollute subsequent policy checks.
This is necessary especially for nested tunnels, as the IP addresses,
protocol and ports may all change, thus not matching the previous
policies. In order to ensure that packets match the relevant inbound
templates, the xfrm_policy_check should be done before handing off to
the inner XFRM protocol to decrypt and decapsulate.
Notably, raw ESP/AH packets did not perform policy checks inherently,
whereas all other encapsulated packets (UDP, TCP encapsulated) do policy
checks after calling xfrm_input handling in the respective encapsulation
layer.
Test: Verified with additional Android Kernel Unit tests
Signed-off-by: Benedict Wong <benedictwong@google.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmO5PdsACgkQ+7dXa6fL
C2uy4Q/9HhTnnMngwgxx8c8P/fNFjxLKHUhPeILqhBgestVoWjQXtgM7DWam0V13
numjNevavZESSbVJFOcP13ul0PcHgkZyOXtjHkSH+P9vZeWX8rC92ANzKLpBOrqK
lGCBjin9H24dSCFMy5V7cCz23/31jpMAE3SWNd9WDTUe0CjRNWkmB3RNIZupPj2Q
xTtiy4OGC3KVe/PeduEfvwQ+UXQ0GCP+UaP5aLT06NfW+qPKGRN3JTYli3yq4cbP
MR2N5CxE6VraphjSWvob/JixOWBAl6VKXrCRXtyYL1iT0c75CDNeTHqSMg19rAKB
7uBZJD3QgrtmUo07yW3UPK7hZF9/ovlZFnvzbWBSSXCWia+2WW0Ysk9V3TN+TwE2
mquZRe2M7yxlfwNqGBzjotpHf4AMCMzV50QmW6V3tjhR4KBR2SvUyV9rdQdv0lZh
Ymq9L776kH9HRkV8WS/JZpHlLJ2Lvjyk3gGstCq9y/HMqB2n/bKFPXcnGdpxRaQC
ZE94VcQtloLeYCeEMv+FMyg2Pi3GrtRXkQVJqpDTWRfsNpJp5Wlbdm8WSF+QwhWX
59MLILsnPqjPdcruiL8kSVPI40vBvbnHHJbBaDip2EM+IS4dSCWqoaHyYSoM7/BO
c6SPcFOkv8syjroudH1XRlH9bnObvSQOVtZE8PDjjsfJrPEpI+Y=
=iwoA
-----END PGP SIGNATURE-----
Merge tag 'rxrpc-fixes-20230107' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Fix race between call connection, data transmit and call disconnect
Here are patches to fix an oops[1] caused by a race between call
connection, initial packet transmission and call disconnection which
results in something like:
kernel BUG at net/rxrpc/peer_object.c:413!
when the syzbot test is run. The problem is that the connection procedure
is effectively split across two threads and can get expanded by taking an
interrupt, thereby adding the call to the peer error distribution list
*after* it has been disconnected (say by the rxrpc socket shutting down).
The easiest solution is to look at the fourth set of I/O thread
conversion/SACK table expansion patches that didn't get applied[2] and take
from it those patches that move call connection and disconnection into the
I/O thread. Moving these things into the I/O thread means that the
sequencing is managed by all being done in the same thread - and the race
can no longer happen.
This is preferable to introducing an extra lock as adding an extra lock
would make the I/O thread have to wait for the app thread in yet another
place.
The changes can be considered as a number of logical parts:
(1) Move all of the call state changes into the I/O thread.
(2) Make client connection ID space per-local endpoint so that the I/O
thread doesn't need locks to access it.
(3) Move actual abort generation into the I/O thread and clean it up. If
sendmsg or recvmsg want to cause an abort, they have to delegate it.
(4) Offload the setting up of the security context on a connection to the
thread of one of the apps that's starting a call. We don't want to be
doing any sort of crypto in the I/O thread.
(5) Connect calls (ie. assign them to channel slots on connections) in the
I/O thread. Calls are set up by sendmsg/kafs and passed to the I/O
thread to connect. Connections are allocated in the I/O thread after
this.
(6) Disconnect calls in the I/O thread.
I've also added a patch for an unrelated bug that cropped up during
testing, whereby a race can occur between an incoming call and socket
shutdown.
Note that whilst this fixes the original syzbot bug, another bug may get
triggered if this one is fixed:
INFO: rcu detected stall in corrupted
rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { P5792 } 2657 jiffies s: 2825 root: 0x0/T
rcu: blocking rcu_node structures (internal RCU debug):
It doesn't look this should be anything to do with rxrpc, though, as I've
tested an additional patch[3] that removes practically all the RCU usage
from rxrpc and it still occurs. It seems likely that it is being caused by
something in the tunnelling setup that the syzbot test does, but there's
not enough info to go on. It also seems unlikely to be anything to do with
the afs driver as the test doesn't use that.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Highlights include:
Bugfixes
- Fix a race in the RPCSEC_GSS upcall code that causes hung RPC calls
- Fix a broken coalescing test in the pNFS file layout driver
- Ensure that the access cache rcu path also applies the login test
- Fix up for a sparse warning
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmO5s6MACgkQZwvnipYK
APJ8PA//aVnu+DqLTxeGnyr4HNIcm0su4TetghQabAf0S8nk69Ly8W+N+xCRwwCR
/PvSe8j51tXVInpYQKSOTjDl8/0mq8cybf6N9DzDS0A+hKQfdK8jvPLU0yj/aZ6u
D8x17ve7RjzLgw9minltq4eTvVIHpqSh/vVUyQhnj2A8qQlWrboXiQnLQAp5KBbq
7NexvcajgfFHXm7S+edZ8NH5QZ20FNVskvzyrClp46DQLJcdNLHGJLp8tdeD31kE
lsNBsLG0LxOFyBY28MbQptXalhBP8RRwYwxdSdIUnExzB7eKTLS6LgHuK9GV+EzL
wSfRrnyM023momKmdFn3Fy/cRyeF1RwfQON+KAHo1cQ5fFiQS16p7lTELGkApyLc
yOXGpAvaPJuuBJLM+Xi07ovhqpc7SjGmr7cjZslyasNtez60Le06oSihKnrgcKDx
ZVh1pTc89O25GqGm161HaQJY9YU4XCtjxPv7rvdrlhO5Dkgxai293R5aXUFBz1dD
tssv5a2DqIekrE29YaCi/RjIqkngCXHlwniFw/L3/aFRUyHqN9/MyxH5VeCRDI2T
OJQRR056V1FXSBLH6Wuzs2Ixz2HnJCLR7LE0zAoPLm/RwW8CuNmEXu28jfU0Kcmt
Hj5VyxYfnqEAGsUO0O2oNhOnxWY9OHvlvpb/9DWcke0ASpeScB4=
=N6cT
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
- Fix a race in the RPCSEC_GSS upcall code that causes hung RPC calls
- Fix a broken coalescing test in the pNFS file layout driver
- Ensure that the access cache rcu path also applies the login test
- Fix up for a sparse warning
* tag 'nfs-for-6.2-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: Fix up a sparse warning
NFS: Judge the file access cache's timestamp in rcu path
pNFS/filelayout: Fix coalescing test for single DS
SUNRPC: ensure the matching upcall is in-flight upon downcall
An incoming call can race with rxrpc socket destruction, leading to a
leaked call. This may result in an oops when the call timer eventually
expires:
BUG: kernel NULL pointer dereference, address: 0000000000000874
RIP: 0010:_raw_spin_lock_irqsave+0x2a/0x50
Call Trace:
<IRQ>
try_to_wake_up+0x59/0x550
? __local_bh_enable_ip+0x37/0x80
? rxrpc_poke_call+0x52/0x110 [rxrpc]
? rxrpc_poke_call+0x110/0x110 [rxrpc]
? rxrpc_poke_call+0x110/0x110 [rxrpc]
call_timer_fn+0x24/0x120
with a warning in the kernel log looking something like:
rxrpc: Call 00000000ba5e571a still in use (1,SvAwtACK,1061d,0)!
incurred during rmmod of rxrpc. The 1061d is the call flags:
RECVMSG_READ_ALL, RX_HEARD, BEGAN_RX_TIMER, RX_LAST, EXPOSED,
IS_SERVICE, RELEASED
but no DISCONNECTED flag (0x800), so it's an incoming (service) call and
it's still connected.
The race appears to be that:
(1) rxrpc_new_incoming_call() consults the service struct, checks sk_state
and allocates a call - then pauses, possibly for an interrupt.
(2) rxrpc_release_sock() sets RXRPC_CLOSE, nulls the service pointer,
discards the prealloc and releases all calls attached to the socket.
(3) rxrpc_new_incoming_call() resumes, launching the new call, including
its timer and attaching it to the socket.
Fix this by read-locking local->services_lock to access the AF_RXRPC socket
providing the service rather than RCU in rxrpc_new_incoming_call().
There's no real need to use RCU here as local->services_lock is only
write-locked by the socket side in two places: when binding and when
shutting down.
Fixes: 5e6ef4f101 ("rxrpc: Make the I/O thread take over the call and local processor work")
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
The premise that "Once an svc thread is scheduled and executing an
RPC, no other processes will touch svc_rqst::rq_flags" is false.
svc_xprt_enqueue() examines the RQ_BUSY flag in scheduled nfsd
threads when determining which thread to wake up next.
Found via KCSAN.
Fixes: 28df098881 ("SUNRPC: Use RMW bitops in single-threaded hot paths")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
This unexpected behavior is observed:
node 1 | node 2
------ | ------
link is established | link is established
reboot | link is reset
up | send discovery message
receive discovery message |
link is established | link is established
send discovery message |
| receive discovery message
| link is reset (unexpected)
| send reset message
link is reset |
It is due to delayed re-discovery as described in function
tipc_node_check_dest(): "this link endpoint has already reset
and re-established contact with the peer, before receiving a
discovery message from that node."
However, commit 598411d70f has changed the condition for calling
tipc_node_link_down() which was the acceptance of new media address.
This commit fixes this by restoring the old and correct behavior.
Fixes: 598411d70f ("tipc: make resetting of links non-atomic")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move the connection setup of client calls to the I/O thread so that a whole
load of locking and barrierage can be eliminated. This necessitates the
app thread waiting for connection to complete before it can begin
encrypting data.
This also completes the fix for a race that exists between call connection
and call disconnection whereby the data transmission code adds the call to
the peer error distribution list after the call has been disconnected (say
by the rxrpc socket getting closed).
The fix is to complete the process of moving call connection, data
transmission and call disconnection into the I/O thread and thus forcibly
serialising them.
Note that the issue may predate the overhaul to an I/O thread model that
were included in the merge window for v6.2, but the timing is very much
changed by the change given below.
Fixes: cf37b59875 ("rxrpc: Move DATA transmission into call processor work item")
Reported-by: syzbot+c22650d2844392afdcfd@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Move the management of the client connection cache to the I/O thread rather
than managing it from the namespace as an aggregate across all the local
endpoints within the namespace.
This will allow a load of locking to be got rid of in a future patch as
only the I/O thread will be looking at the this.
The downside is that the total number of cached connections on the system
can get higher because the limit is now per-local rather than per-netns.
We can, however, keep the number of client conns in use across the entire
netfs and use that to reduce the expiration time of idle connection.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
All the setters of call->state are now in the I/O thread and thus the state
lock is now unnecessary.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Move the call state changes that are made in rxrpc_recvmsg() to the I/O
thread. This means that, thenceforth, only the I/O thread does this and
the call state lock can be removed.
This requires the Rx phase to be ended when the last packet is received,
not when it is processed.
Since this now changes the rxrpc call state to SUCCEEDED before we've
consumed all the data from it, rxrpc_kernel_check_life() mustn't say the
call is dead until the recvmsg queue is empty (unless the call has failed).
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Move all the call state changes that are made in rxrpc_sendmsg() to the I/O
thread. This is a step towards removing the call state lock.
This requires the switch to the RXRPC_CALL_CLIENT_AWAIT_REPLY and
RXRPC_CALL_SERVER_SEND_REPLY states to be done when the last packet is
decanted from ->tx_sendmsg to ->tx_buffer in the I/O thread, not when it is
added to ->tx_sendmsg by sendmsg().
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Wrap accesses to get the state of a call from outside of the I/O thread in
a single place so that the barrier needed to order wrt the error code and
abort code is in just that place.
Also use a barrier when setting the call state and again when reading the
call state such that the auxiliary completion info (error code, abort code)
can be read without taking a read lock on the call state lock.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Split out the functions that change the state of an rxrpc call into their
own file. The idea being to remove anything to do with changing the state
of a call directly from the rxrpc sendmsg() and recvmsg() paths and have
all that done in the I/O thread only, with the ultimate aim of removing the
state lock entirely. Moving the code out of sendmsg.c and recvmsg.c makes
that easier to manage.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Use the information now stored in struct rxrpc_call to configure the
connection bundle and thence the connection, rather than using the
rxrpc_conn_parameters struct.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Offload the completion of the challenge/response cycle on a service
connection to the I/O thread. After the RESPONSE packet has been
successfully decrypted and verified by the work queue, offloading the
changing of the call states to the I/O thread makes iteration over the
conn's channel list simpler.
Do this by marking the RESPONSE skbuff and putting it onto the receive
queue for the I/O thread to collect. We put it on the front of the queue
as we've already received the packet for it.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Make the set of connection IDs per local endpoint so that endpoints don't
cause each other's connections to get dismissed.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Tidy up the abort generation infrastructure in the following ways:
(1) Create an enum and string mapping table to list the reasons an abort
might be generated in tracing.
(2) Replace the 3-char string with the values from (1) in the places that
use that to log the abort source. This gets rid of a memcpy() in the
tracepoint.
(3) Subsume the rxrpc_rx_eproto tracepoint with the rxrpc_abort tracepoint
and use values from (1) to indicate the trace reason.
(4) Always make a call to an abort function at the point of the abort
rather than stashing the values into variables and using goto to get
to a place where it reported. The C optimiser will collapse the calls
together as appropriate. The abort functions return a value that can
be returned directly if appropriate.
Note that this extends into afs also at the points where that generates an
abort. To aid with this, the afs sources need to #define
RXRPC_TRACE_ONLY_DEFINE_ENUMS before including the rxrpc tracing header
because they don't have access to the rxrpc internal structures that some
of the tracepoints make use of.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Clean up connection abort, using the connection state_lock to gate access
to change that state, and use an rxrpc_call_completion value to indicate
the difference between local and remote aborts as these can be pasted
directly into the call state.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Provide a means by which an event notification can be sent to a connection
through such that the I/O thread can pick it up and handle it rather than
doing it in a separate workqueue.
This is then used to move the deferred final ACK of a call into the I/O
thread rather than a separate work queue as part of the drive to do all
transmission from the I/O thread.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Only perform call disconnection in the I/O thread to reduce the locking
requirement.
This is the first part of a fix for a race that exists between call
connection and call disconnection whereby the data transmission code adds
the call to the peer error distribution list after the call has been
disconnected (say by the rxrpc socket getting closed).
The fix is to complete the process of moving call connection, data
transmission and call disconnection into the I/O thread and thus forcibly
serialising them.
Note that the issue may predate the overhaul to an I/O thread model that
were included in the merge window for v6.2, but the timing is very much
changed by the change given below.
Fixes: cf37b59875 ("rxrpc: Move DATA transmission into call processor work item")
Reported-by: syzbot+c22650d2844392afdcfd@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Only set the abort call completion state in the I/O thread and only
transmit ABORT packets from there. rxrpc_abort_call() can then be made to
actually send the packet.
Further, ABORT packets should only be sent if the call has been exposed to
the network (ie. at least one attempted DATA transmission has occurred for
it).
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Call the rxrpc_conn_retransmit_call() directly from rxrpc_input_packet()
rather than calling it via connection event handling.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Make the local endpoint and it's I/O thread hold a reference on a connected
call until that call is disconnected. Without this, we're reliant on
either the AF_RXRPC socket to hold a ref (which is dropped when the call is
released) or a queued work item to hold a ref (the work item is being
replaced with the I/O thread).
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Stash the network namespace pointer in the rxrpc_local struct in addition
to a pointer to the rxrpc-specific net namespace info. Use this to remove
some places where the socket is passed as a parameter.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Current release - regressions:
- bpf: fix nullness propagation for reg to reg comparisons,
avoid null-deref
- inet: control sockets should not use current thread task_frag
- bpf: always use maximal size for copy_array()
- eth: bnxt_en: don't link netdev to a devlink port for VFs
Current release - new code bugs:
- rxrpc: fix a couple of potential use-after-frees
- netfilter: conntrack: fix IPv6 exthdr error check
- wifi: iwlwifi: fw: skip PPAG for JF, avoid FW crashes
- eth: dsa: qca8k: various fixes for the in-band register access
- eth: nfp: fix schedule in atomic context when sync mc address
- eth: renesas: rswitch: fix getting mac address from device tree
- mobile: ipa: use proper endpoint mask for suspend
Previous releases - regressions:
- tcp: add TIME_WAIT sockets in bhash2, fix regression caught
by Jiri / python tests
- net: tc: don't intepret cls results when asked to drop, fix
oob-access
- vrf: determine the dst using the original ifindex for multicast
- eth: bnxt_en:
- fix XDP RX path if BPF adjusted packet length
- fix HDS (header placement) and jumbo thresholds for RX packets
- eth: ice: xsk: do not use xdp_return_frame() on tx_buf->raw_buf,
avoid memory corruptions
Previous releases - always broken:
- ulp: prevent ULP without clone op from entering the LISTEN status
- veth: fix race with AF_XDP exposing old or uninitialized descriptors
- bpf:
- pull before calling skb_postpull_rcsum() (fix checksum support
and avoid a WARN())
- fix panic due to wrong pageattr of im->image (when livepatch
and kretfunc coexist)
- keep a reference to the mm, in case the task is dead
- mptcp: fix deadlock in fastopen error path
- netfilter:
- nf_tables: perform type checking for existing sets
- nf_tables: honor set timeout and garbage collection updates
- ipset: fix hash:net,port,net hang with /0 subnet
- ipset: avoid hung task warning when adding/deleting entries
- selftests: net:
- fix cmsg_so_mark.sh test hang on non-x86 systems
- fix the arp_ndisc_evict_nocarrier test for IPv6
- usb: rndis_host: secure rndis_query check against int overflow
- eth: r8169: fix dmar pte write access during suspend/resume with WOL
- eth: lan966x: fix configuration of the PCS
- eth: sparx5: fix reading of the MAC address
- eth: qed: allow sleep in qed_mcp_trace_dump()
- eth: hns3:
- fix interrupts re-initialization after VF FLR
- fix handling of promisc when MAC addr table gets full
- refine the handling for VF heartbeat
- eth: mlx5:
- properly handle ingress QinQ-tagged packets on VST
- fix io_eq_size and event_eq_size params validation on big endian
- fix RoCE setting at HCA level if not supported at all
- don't turn CQE compression on by default for IPoIB
- eth: ena:
- fix toeplitz initial hash key value
- account for the number of XDP-processed bytes in interface stats
- fix rx_copybreak value update
Misc:
- ethtool: harden phy stat handling against buggy drivers
- docs: netdev: convert maintainer's doc from FAQ to a normal document
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmO3MLcACgkQMUZtbf5S
IrsEQBAAijPrpxsGMfX+VMqZ8RPKA3Qg8XF3ji2fSp4c0kiKv6lYI7PzPTR3u/fj
CAlhQMHv7z53uM6Zd7FdUVl23paaEycu8YnlwSubg9z+wSeh/RQ6iq94mSk1PV+K
LLVR/yop2N35Yp/oc5KZMb9fMLkxRG9Ci73QUVVYgvIrSd4Zdm13FjfVjL2C1MZH
Yp003wigMs9IkIHOpHjNqwn/5s//0yXsb1PgKxCsaMdMQsG0yC+7eyDmxshCqsji
xQm15mkGMjvWEYJaa4Tj4L3JW6lWbQzCu9nqPUX16KpmrnScr8S8Is+aifFZIBeW
GZeDYgvjSxNWodeOrJnD3X+fnbrR9+qfx7T9y7XighfytAz5DNm1LwVOvZKDgPFA
s+LlxOhzkDNEqbIsusK/LW+04EFc5gJyTI2iR6s4SSqmH3c3coJZQJeyRFWDZy/x
1oqzcCcq8SwGUTJ9g6HAmDQoVkhDWDT/ZcRKhpWG0nJub972lB2iwM7LrAu+HoHI
r8hyCkHpOi5S3WZKI9gPiGD+yOlpVAuG2wHg2IpjhKQvtd9DFUChGDhFeoB2rqJf
9uI3RJBBYTDkeNu3kpfy5uMh2XhvbIZntK5kwpJ4VettZWFMaOAzn7KNqk8iT4gJ
ASMrUrX59X0TAN0MgpJJm7uGtKbKZOu4lHNm74TUxH7V7bYn7dk=
=TlcN
-----END PGP SIGNATURE-----
Merge tag 'net-6.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf, wifi, and netfilter.
Current release - regressions:
- bpf: fix nullness propagation for reg to reg comparisons, avoid
null-deref
- inet: control sockets should not use current thread task_frag
- bpf: always use maximal size for copy_array()
- eth: bnxt_en: don't link netdev to a devlink port for VFs
Current release - new code bugs:
- rxrpc: fix a couple of potential use-after-frees
- netfilter: conntrack: fix IPv6 exthdr error check
- wifi: iwlwifi: fw: skip PPAG for JF, avoid FW crashes
- eth: dsa: qca8k: various fixes for the in-band register access
- eth: nfp: fix schedule in atomic context when sync mc address
- eth: renesas: rswitch: fix getting mac address from device tree
- mobile: ipa: use proper endpoint mask for suspend
Previous releases - regressions:
- tcp: add TIME_WAIT sockets in bhash2, fix regression caught by
Jiri / python tests
- net: tc: don't intepret cls results when asked to drop, fix
oob-access
- vrf: determine the dst using the original ifindex for multicast
- eth: bnxt_en:
- fix XDP RX path if BPF adjusted packet length
- fix HDS (header placement) and jumbo thresholds for RX packets
- eth: ice: xsk: do not use xdp_return_frame() on tx_buf->raw_buf,
avoid memory corruptions
Previous releases - always broken:
- ulp: prevent ULP without clone op from entering the LISTEN status
- veth: fix race with AF_XDP exposing old or uninitialized
descriptors
- bpf:
- pull before calling skb_postpull_rcsum() (fix checksum support
and avoid a WARN())
- fix panic due to wrong pageattr of im->image (when livepatch and
kretfunc coexist)
- keep a reference to the mm, in case the task is dead
- mptcp: fix deadlock in fastopen error path
- netfilter:
- nf_tables: perform type checking for existing sets
- nf_tables: honor set timeout and garbage collection updates
- ipset: fix hash:net,port,net hang with /0 subnet
- ipset: avoid hung task warning when adding/deleting entries
- selftests: net:
- fix cmsg_so_mark.sh test hang on non-x86 systems
- fix the arp_ndisc_evict_nocarrier test for IPv6
- usb: rndis_host: secure rndis_query check against int overflow
- eth: r8169: fix dmar pte write access during suspend/resume with
WOL
- eth: lan966x: fix configuration of the PCS
- eth: sparx5: fix reading of the MAC address
- eth: qed: allow sleep in qed_mcp_trace_dump()
- eth: hns3:
- fix interrupts re-initialization after VF FLR
- fix handling of promisc when MAC addr table gets full
- refine the handling for VF heartbeat
- eth: mlx5:
- properly handle ingress QinQ-tagged packets on VST
- fix io_eq_size and event_eq_size params validation on big endian
- fix RoCE setting at HCA level if not supported at all
- don't turn CQE compression on by default for IPoIB
- eth: ena:
- fix toeplitz initial hash key value
- account for the number of XDP-processed bytes in interface stats
- fix rx_copybreak value update
Misc:
- ethtool: harden phy stat handling against buggy drivers
- docs: netdev: convert maintainer's doc from FAQ to a normal
document"
* tag 'net-6.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (112 commits)
caif: fix memory leak in cfctrl_linkup_request()
inet: control sockets should not use current thread task_frag
net/ulp: prevent ULP without clone op from entering the LISTEN status
qed: allow sleep in qed_mcp_trace_dump()
MAINTAINERS: Update maintainers for ptp_vmw driver
usb: rndis_host: Secure rndis_query check against int overflow
net: dpaa: Fix dtsec check for PCS availability
octeontx2-pf: Fix lmtst ID used in aura free
drivers/net/bonding/bond_3ad: return when there's no aggregator
netfilter: ipset: Rework long task execution when adding/deleting entries
netfilter: ipset: fix hash:net,port,net hang with /0 subnet
net: sparx5: Fix reading of the MAC address
vxlan: Fix memory leaks in error path
net: sched: htb: fix htb_classify() kernel-doc
net: sched: cbq: dont intepret cls results when asked to drop
net: sched: atm: dont intepret cls results when asked to drop
dt-bindings: net: marvell,orion-mdio: Fix examples
dt-bindings: net: sun8i-emac: Add phy-supply property
net: ipa: use proper endpoint mask for suspend
selftests: net: return non-zero for failures reported in arp_ndisc_evict_nocarrier
...
When linktype is unknown or kzalloc failed in cfctrl_linkup_request(),
pkt is not released. Add release process to error path.
Fixes: b482cd2053 ("net-caif: add CAIF core protocol stack")
Fixes: 8d545c8f95 ("caif: Disconnect without waiting for response")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230104065146.1153009-1-shaozhengchao@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Because ICMP handlers run from softirq contexts,
they must not use current thread task_frag.
Previously, all sockets allocated by inet_ctl_sock_create()
would use the per-socket page fragment, with no chance of
recursion.
Fixes: 98123866fc ("Treewide: Stop corrupting socket's task_frag")
Reported-by: syzbot+bebc6f1acdf4cbb79b03@syzkaller.appspotmail.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Benjamin Coddington <bcodding@redhat.com>
Acked-by: Guillaume Nault <gnault@redhat.com>
Link: https://lore.kernel.org/r/20230103192736.454149-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When an ULP-enabled socket enters the LISTEN status, the listener ULP data
pointer is copied inside the child/accepted sockets by sk_clone_lock().
The relevant ULP can take care of de-duplicating the context pointer via
the clone() operation, but only MPTCP and SMC implement such op.
Other ULPs may end-up with a double-free at socket disposal time.
We can't simply clear the ULP data at clone time, as TLS replaces the
socket ops with custom ones assuming a valid TLS ULP context is
available.
Instead completely prevent clone-less ULP sockets from entering the
LISTEN status.
Fixes: 734942cc4e ("tcp: ULP infrastructure")
Reported-by: slipper <slipper.alive@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/4b80c3d1dbe3d0ab072f80450c202d9bc88b4b03.1672740602.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Use signed integer in ipv6_skip_exthdr() called from nf_confirm().
Reported by static analysis tooling, patch from Florian Westphal.
2) Missing set type checks in nf_tables: Validate that set declaration
matches the an existing set type, otherwise bail out with EEXIST.
Currently, nf_tables silently accepts the re-declaration with a
different type but it bails out later with EINVAL when the user adds
entries to the set. This fix is relatively large because it requires
two preparation patches that are included in this batch.
3) Do not ignore updates of timeout and gc_interval parameters in
existing sets.
4) Fix a hang when 0/0 subnets is added to a hash:net,port,net type of
ipset. Except hash:net,port,net and hash:net,iface, the set types don't
support 0/0 and the auxiliary functions rely on this fact. So 0/0 needs
a special handling in hash:net,port,net which was missing (hash:net,iface
was not affected by this bug), from Jozsef Kadlecsik.
5) When adding/deleting large number of elements in one step in ipset,
it can take a reasonable amount of time and can result in soft lockup
errors. This patch is a complete rework of the previous version in order
to use a smaller internal batch limit and at the same time removing
the external hard limit to add arbitrary number of elements in one step.
Also from Jozsef Kadlecsik.
Except for patch #1, which fixes a bug introduced in the previous net-next
development cycle, anything else has been broken for several releases.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When adding/deleting large number of elements in one step in ipset, it can
take a reasonable amount of time and can result in soft lockup errors. The
patch 5f7b51bf09 ("netfilter: ipset: Limit the maximal range of
consecutive elements to add/delete") tried to fix it by limiting the max
elements to process at all. However it was not enough, it is still possible
that we get hung tasks. Lowering the limit is not reasonable, so the
approach in this patch is as follows: rely on the method used at resizing
sets and save the state when we reach a smaller internal batch limit,
unlock/lock and proceed from the saved state. Thus we can avoid long
continuous tasks and at the same time removed the limit to add/delete large
number of elements in one step.
The nfnl mutex is held during the whole operation which prevents one to
issue other ipset commands in parallel.
Fixes: 5f7b51bf09 ("netfilter: ipset: Limit the maximal range of consecutive elements to add/delete")
Reported-by: syzbot+9204e7399656300bf271@syzkaller.appspotmail.com
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The hash:net,port,net set type supports /0 subnets. However, the patch
commit 5f7b51bf09 titled "netfilter: ipset: Limit the maximal range
of consecutive elements to add/delete" did not take into account it and
resulted in an endless loop. The bug is actually older but the patch
5f7b51bf09 brings it out earlier.
Handle /0 subnets properly in hash:net,port,net set types.
Fixes: 5f7b51bf09 ("netfilter: ipset: Limit the maximal range of consecutive elements to add/delete")
Reported-by: Марк Коренберг <socketpair@gmail.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fix W=1 kernel-doc warning:
net/sched/sch_htb.c:214: warning: expecting prototype for htb_classify(). Prototype was for HTB_DIRECT() instead
by moving the HTB_DIRECT() macro above the function.
Add kernel-doc notation for function parameters as well.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If asked to drop a packet via TC_ACT_SHOT it is unsafe to assume
res.class contains a valid pointer
Fixes: b0188d4dbe ("[NET_SCHED]: sch_atm: Lindent")
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Slaby reported regression of bind() with a simple repro. [0]
The repro creates a TIME_WAIT socket and tries to bind() a new socket
with the same local address and port. Before commit 28044fc1d4 ("net:
Add a bhash2 table hashed by port and address"), the bind() failed with
-EADDRINUSE, but now it succeeds.
The cited commit should have put TIME_WAIT sockets into bhash2; otherwise,
inet_bhash2_conflict() misses TIME_WAIT sockets when validating bind()
requests if the address is not a wildcard one.
The straight option is to move sk_bind2_node from struct sock to struct
sock_common to add twsk to bhash2 as implemented as RFC. [1] However, the
binary layout change in the struct sock could affect performances moving
hot fields on different cachelines.
To avoid that, we add another TIME_WAIT list in inet_bind2_bucket and check
it while validating bind().
[0]: https://lore.kernel.org/netdev/6b971a4e-c7d8-411e-1f92-fda29b5b2fb9@kernel.org/
[1]: https://lore.kernel.org/netdev/20221221151258.25748-2-kuniyu@amazon.com/
Fixes: 28044fc1d4 ("net: Add a bhash2 table hashed by port and address")
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that it's easier to follow and make sense of the branching and
various conditions.
Stats retrieval has been split into two separate functions
ethtool_get_phy_stats_phydev & ethtool_get_phy_stats_ethtool.
The former attempts to retrieve the stats using phydev & phy_ops, while
the latter uses ethtool_ops.
Actual n_stats validation & array allocation has been moved into a new
ethtool_vzalloc_stats_array helper.
This also fixes a potential NULL dereference of
ops->get_ethtool_phy_stats where it was getting called in an else branch
unconditionally without making sure it was actually present.
Found by Linux Verification Center (linuxtesting.org) with the SVACE
static analysis tool.
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that we always early return if we don't have any stats we can remove
these checks as they're no longer necessary.
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's not very useful to copy back an empty ethtool_stats struct and
return 0 if we didn't actually have any stats. This also allows for
further simplification of this function in the future commits.
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
At the end of rxrpc_recvmsg(), if a call is found, the call is put and then
a trace line is emitted referencing that call in a couple of places - but
the call may have been deallocated by the time those traces happen.
Fix this by stashing the call debug_id in a variable and passing that to
the tracepoint rather than the call pointer.
Fixes: 849979051c ("rxrpc: Add a tracepoint to follow what recvmsg does")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
nfc_get_device() take reference for the device, add missing
nfc_put_device() to release it when not need anymore.
Also fix the style warnning by use error EOPNOTSUPP instead of
ENOTSUPP.
Fixes: 5ce3f32b52 ("NFC: netlink: SE API implementation")
Fixes: 29e76924cf ("nfc: netlink: Add capability to reply to vendor_cmd with data")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Syzkaller reports a memory leak as follows:
====================================
BUG: memory leak
unreferenced object 0xffff88810c287f00 (size 256):
comm "syz-executor105", pid 3600, jiffies 4294943292 (age 12.990s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff814cf9f0>] kmalloc_trace+0x20/0x90 mm/slab_common.c:1046
[<ffffffff839c9e07>] kmalloc include/linux/slab.h:576 [inline]
[<ffffffff839c9e07>] kmalloc_array include/linux/slab.h:627 [inline]
[<ffffffff839c9e07>] kcalloc include/linux/slab.h:659 [inline]
[<ffffffff839c9e07>] tcf_exts_init include/net/pkt_cls.h:250 [inline]
[<ffffffff839c9e07>] tcindex_set_parms+0xa7/0xbe0 net/sched/cls_tcindex.c:342
[<ffffffff839caa1f>] tcindex_change+0xdf/0x120 net/sched/cls_tcindex.c:553
[<ffffffff8394db62>] tc_new_tfilter+0x4f2/0x1100 net/sched/cls_api.c:2147
[<ffffffff8389e91c>] rtnetlink_rcv_msg+0x4dc/0x5d0 net/core/rtnetlink.c:6082
[<ffffffff839eba67>] netlink_rcv_skb+0x87/0x1d0 net/netlink/af_netlink.c:2540
[<ffffffff839eab87>] netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
[<ffffffff839eab87>] netlink_unicast+0x397/0x4c0 net/netlink/af_netlink.c:1345
[<ffffffff839eb046>] netlink_sendmsg+0x396/0x710 net/netlink/af_netlink.c:1921
[<ffffffff8383e796>] sock_sendmsg_nosec net/socket.c:714 [inline]
[<ffffffff8383e796>] sock_sendmsg+0x56/0x80 net/socket.c:734
[<ffffffff8383eb08>] ____sys_sendmsg+0x178/0x410 net/socket.c:2482
[<ffffffff83843678>] ___sys_sendmsg+0xa8/0x110 net/socket.c:2536
[<ffffffff838439c5>] __sys_sendmmsg+0x105/0x330 net/socket.c:2622
[<ffffffff83843c14>] __do_sys_sendmmsg net/socket.c:2651 [inline]
[<ffffffff83843c14>] __se_sys_sendmmsg net/socket.c:2648 [inline]
[<ffffffff83843c14>] __x64_sys_sendmmsg+0x24/0x30 net/socket.c:2648
[<ffffffff84605fd5>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<ffffffff84605fd5>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
[<ffffffff84800087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
====================================
Kernel uses tcindex_change() to change an existing
filter properties.
Yet the problem is that, during the process of changing,
if `old_r` is retrieved from `p->perfect`, then
kernel uses tcindex_alloc_perfect_hash() to newly
allocate filter results, uses tcindex_filter_result_init()
to clear the old filter result, without destroying
its tcf_exts structure, which triggers the above memory leak.
To be more specific, there are only two source for the `old_r`,
according to the tcindex_lookup(). `old_r` is retrieved from
`p->perfect`, or `old_r` is retrieved from `p->h`.
* If `old_r` is retrieved from `p->perfect`, kernel uses
tcindex_alloc_perfect_hash() to newly allocate the
filter results. Then `r` is assigned with `cp->perfect + handle`,
which is newly allocated. So condition `old_r && old_r != r` is
true in this situation, and kernel uses tcindex_filter_result_init()
to clear the old filter result, without destroying
its tcf_exts structure
* If `old_r` is retrieved from `p->h`, then `p->perfect` is NULL
according to the tcindex_lookup(). Considering that `cp->h`
is directly copied from `p->h` and `p->perfect` is NULL,
`r` is assigned with `tcindex_lookup(cp, handle)`, whose value
should be the same as `old_r`, so condition `old_r && old_r != r`
is false in this situation, kernel ignores using
tcindex_filter_result_init() to clear the old filter result.
So only when `old_r` is retrieved from `p->perfect` does kernel use
tcindex_filter_result_init() to clear the old filter result, which
triggers the above memory leak.
Considering that there already exists a tc_filter_wq workqueue
to destroy the old tcindex_data by tcindex_partial_destroy_work()
at the end of tcindex_set_parms(), this patch solves
this memory leak bug by removing this old filter result
clearing part and delegating it to the tc_filter_wq workqueue.
Note that this patch doesn't introduce any other issues. If
`old_r` is retrieved from `p->perfect`, this patch just
delegates old filter result clearing part to the
tc_filter_wq workqueue; If `old_r` is retrieved from `p->h`,
kernel doesn't reach the old filter result clearing part, so
removing this part has no effect.
[Thanks to the suggestion from Jakub Kicinski, Cong Wang, Paolo Abeni
and Dmitry Vyukov]
Fixes: b9a24bb76b ("net_sched: properly handle failure case of tcf_exts_init()")
Link: https://lore.kernel.org/all/0000000000001de5c505ebc9ec59@google.com/
Reported-by: syzbot+232ebdbd36706c965ebf@syzkaller.appspotmail.com
Tested-by: syzbot+232ebdbd36706c965ebf@syzkaller.appspotmail.com
Cc: Cong Wang <cong.wang@bytedance.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Hawkins Jiawei <yin31149@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Due to several bugs caused by timers being re-armed after they are
shutdown and just before they are freed, a new state of timers was added
called "shutdown". After a timer is set to this state, then it can no
longer be re-armed.
The following script was run to find all the trivial locations where
del_timer() or del_timer_sync() is called in the same function that the
object holding the timer is freed. It also ignores any locations where
the timer->function is modified between the del_timer*() and the free(),
as that is not considered a "trivial" case.
This was created by using a coccinelle script and the following
commands:
$ cat timer.cocci
@@
expression ptr, slab;
identifier timer, rfield;
@@
(
- del_timer(&ptr->timer);
+ timer_shutdown(&ptr->timer);
|
- del_timer_sync(&ptr->timer);
+ timer_shutdown_sync(&ptr->timer);
)
... when strict
when != ptr->timer
(
kfree_rcu(ptr, rfield);
|
kmem_cache_free(slab, ptr);
|
kfree(ptr);
)
$ spatch timer.cocci . > /tmp/t.patch
$ patch -p1 < /tmp/t.patch
Link: https://lore.kernel.org/lkml/20221123201306.823305113@linutronix.de/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Pavel Machek <pavel@ucw.cz> [ LED ]
Acked-by: Kalle Valo <kvalo@kernel.org> [ wireless ]
Acked-by: Paolo Abeni <pabeni@redhat.com> [ networking ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY6YkXgAKCRDbK58LschI
g25kAP4jYi+YomSlmGUzN/fUbEIHkXXyh85Yh2/yHGYdVuIuvwEA0uXeC7JHQTca
dkcyYvgY6zJwFBV0lAVnhTRzFirFkQk=
=THs1
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
The following pull-request contains BPF updates for your *net* tree.
We've added 7 non-merge commits during the last 5 day(s) which contain
a total of 11 files changed, 231 insertions(+), 3 deletions(-).
The main changes are:
1) Fix a splat in bpf_skb_generic_pop() under CHECKSUM_PARTIAL due to
misuse of skb_postpull_rcsum(), from Jakub Kicinski with test case
from Martin Lau.
2) Fix BPF verifier's nullness propagation when registers are of
type PTR_TO_BTF_ID, from Hao Sun.
3) Fix bpftool build for JIT disassembler under statically built
libllvm, from Anton Protopopov.
4) Fix warnings reported by resolve_btfids when building vmlinux
with CONFIG_SECURITY_NETWORK disabled, from Hou Tao.
5) Minor fix up for BPF selftest gitignore, from Stanislav Fomichev.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
- improve p9_check_errors to check buffer size instead of msize when possible
(e.g. not zero-copy)
- some more syzbot and KCSAN fixes
- minor headers include cleanup
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmOljPwACgkQq06b7GqY
5nDRjw//aJU+tdcKCMije/ul4hMWDlvMwxn7x6p0ELdomefs+ykS/knBxXSVIoEs
PrbVJVZVqOOOAn/IwWe8cMBD+hal0fLUErRbfrtzmOdkiF7z8PavJ209OeJLKBgD
ffL+bq6FhcVC6jVXcwVHoZkX9bb4pnM7/lsJrO0UjBw+fT3ceqtK0vsTa+R2xEOj
9lOS5124u69GVa9UvwQzqHko+UUx5T6XlULZYjNBEdtJqGULGi2oAABrae64R3N2
auaj5LRKzAFOx4zkJ+crCH1h08uZ4bfTyCHpfCeTHwWb1duKD3u4jMq9PhdetF4E
A6NYnOdeMxbV/sZfFOjjNWQrzP1TQJLmF6IVGSZkVQrlCjrZh7xQ5dr/AHrKr6be
U+NXb0UCmAS6/Gs7Sxq5jnihDHzJ4rYG+oFdYdNrwPrrpQXsYmmRh+bm61m/t40T
2JxBIiSt2KWL487AHsKisb6OsiH65N1ojntO5QJObZId4UdnhFJU6OaAzqv0Cojv
mqKlZ0UPyxICXNCL227w+SdDFgK25efdLF1Z1547hS5DO0+43oWAtnvd3KrRpjZ6
CmV9ARvdhHt49lNedbxmJAre5FusJQLeULuRzhMbd4mdcG7mKAmGTdM3u+AlFRIu
Te1ZotTJXxs16Yn/whWRShAooUnK9FbXzC3kViiibziYZlCfK+s=
=xLkl
-----END PGP SIGNATURE-----
Merge tag '9p-for-6.2-rc1' of https://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
- improve p9_check_errors to check buffer size instead of msize when
possible (e.g. not zero-copy)
- some more syzbot and KCSAN fixes
- minor headers include cleanup
* tag '9p-for-6.2-rc1' of https://github.com/martinetd/linux:
9p/client: fix data race on req->status
net/9p: fix response size check in p9_check_errors()
net/9p: distinguish zero-copy requests
9p/xen: do not memcpy header into req->rc
9p: set req refcount to zero to avoid uninitialized usage
9p/net: Remove unneeded idr.h #include
9p/fs: Remove unneeded idr.h #include
Set timeout and garbage collection interval updates are ignored on
updates. Add transaction to update global set element timeout and
garbage collection interval.
Fixes: 96518518cc ("netfilter: add nftables")
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
MattB reported a lockdep splat in the mptcp listener code cleanup:
WARNING: possible circular locking dependency detected
packetdrill/14278 is trying to acquire lock:
ffff888017d868f0 ((work_completion)(&msk->work)){+.+.}-{0:0}, at: __flush_work (kernel/workqueue.c:3069)
but task is already holding lock:
ffff888017d84130 (sk_lock-AF_INET){+.+.}-{0:0}, at: mptcp_close (net/mptcp/protocol.c:2973)
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (sk_lock-AF_INET){+.+.}-{0:0}:
__lock_acquire (kernel/locking/lockdep.c:5055)
lock_acquire (kernel/locking/lockdep.c:466)
lock_sock_nested (net/core/sock.c:3463)
mptcp_worker (net/mptcp/protocol.c:2614)
process_one_work (kernel/workqueue.c:2294)
worker_thread (include/linux/list.h:292)
kthread (kernel/kthread.c:376)
ret_from_fork (arch/x86/entry/entry_64.S:312)
-> #0 ((work_completion)(&msk->work)){+.+.}-{0:0}:
check_prev_add (kernel/locking/lockdep.c:3098)
validate_chain (kernel/locking/lockdep.c:3217)
__lock_acquire (kernel/locking/lockdep.c:5055)
lock_acquire (kernel/locking/lockdep.c:466)
__flush_work (kernel/workqueue.c:3070)
__cancel_work_timer (kernel/workqueue.c:3160)
mptcp_cancel_work (net/mptcp/protocol.c:2758)
mptcp_subflow_queue_clean (net/mptcp/subflow.c:1817)
__mptcp_close_ssk (net/mptcp/protocol.c:2363)
mptcp_destroy_common (net/mptcp/protocol.c:3170)
mptcp_destroy (include/net/sock.h:1495)
__mptcp_destroy_sock (net/mptcp/protocol.c:2886)
__mptcp_close (net/mptcp/protocol.c:2959)
mptcp_close (net/mptcp/protocol.c:2974)
inet_release (net/ipv4/af_inet.c:432)
__sock_release (net/socket.c:651)
sock_close (net/socket.c:1367)
__fput (fs/file_table.c:320)
task_work_run (kernel/task_work.c:181 (discriminator 1))
exit_to_user_mode_prepare (include/linux/resume_user_mode.h:49)
syscall_exit_to_user_mode (kernel/entry/common.c:130)
do_syscall_64 (arch/x86/entry/common.c:87)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(sk_lock-AF_INET);
lock((work_completion)(&msk->work));
lock(sk_lock-AF_INET);
lock((work_completion)(&msk->work));
*** DEADLOCK ***
The report is actually a false positive, since the only existing lock
nesting is the msk socket lock acquired by the mptcp work.
cancel_work_sync() is invoked without the relevant socket lock being
held, but under a different (the msk listener) socket lock.
We could silence the splat adding a per workqueue dynamic lockdep key,
but that looks overkill. Instead just tell lockdep the msk socket lock
is not held around cancel_work_sync().
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/322
Fixes: 30e51b923e ("mptcp: fix unreleased socket in accept queue")
Reported-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
MatM reported a deadlock at fastopening time:
INFO: task syz-executor.0:11454 blocked for more than 143 seconds.
Tainted: G S 6.1.0-rc5-03226-gdb0157db5153 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor.0 state:D stack:25104 pid:11454 ppid:424 flags:0x00004006
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5191 [inline]
__schedule+0x5c2/0x1550 kernel/sched/core.c:6503
schedule+0xe8/0x1c0 kernel/sched/core.c:6579
__lock_sock+0x142/0x260 net/core/sock.c:2896
lock_sock_nested+0xdb/0x100 net/core/sock.c:3466
__mptcp_close_ssk+0x1a3/0x790 net/mptcp/protocol.c:2328
mptcp_destroy_common+0x16a/0x650 net/mptcp/protocol.c:3171
mptcp_disconnect+0xb8/0x450 net/mptcp/protocol.c:3019
__inet_stream_connect+0x897/0xa40 net/ipv4/af_inet.c:720
tcp_sendmsg_fastopen+0x3dd/0x740 net/ipv4/tcp.c:1200
mptcp_sendmsg_fastopen net/mptcp/protocol.c:1682 [inline]
mptcp_sendmsg+0x128a/0x1a50 net/mptcp/protocol.c:1721
inet6_sendmsg+0x11f/0x150 net/ipv6/af_inet6.c:663
sock_sendmsg_nosec net/socket.c:714 [inline]
sock_sendmsg+0xf7/0x190 net/socket.c:734
____sys_sendmsg+0x336/0x970 net/socket.c:2476
___sys_sendmsg+0x122/0x1c0 net/socket.c:2530
__sys_sendmmsg+0x18d/0x460 net/socket.c:2616
__do_sys_sendmmsg net/socket.c:2645 [inline]
__se_sys_sendmmsg net/socket.c:2642 [inline]
__x64_sys_sendmmsg+0x9d/0x110 net/socket.c:2642
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f5920a75e7d
RSP: 002b:00007f59201e8028 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 00007f5920bb4f80 RCX: 00007f5920a75e7d
RDX: 0000000000000001 RSI: 0000000020002940 RDI: 0000000000000005
RBP: 00007f5920ae7593 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000020004050 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000000b R14: 00007f5920bb4f80 R15: 00007f59201c8000
</TASK>
In the error path, tcp_sendmsg_fastopen() ends-up calling
mptcp_disconnect(), and the latter tries to close each
subflow, acquiring the socket lock on each of them.
At fastopen time, we have a single subflow, and such subflow
socket lock is already held by the called, causing the deadlock.
We already track the 'fastopen in progress' status inside the msk
socket. Use it to address the issue, making mptcp_disconnect() a
no op when invoked from the fastopen (error) path and doing the
relevant cleanup after releasing the subflow socket lock.
While at the above, rename the fastopen status bit to something
more meaningful.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/321
Fixes: fa9e57468a ("mptcp: fix abba deadlock on fastopen")
Reported-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A recent commit introducing upcall packet accounting failed to properly
release the vport object when the per-cpu stats struct couldn't be
allocated. This can cause dangling pointers to dp objects long after
they've been released.
Cc: wangchuanlei <wangchuanlei@inspur.com>
Fixes: 1933ea365a ("net: openvswitch: Add support to count upcall packets")
Reported-by: syzbot+8f4e2dcfcb3209ac35f9@syzkaller.appspotmail.com
Signed-off-by: Aaron Conole <aconole@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://lore.kernel.org/r/20221220212717.526780-1-aconole@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If a ruleset declares a set name that matches an existing set in the
kernel, then validate that this declaration really refers to the same
set, otherwise bail out with EEXIST.
Currently, the kernel reports success when adding a set that already
exists in the kernel. This usually results in EINVAL errors at a later
stage, when the user adds elements to the set, if the set declaration
mismatches the existing set representation in the kernel.
Add a new function to check that the set declaration really refers to
the same existing set in the kernel.
Fixes: 96518518cc ("netfilter: add nftables")
Reported-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add a helper function to allocate and initialize the stateful expressions
that are defined in a set.
This patch allows to reuse this code from the set update path, to check
that type of the update matches the existing set in the kernel.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add the following fields to the set description:
- key type
- data type
- object type
- policy
- gc_int: garbage collection interval)
- timeout: element timeout
This prepares for stricter set type checks on updates in a follow up
patch.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
smatch warnings:
net/netfilter/nf_conntrack_proto.c:167 nf_confirm() warn: unsigned 'protoff' is never less than zero.
We need to check if ipv6_skip_exthdr() returned an error, but protoff is
unsigned. Use a signed integer for this.
Fixes: a70e483460 ("netfilter: conntrack: merge ipv4+ipv6 confirm functions")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Anand hit a BUG() when pulling off headers on egress to a SW tunnel.
We get to skb_checksum_help() with an invalid checksum offset
(commit d7ea0d9df2 ("net: remove two BUG() from skb_checksum_help()")
converted those BUGs to WARN_ONs()).
He points out oddness in how skb_postpull_rcsum() gets used.
Indeed looks like we should pull before "postpull", otherwise
the CHECKSUM_PARTIAL fixup from skb_postpull_rcsum() will not
be able to do its job:
if (skb->ip_summed == CHECKSUM_PARTIAL &&
skb_checksum_start_offset(skb) < 0)
skb->ip_summed = CHECKSUM_NONE;
Reported-by: Anand Parthasarathy <anpartha@meta.com>
Fixes: 6578171a7f ("bpf: add bpf_skb_change_proto helper")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20221220004701.402165-1-kuba@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Convert the final two users of prandom_u32_max() that slipped in during
6.2-rc1 to use get_random_u32_below().
Then, with no more users left, we can finally remove the deprecated
function.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the
GFP_NOIO flag on sk_allocation which the networking system uses to decide
when it is safe to use current->task_frag. The results of this are
unexpected corruption in task_frag when SUNRPC is involved in memory
reclaim.
The corruption can be seen in crashes, but the root cause is often
difficult to ascertain as a crashing machine's stack trace will have no
evidence of being near NFS or SUNRPC code. I believe this problem to
be much more pervasive than reports to the community may indicate.
Fix this by having kernel users of sockets that may corrupt task_frag due
to reclaim set sk_use_task_frag = false. Preemptively correcting this
situation for users that still set sk_allocation allows them to convert to
memalloc_nofs_save/restore without the same unexpected corruptions that are
sure to follow, unlikely to show up in testing, and difficult to bisect.
CC: Philipp Reisner <philipp.reisner@linbit.com>
CC: Lars Ellenberg <lars.ellenberg@linbit.com>
CC: "Christoph Böhmwalder" <christoph.boehmwalder@linbit.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Josef Bacik <josef@toxicpanda.com>
CC: Keith Busch <kbusch@kernel.org>
CC: Christoph Hellwig <hch@lst.de>
CC: Sagi Grimberg <sagi@grimberg.me>
CC: Lee Duncan <lduncan@suse.com>
CC: Chris Leech <cleech@redhat.com>
CC: Mike Christie <michael.christie@oracle.com>
CC: "James E.J. Bottomley" <jejb@linux.ibm.com>
CC: "Martin K. Petersen" <martin.petersen@oracle.com>
CC: Valentina Manea <valentina.manea.m@gmail.com>
CC: Shuah Khan <shuah@kernel.org>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: David Howells <dhowells@redhat.com>
CC: Marc Dionne <marc.dionne@auristor.com>
CC: Steve French <sfrench@samba.org>
CC: Christine Caulfield <ccaulfie@redhat.com>
CC: David Teigland <teigland@redhat.com>
CC: Mark Fasheh <mark@fasheh.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: Joseph Qi <joseph.qi@linux.alibaba.com>
CC: Eric Van Hensbergen <ericvh@gmail.com>
CC: Latchesar Ionkov <lucho@ionkov.net>
CC: Dominique Martinet <asmadeus@codewreck.org>
CC: Ilya Dryomov <idryomov@gmail.com>
CC: Xiubo Li <xiubli@redhat.com>
CC: Chuck Lever <chuck.lever@oracle.com>
CC: Jeff Layton <jlayton@kernel.org>
CC: Trond Myklebust <trond.myklebust@hammerspace.com>
CC: Anna Schumaker <anna@kernel.org>
CC: Steffen Klassert <steffen.klassert@secunet.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sockets that can be used while recursing into memory reclaim, like
those used by network block devices and file systems, mustn't use
current->task_frag: if the current process is already using it, then
the inner memory reclaim call would corrupt the task_frag structure.
To avoid this, sk_page_frag() uses ->sk_allocation to detect sockets
that mustn't use current->task_frag, assuming that those used during
memory reclaim had their allocation constraints reflected in
->sk_allocation.
This unfortunately doesn't cover all cases: in an attempt to remove all
usage of GFP_NOFS and GFP_NOIO, sunrpc stopped setting these flags in
->sk_allocation, and used memalloc_nofs critical sections instead.
This breaks the sk_page_frag() heuristic since the allocation
constraints are now stored in current->flags, which sk_page_frag()
can't read without risking triggering a cache miss and slowing down
TCP's fast path.
This patch creates a new field in struct sock, named sk_use_task_frag,
which sockets with memory reclaim constraints can set to false if they
can't safely use current->task_frag. In such cases, sk_page_frag() now
always returns the socket's page_frag (->sk_frag). The first user is
sunrpc, which needs to avoid using current->task_frag but can keep
->sk_allocation set to GFP_KERNEL otherwise.
Eventually, it might be possible to simplify sk_page_frag() by only
testing ->sk_use_task_frag and avoid relying on the ->sk_allocation
heuristic entirely (assuming other sockets will set ->sk_use_task_frag
according to their constraints in the future).
The new ->sk_use_task_frag field is placed in a hole in struct sock and
belongs to a cache line shared with ->sk_shutdown. Therefore it should
be hot and shouldn't have negative performance impacts on TCP's fast
path (sk_shutdown is tested just before the while() loop in
tcp_sendmsg_locked()).
Link: https://lore.kernel.org/netdev/b4d8cb09c913d3e34f853736f3f5628abfd7f4b6.1656699567.git.gnault@redhat.com/
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The unregister check could be incorrectly triggered if a netdev
changes its type after register. That is possible for a tun device
using TUNSETLINK ioctl, resulting in mctp unregister failing
and the netdev unregister waiting forever.
This was encountered by https://github.com/openthread/openthread/issues/8523
Neither check at register or unregister is required. They were added in
an attempt to track down mctp_ptr being set unexpectedly, which should
not happen in normal operation.
Fixes: 7b1871af75 ("mctp: Warn if pointer is set for a wrong dev type")
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
Link: https://lore.kernel.org/r/20221215054933.2403401-1-matt@codeconstruct.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Changheon Lee reported TCP socket leaks, with a nice repro.
It seems we leak TCP sockets with the following sequence:
1) SOF_TIMESTAMPING_TX_ACK is enabled on the socket.
Each ACK will cook an skb put in error queue, from __skb_tstamp_tx().
__skb_tstamp_tx() is using skb_clone(), unless
SOF_TIMESTAMPING_OPT_TSONLY was also requested.
2) If the application is also using MSG_ZEROCOPY, then we put in the
error queue cloned skbs that had a struct ubuf_info attached to them.
Whenever an struct ubuf_info is allocated, sock_zerocopy_alloc()
does a sock_hold().
As long as the cloned skbs are still in sk_error_queue,
socket refcount is kept elevated.
3) Application closes the socket, while error queue is not empty.
Since tcp_close() no longer purges the socket error queue,
we might end up with a TCP socket with at least one skb in
error queue keeping the socket alive forever.
This bug can be (ab)used to consume all kernel memory
and freeze the host.
We need to purge the error queue, with proper synchronization
against concurrent writers.
Fixes: 24bcbe1cc6 ("net: stream: don't purge sk_error_queue in sk_stream_kill_queues()")
Reported-by: Changheon Lee <darklight2357@icloud.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 4d1c7d8703 ("mac802154: Move an skb free within the rx path")
tried to simplify error handling within the receive path by moving the
kfree_skb() call at the very end of the top-level function but missed
one kfree_skb() called upon frame parsing error. Prevent this possible
double free from happening.
Fixes: 4d1c7d8703 ("mac802154: Move an skb free within the rx path")
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Link: https://lore.kernel.org/r/20221216235742.646134-1-miquel.raynal@bootlin.com
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Dan Carpenter sayeth[1]:
The patch 5e6ef4f101: "rxrpc: Make the I/O thread take over the
call and local processor work" from Jan 23, 2020, leads to the
following Smatch static checker warning:
net/rxrpc/io_thread.c:283 rxrpc_input_packet()
warn: bool is not less than zero.
Fix this (for now) by changing rxrpc_new_incoming_call() to return an int
with 0 or error code rather than bool. Note that the actual return value
of rxrpc_input_packet() is currently ignored. I have a separate patch to
clean that up.
Fixes: 5e6ef4f101 ("rxrpc: Make the I/O thread take over the call and local processor work")
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: http://lists.infradead.org/pipermail/linux-afs/2022-December/006123.html [1]
Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter sayeth[1]:
The patch 75bfdbf2fc: "rxrpc: Implement an in-kernel rxperf server
for testing purposes" from Nov 3, 2022, leads to the following Smatch
static checker warning:
net/rxrpc/rxperf.c:337 rxperf_deliver_to_call()
error: uninitialized symbol 'ret'.
Fix this by initialising ret to 0. The value is only used for tracing
purposes in the rxperf server.
Fixes: 75bfdbf2fc ("rxrpc: Implement an in-kernel rxperf server for testing purposes")
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: http://lists.infradead.org/pipermail/linux-afs/2022-December/006124.html [1]
Signed-off-by: David S. Miller <davem@davemloft.net>
The rxrpc I/O thread checks to see if there's any work it needs to do, and
if not, checks kthread_should_stop() before scheduling, and if it should
stop, breaks out of the loop and tries to clean up and exit.
This can, however, race with socket destruction, wherein outstanding calls
are aborted and released from the socket and then the socket unuses the
local endpoint, causing kthread_stop() to be issued. The abort is deferred
to the I/O thread and the event can by issued between the I/O thread
checking if there's any work to be done (such as processing call aborts)
and the stop being seen.
This results in the I/O thread stopping processing of events whilst call
cleanup events are still outstanding, leading to connections or other
objects still being around and uncleaned up, which can result in assertions
being triggered, e.g.:
rxrpc: AF_RXRPC: Leaked client conn 00000000e8009865 {2}
------------[ cut here ]------------
kernel BUG at net/rxrpc/conn_client.c:64!
Fix this by retrieving the kthread_should_stop() indication, then checking
to see if there's more work to do, and going back round the loop if there
is, and breaking out of the loop only if there wasn't.
This was triggered by a syzbot test that produced some other symptom[1].
Fixes: a275da62e8 ("rxrpc: Create a per-local endpoint receive queue and I/O thread")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/0000000000002b4a9f05ef2b616f@google.com/ [1]
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the switched parameters on rxrpc_alloc_peer() and rxrpc_get_peer().
The ref argument and the why argument got mixed.
Fixes: 47c810a798 ("rxrpc: trace: Don't use __builtin_return_address for rxrpc_peer tracing")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that rxrpc_put_local() may call kthread_stop(), it can't be called
under spinlock as it might sleep. This can cause a problem in the peer
keepalive code in rxrpc as it tries to avoid dropping the peer_hash_lock
from the point it needs to re-add peer->keepalive_link to going round the
loop again in rxrpc_peer_keepalive_dispatch().
Fix this by just dropping the lock when we don't need it and accepting that
we'll have to take it again. This code is only called about every 20s for
each peer, so not very often.
This allows rxrpc_put_peer_unlocked() to be removed also.
If triggered, this bug produces an oops like the following, as reproduced
by a syzbot reproducer for a different oops[1]:
BUG: sleeping function called from invalid context at kernel/sched/completion.c:101
...
RCU nest depth: 0, expected: 0
3 locks held by kworker/u9:0/50:
#0: ffff88810e74a138 ((wq_completion)krxrpcd){+.+.}-{0:0}, at: process_one_work+0x294/0x636
#1: ffff8881013a7e20 ((work_completion)(&rxnet->peer_keepalive_work)){+.+.}-{0:0}, at: process_one_work+0x294/0x636
#2: ffff88817d366390 (&rxnet->peer_hash_lock){+.+.}-{2:2}, at: rxrpc_peer_keepalive_dispatch+0x2bd/0x35f
...
Call Trace:
<TASK>
dump_stack_lvl+0x4c/0x5f
__might_resched+0x2cf/0x2f2
__wait_for_common+0x87/0x1e8
kthread_stop+0x14d/0x255
rxrpc_peer_keepalive_dispatch+0x333/0x35f
rxrpc_peer_keepalive_worker+0x2e9/0x449
process_one_work+0x3c1/0x636
worker_thread+0x25f/0x359
kthread+0x1a6/0x1b5
ret_from_fork+0x1f/0x30
Fixes: a275da62e8 ("rxrpc: Create a per-local endpoint receive queue and I/O thread")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/0000000000002b4a9f05ef2b616f@google.com/ [1]
Signed-off-by: David S. Miller <davem@davemloft.net>
When starting a kthread, the __kthread_create_on_node() function, as called
from kthread_run(), waits for a completion to indicate that the task_struct
(or failure state) of the new kernel thread is available before continuing.
This does not wait, however, for the thread function to be invoked and,
indeed, will skip it if kthread_stop() gets called before it gets there.
If this happens, though, kthread_run() will have returned successfully,
indicating that the thread was started and returning the task_struct
pointer. The actual error indication is returned by kthread_stop().
Note that this is ambiguous, as the caller cannot tell whether the -EINTR
error code came from kthread() or from the thread function.
This was encountered in the new rxrpc I/O thread, where if the system is
being pounded hard by, say, syzbot, the check of KTHREAD_SHOULD_STOP can be
delayed long enough for kthread_stop() to get called when rxrpc releases a
socket - and this causes an oops because the I/O thread function doesn't
get started and thus doesn't remove the rxrpc_local struct from the
local_endpoints list.
Fix this by using a completion to wait for the thread to actually enter
rxrpc_io_thread(). This makes sure the thread can't be prematurely
stopped and makes sure the relied-upon cleanup is done.
Fixes: a275da62e8 ("rxrpc: Create a per-local endpoint receive queue and I/O thread")
Reported-by: syzbot+3538a6a72efa8b059c38@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Hillf Danton <hdanton@sina.com>
Link: https://lore.kernel.org/r/000000000000229f1505ef2b6159@google.com/
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the propagation of the security settings from sendmsg to the rxrpc_call
struct.
Fixes: f3441d4125 ("rxrpc: Copy client call parameters into rxrpc_call earlier")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
One of the error paths in rxrpc_do_sendmsg() doesn't unlock the call mutex
before returning. Fix it to do this.
Note that this still doesn't get rid of the checker warning:
../net/rxrpc/sendmsg.c:617:5: warning: context imbalance in 'rxrpc_do_sendmsg' - wrong count at exit
I think the interplay between the socket lock and the call's user_mutex may
be too complicated for checker to analyse, especially as
rxrpc_new_client_call_for_sendmsg(), which it calls, returns with the
call's user_mutex if successful but unconditionally drops the socket lock.
Fixes: e754eba685 ("rxrpc: Provide a cmsg to specify the amount of Tx data for a call")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David S. Miller <davem@davemloft.net>
When TCF_EM_SIMPLE was introduced, it is supposed to be convenient
for ematch implementation:
https://lore.kernel.org/all/20050105110048.GO26856@postel.suug.ch/
"You don't have to, providing a 32bit data chunk without TCF_EM_SIMPLE
set will simply result in allocating & copy. It's an optimization,
nothing more."
So if an ematch module provides ops->datalen that means it wants a
complex data structure (saved in its em->data) instead of a simple u32
value. We should simply reject such a combination, otherwise this u32
could be misinterpreted as a pointer.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-and-tested-by: syzbot+4caeae4c7103813598ae@syzkaller.appspotmail.com
Reported-by: Jun Nie <jun.nie@linaro.org>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* Randomize the per-cpu entry areas
Cleanups:
* Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open
coding it
* Move to "native" set_memory_rox() helper
* Clean up pmd_get_atomic() and i386-PAE
* Remove some unused page table size macros
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmOc53UACgkQaDWVMHDJ
krCUHw//SGZ+La0hLZLAiAiZTXLZZHpYkOmg1Oj1+11qSU11uZzTFqDpauhaKpRS
cJCSh+D+RXe5e2ipgt0+Zl0hESLt7pJf8258OE4ra0DL/IlyO9uqruAs9Kn3eRS/
Fk76nG8gdEU+JKJqpG02GqOLslYQuIy96n9hpuj1x25b614+uezPfC7S4XEat0NT
MbJQ+jnVDf16aJIJkzT+iSwhubDVeh+bSHeO0SSCzX23WLUqDeg5NvlyxoCHGbBh
UpUTWggV/0pYAkBKRHToeJs8qTWREwuuH/8JGewpe9A0tjdB5wyZfNL2PuracweN
9MauXC3T5f0+Ca4yIIaPq1fF7Ny/PR2dBFihk27rOD0N7tjaZxNwal2pB1sZcmvZ
+PAokjyTPVH5ZXjkMYGGAUe1jyjwr2+TgFSZxhTnDuGtyVQiY4pihGKOifLCX6tv
x6khvYeTBw7wfaDRtKEAf+2kLHYn+71HszHP/8bNKX9T03h+Zf0i1wdZu5xbM5Gc
VK2wR7bCC+UftJJYG0pldcHg2qaF19RBHK2tLwp7zngUv7lTbkKfkgKjre73KV2a
D4b76lrqdUMo6UYwYdw7WtDyarZS4OVLq2DcNhwwMddBCaX8kyN5a4AqwQlZYJ0u
dM+kuMofE8U3yMxmMhJimkZUsj09yLHIqfynY0jbAcU3nhKZZNY=
=wwVF
-----END PGP SIGNATURE-----
Merge tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Dave Hansen:
"New Feature:
- Randomize the per-cpu entry areas
Cleanups:
- Have CR3_ADDR_MASK use PHYSICAL_PAGE_MASK instead of open coding it
- Move to "native" set_memory_rox() helper
- Clean up pmd_get_atomic() and i386-PAE
- Remove some unused page table size macros"
* tag 'x86_mm_for_6.2_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
x86/mm: Ensure forced page table splitting
x86/kasan: Populate shadow for shared chunk of the CPU entry area
x86/kasan: Add helpers to align shadow addresses up and down
x86/kasan: Rename local CPU_ENTRY_AREA variables to shorten names
x86/mm: Populate KASAN shadow for entire per-CPU range of CPU entry area
x86/mm: Recompute physical address for every page of per-CPU CEA mapping
x86/mm: Rename __change_page_attr_set_clr(.checkalias)
x86/mm: Inhibit _PAGE_NX changes from cpa_process_alias()
x86/mm: Untangle __change_page_attr_set_clr(.checkalias)
x86/mm: Add a few comments
x86/mm: Fix CR3_ADDR_MASK
x86/mm: Remove P*D_PAGE_MASK and P*D_PAGE_SIZE macros
mm: Convert __HAVE_ARCH_P..P_GET to the new style
mm: Remove pointless barrier() after pmdp_get_lockless()
x86/mm/pae: Get rid of set_64bit()
x86_64: Remove pointless set_64bit() usage
x86/mm/pae: Be consistent with pXXp_get_and_clear()
x86/mm/pae: Use WRITE_ONCE()
x86/mm/pae: Don't (ab)use atomic64
mm/gup: Fix the lockless PMD access
...
Extending the tail can have some unexpected side effects if a program uses
a helper like BPF_FUNC_skb_pull_data to read partial content beyond the
head skb headlen when all the skbs in the gso frag_list are linear with no
head_frag -
kernel BUG at net/core/skbuff.c:4219!
pc : skb_segment+0xcf4/0xd2c
lr : skb_segment+0x63c/0xd2c
Call trace:
skb_segment+0xcf4/0xd2c
__udp_gso_segment+0xa4/0x544
udp4_ufo_fragment+0x184/0x1c0
inet_gso_segment+0x16c/0x3a4
skb_mac_gso_segment+0xd4/0x1b0
__skb_gso_segment+0xcc/0x12c
udp_rcv_segment+0x54/0x16c
udp_queue_rcv_skb+0x78/0x144
udp_unicast_rcv_skb+0x8c/0xa4
__udp4_lib_rcv+0x490/0x68c
udp_rcv+0x20/0x30
ip_protocol_deliver_rcu+0x1b0/0x33c
ip_local_deliver+0xd8/0x1f0
ip_rcv+0x98/0x1a4
deliver_ptype_list_skb+0x98/0x1ec
__netif_receive_skb_core+0x978/0xc60
Fix this by marking these skbs as GSO_DODGY so segmentation can handle
the tail updates accordingly.
Fixes: 3dcbdb134f ("net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list")
Signed-off-by: Sean Tranchetti <quic_stranche@quicinc.com>
Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/1671084718-24796-1-git-send-email-quic_subashab@quicinc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Take the instance lock around devlink_nl_fill() when dumping,
doit takes it already.
We are only dumping basic info so in the worst case we were risking
data races around the reload statistics. Until the big devlink mutex
was removed all relevant code was protected by it, so the missing
instance lock was not exposed.
Fixes: d3efc2a6a6 ("net: devlink: remove devlink_mutex")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://lore.kernel.org/r/20221216044122.1863550-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Here is the set of driver core and kernfs changes for 6.2-rc1.
The "big" change in here is the addition of a new macro,
container_of_const() that will preserve the "const-ness" of a pointer
passed into it.
The "problem" of the current container_of() macro is that if you pass in
a "const *", out of it can comes a non-const pointer unless you
specifically ask for it. For many usages, we want to preserve the
"const" attribute by using the same call. For a specific example, this
series changes the kobj_to_dev() macro to use it, allowing it to be used
no matter what the const value is. This prevents every subsystem from
having to declare 2 different individual macros (i.e.
kobj_const_to_dev() and kobj_to_dev()) and having the compiler enforce
the const value at build time, which having 2 macros would not do
either.
The driver for all of this have been discussions with the Rust kernel
developers as to how to properly mark driver core, and kobject, objects
as being "non-mutable". The changes to the kobject and driver core in
this pull request are the result of that, as there are lots of paths
where kobjects and device pointers are not modified at all, so marking
them as "const" allows the compiler to enforce this.
So, a nice side affect of the Rust development effort has been already
to clean up the driver core code to be more obvious about object rules.
All of this has been bike-shedded in quite a lot of detail on lkml with
different names and implementations resulting in the tiny version we
have in here, much better than my original proposal. Lots of subsystem
maintainers have acked the changes as well.
Other than this change, included in here are smaller stuff like:
- kernfs fixes and updates to handle lock contention better
- vmlinux.lds.h fixes and updates
- sysfs and debugfs documentation updates
- device property updates
All of these have been in the linux-next tree for quite a while with no
problems, OTHER than some merge issues with other trees that should be
obvious when you hit them (block tree deletes a driver that this tree
modifies, iommufd tree modifies code that this tree also touches). If
there are merge problems with these trees, please let me know.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCY5wz3A8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+yks0ACeKYUlVgCsER8eYW+x18szFa2QTXgAn2h/VhZe
1Fp53boFaQkGBjl8mGF8
=v+FB
-----END PGP SIGNATURE-----
Merge tag 'driver-core-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the set of driver core and kernfs changes for 6.2-rc1.
The "big" change in here is the addition of a new macro,
container_of_const() that will preserve the "const-ness" of a pointer
passed into it.
The "problem" of the current container_of() macro is that if you pass
in a "const *", out of it can comes a non-const pointer unless you
specifically ask for it. For many usages, we want to preserve the
"const" attribute by using the same call. For a specific example, this
series changes the kobj_to_dev() macro to use it, allowing it to be
used no matter what the const value is. This prevents every subsystem
from having to declare 2 different individual macros (i.e.
kobj_const_to_dev() and kobj_to_dev()) and having the compiler enforce
the const value at build time, which having 2 macros would not do
either.
The driver for all of this have been discussions with the Rust kernel
developers as to how to properly mark driver core, and kobject,
objects as being "non-mutable". The changes to the kobject and driver
core in this pull request are the result of that, as there are lots of
paths where kobjects and device pointers are not modified at all, so
marking them as "const" allows the compiler to enforce this.
So, a nice side affect of the Rust development effort has been already
to clean up the driver core code to be more obvious about object
rules.
All of this has been bike-shedded in quite a lot of detail on lkml
with different names and implementations resulting in the tiny version
we have in here, much better than my original proposal. Lots of
subsystem maintainers have acked the changes as well.
Other than this change, included in here are smaller stuff like:
- kernfs fixes and updates to handle lock contention better
- vmlinux.lds.h fixes and updates
- sysfs and debugfs documentation updates
- device property updates
All of these have been in the linux-next tree for quite a while with
no problems"
* tag 'driver-core-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (58 commits)
device property: Fix documentation for fwnode_get_next_parent()
firmware_loader: fix up to_fw_sysfs() to preserve const
usb.h: take advantage of container_of_const()
device.h: move kobj_to_dev() to use container_of_const()
container_of: add container_of_const() that preserves const-ness of the pointer
driver core: fix up missed drivers/s390/char/hmcdrv_dev.c class.devnode() conversion.
driver core: fix up missed scsi/cxlflash class.devnode() conversion.
driver core: fix up some missing class.devnode() conversions.
driver core: make struct class.devnode() take a const *
driver core: make struct class.dev_uevent() take a const *
cacheinfo: Remove of_node_put() for fw_token
device property: Add a blank line in Kconfig of tests
device property: Rename goto label to be more precise
device property: Move PROPERTY_ENTRY_BOOL() a bit down
device property: Get rid of __PROPERTY_ENTRY_ARRAY_EL*SIZE*()
kernfs: fix all kernel-doc warnings and multiple typos
driver core: pass a const * into of_device_uevent()
kobject: kset_uevent_ops: make name() callback take a const *
kobject: kset_uevent_ops: make filter() callback take a const *
kobject: make kobject_namespace take a const *
...
The commit mentioned below causes the ovs_flow_tbl_lookup() function
to be called with the masked key. However, it's supposed to be called
with the unmasked key. This due to the fact that the datapath supports
installing wider flows, and OVS relies on this behavior. For example
if ipv4(src=1.1.1.1/192.0.0.0, dst=1.1.1.2/192.0.0.0) exists, a wider
flow (smaller mask) of ipv4(src=192.1.1.1/128.0.0.0,dst=192.1.1.2/
128.0.0.0) is allowed to be added.
However, if we try to add a wildcard rule, the installation fails:
$ ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
ipv4(src=1.1.1.1/192.0.0.0,dst=1.1.1.2/192.0.0.0,frag=no)" 2
$ ovs-appctl dpctl/add-flow system@myDP "in_port(1),eth_type(0x0800), \
ipv4(src=192.1.1.1/0.0.0.0,dst=49.1.1.2/0.0.0.0,frag=no)" 2
ovs-vswitchd: updating flow table (File exists)
The reason is that the key used to determine if the flow is already
present in the system uses the original key ANDed with the mask.
This results in the IP address not being part of the (miniflow) key,
i.e., being substituted with an all-zero value. When doing the actual
lookup, this results in the key wrongfully matching the first flow,
and therefore the flow does not get installed.
This change reverses the commit below, but rather than having the key
on the stack, it's allocated.
Fixes: 190aa3e778 ("openvswitch: Fix Frame-size larger than 1024 bytes warning.")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Netdevsim triggers a splat on reload, when it destroys regions
with snapshots pending:
WARNING: CPU: 1 PID: 787 at net/core/devlink.c:6291 devlink_region_snapshot_del+0x12e/0x140
CPU: 1 PID: 787 Comm: devlink Not tainted 6.1.0-07460-g7ae9888d6e1c #580
RIP: 0010:devlink_region_snapshot_del+0x12e/0x140
Call Trace:
<TASK>
devl_region_destroy+0x70/0x140
nsim_dev_reload_down+0x2f/0x60 [netdevsim]
devlink_reload+0x1f7/0x360
devlink_nl_cmd_reload+0x6ce/0x860
genl_family_rcv_msg_doit.isra.0+0x145/0x1c0
This is the locking assert in devlink_region_snapshot_del(),
we're supposed to be holding the region->snapshot_lock here.
Fixes: 2dec18ad82 ("net: devlink: remove region snapshots list dependency on devlink->lock")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 9130b8dbc6 ("SUNRPC: allow for upcalls for the same uid
but different gss service") introduced `auth` argument to
__gss_find_upcall(), but in gss_pipe_downcall() it was left as NULL
since it (and auth->service) was not (yet) determined.
When multiple upcalls with the same uid and different service are
ongoing, it could happen that __gss_find_upcall(), which returns the
first match found in the pipe->in_downcall list, could not find the
correct gss_msg corresponding to the downcall we are looking for.
Moreover, it might return a msg which is not sent to rpc.gssd yet.
We could see mount.nfs process hung in D state with multiple mount.nfs
are executed in parallel. The call trace below is of CentOS 7.9
kernel-3.10.0-1160.24.1.el7.x86_64 but we observed the same hang w/
elrepo kernel-ml-6.0.7-1.el7.
PID: 71258 TASK: ffff91ebd4be0000 CPU: 36 COMMAND: "mount.nfs"
#0 [ffff9203ca3234f8] __schedule at ffffffffa3b8899f
#1 [ffff9203ca323580] schedule at ffffffffa3b88eb9
#2 [ffff9203ca323590] gss_cred_init at ffffffffc0355818 [auth_rpcgss]
#3 [ffff9203ca323658] rpcauth_lookup_credcache at ffffffffc0421ebc
[sunrpc]
#4 [ffff9203ca3236d8] gss_lookup_cred at ffffffffc0353633 [auth_rpcgss]
#5 [ffff9203ca3236e8] rpcauth_lookupcred at ffffffffc0421581 [sunrpc]
#6 [ffff9203ca323740] rpcauth_refreshcred at ffffffffc04223d3 [sunrpc]
#7 [ffff9203ca3237a0] call_refresh at ffffffffc04103dc [sunrpc]
#8 [ffff9203ca3237b8] __rpc_execute at ffffffffc041e1c9 [sunrpc]
#9 [ffff9203ca323820] rpc_execute at ffffffffc0420a48 [sunrpc]
The scenario is like this. Let's say there are two upcalls for
services A and B, A -> B in pipe->in_downcall, B -> A in pipe->pipe.
When rpc.gssd reads pipe to get the upcall msg corresponding to
service B from pipe->pipe and then writes the response, in
gss_pipe_downcall the msg corresponding to service A will be picked
because only uid is used to find the msg and it is before the one for
B in pipe->in_downcall. And the process waiting for the msg
corresponding to service A will be woken up.
Actual scheduing of that process might be after rpc.gssd processes the
next msg. In rpc_pipe_generic_upcall it clears msg->errno (for A).
The process is scheduled to see gss_msg->ctx == NULL and
gss_msg->msg.errno == 0, therefore it cannot break the loop in
gss_create_upcall and is never woken up after that.
This patch adds a simple check to ensure that a msg which is not
sent to rpc.gssd yet is not chosen as the matching upcall upon
receiving a downcall.
Signed-off-by: minoura makoto <minoura@valinux.co.jp>
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@nec.com>
Tested-by: Hiroshi Shimamoto <h-shimamoto@nec.com>
Cc: Trond Myklebust <trondmy@hammerspace.com>
Fixes: 9130b8dbc6 ("SUNRPC: allow for upcalls for same uid but different gss service")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Since commit fc7a6209d5 ("bus: Make remove callback return void")
forces bus_type::remove be void-returned, it doesn't make much sense for
any bus based driver implementing remove callbalk to return non-void to
its caller.
This change is for xen bus based drivers.
Acked-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Dawei Li <set_pte_at@outlook.com>
Link: https://lore.kernel.org/r/TYCP286MB23238119AB4DF190997075C9CAE39@TYCP286MB2323.JPNP286.PROD.OUTLOOK.COM
Signed-off-by: Juergen Gross <jgross@suse.com>
There is a race resulting in alive SOCK_SEQPACKET socket
may change its state from TCP_ESTABLISHED to TCP_CLOSE:
unix_release_sock(peer) unix_dgram_sendmsg(sk)
sock_orphan(peer)
sock_set_flag(peer, SOCK_DEAD)
sock_alloc_send_pskb()
if !(sk->sk_shutdown & SEND_SHUTDOWN)
OK
if sock_flag(peer, SOCK_DEAD)
sk->sk_state = TCP_CLOSE
sk->sk_shutdown = SHUTDOWN_MASK
After that socket sk remains almost normal: it is able to connect, listen, accept
and recvmsg, while it can't sendmsg.
Since this is the only possibility for alive SOCK_SEQPACKET to change
the state in such way, we should better fix this strange and potentially
danger corner case.
Note, that we will return EPIPE here like this is normally done in sock_alloc_send_pskb().
Originally used ECONNREFUSED looks strange, since it's strange to return
a specific retval in dependence of race in kernel, when user can't affect on this.
Also, move TCP_CLOSE assignment for SOCK_DGRAM sockets under state lock
to fix race with unix_dgram_connect():
unix_dgram_connect(other) unix_dgram_sendmsg(sk)
unix_peer(sk) = NULL
unix_state_unlock(sk)
unix_state_double_lock(sk, other)
sk->sk_state = TCP_ESTABLISHED
unix_peer(sk) = other
unix_state_double_unlock(sk, other)
sk->sk_state = TCP_CLOSED
This patch fixes both of these races.
Fixes: 83301b5367 ("af_unix: Set TCP_ESTABLISHED for datagram sockets too")
Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
Link: https://lore.kernel.org/r/135fda25-22d5-837a-782b-ceee50e19844@ya.ru
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
- Convert flexible array members, fix -Wstringop-overflow warnings,
and fix KCFI function type mismatches that went ignored by
maintainers (Gustavo A. R. Silva, Nathan Chancellor, Kees Cook).
- Remove the remaining side-effect users of ksize() by converting
dma-buf, btrfs, and coredump to using kmalloc_size_roundup(),
add more __alloc_size attributes, and introduce full testing
of all allocator functions. Finally remove the ksize() side-effect
so that each allocation-aware checker can finally behave without
exceptions.
- Introduce oops_limit (default 10,000) and warn_limit (default off)
to provide greater granularity of control for panic_on_oops and
panic_on_warn (Jann Horn, Kees Cook).
- Introduce overflows_type() and castable_to_type() helpers for
cleaner overflow checking.
- Improve code generation for strscpy() and update str*() kern-doc.
- Convert strscpy and sigphash tests to KUnit, and expand memcpy
tests.
- Always use a non-NULL argument for prepare_kernel_cred().
- Disable structleak plugin in FORTIFY KUnit test (Anders Roxell).
- Adjust orphan linker section checking to respect CONFIG_WERROR
(Xin Li).
- Make sure siginfo is cleared for forced SIGKILL (haifeng.xu).
- Fix um vs FORTIFY warnings for always-NULL arguments.
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmOZSOoWHGtlZXNjb29r
QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJjAAD/0YkvpU7f03f8hcQMJK6wv//24K
AW41hEaBikq9RcmkuvkLLrJRibGgZ5O2xUkUkxRs/HxhkhrZ0kEw8sbwZe8MoWls
F4Y9+TDjsrdHmjhfcBZdLnVxwcKK5wlaEcpjZXtbsfcdhx3TbgcDA23YELl5t0K+
I11j4kYmf9SLl4CwIrSP5iACml8CBHARDh8oIMF7FT/LrjNbM8XkvBcVVT6hTbOV
yjgA8WP2e9GXvj9GzKgqvd0uE/kwPkVAeXLNFWopPi4FQ8AWjlxbBZR0gamA6/EB
d7TIs0ifpVU2JGQaTav4xO6SsFMj3ntoUI0qIrFaTxZAvV4KYGrPT/Kwz1O4SFaG
rN5lcxseQbPQSBTFNG4zFjpywTkVCgD2tZqDwz5Rrmiraz0RyIokCN+i4CD9S0Ds
oEd8JSyLBk1sRALczkuEKo0an5AyC9YWRcBXuRdIHpLo08PsbeUUSe//4pe303cw
0ApQxYOXnrIk26MLElTzSMImlSvlzW6/5XXzL9ME16leSHOIfDeerPnc9FU9Eb3z
ODv22z6tJZ9H/apSUIHZbMciMbbVTZ8zgpkfydr08o87b342N/ncYHZ5cSvQ6DWb
jS5YOIuvl46/IhMPT16qWC8p0bP5YhxoPv5l6Xr0zq0ooEj0E7keiD/SzoLvW+Qs
AHXcibguPRQBPAdiPQ==
=yaaN
-----END PGP SIGNATURE-----
Merge tag 'hardening-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull kernel hardening updates from Kees Cook:
- Convert flexible array members, fix -Wstringop-overflow warnings, and
fix KCFI function type mismatches that went ignored by maintainers
(Gustavo A. R. Silva, Nathan Chancellor, Kees Cook)
- Remove the remaining side-effect users of ksize() by converting
dma-buf, btrfs, and coredump to using kmalloc_size_roundup(), add
more __alloc_size attributes, and introduce full testing of all
allocator functions. Finally remove the ksize() side-effect so that
each allocation-aware checker can finally behave without exceptions
- Introduce oops_limit (default 10,000) and warn_limit (default off) to
provide greater granularity of control for panic_on_oops and
panic_on_warn (Jann Horn, Kees Cook)
- Introduce overflows_type() and castable_to_type() helpers for cleaner
overflow checking
- Improve code generation for strscpy() and update str*() kern-doc
- Convert strscpy and sigphash tests to KUnit, and expand memcpy tests
- Always use a non-NULL argument for prepare_kernel_cred()
- Disable structleak plugin in FORTIFY KUnit test (Anders Roxell)
- Adjust orphan linker section checking to respect CONFIG_WERROR (Xin
Li)
- Make sure siginfo is cleared for forced SIGKILL (haifeng.xu)
- Fix um vs FORTIFY warnings for always-NULL arguments
* tag 'hardening-v6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (31 commits)
ksmbd: replace one-element arrays with flexible-array members
hpet: Replace one-element array with flexible-array member
um: virt-pci: Avoid GCC non-NULL warning
signal: Initialize the info in ksignal
lib: fortify_kunit: build without structleak plugin
panic: Expose "warn_count" to sysfs
panic: Introduce warn_limit
panic: Consolidate open-coded panic_on_warn checks
exit: Allow oops_limit to be disabled
exit: Expose "oops_count" to sysfs
exit: Put an upper limit on how often we can oops
panic: Separate sysctl logic from CONFIG_SMP
mm/pgtable: Fix multiple -Wstringop-overflow warnings
mm: Make ksize() a reporting-only function
kunit/fortify: Validate __alloc_size attribute results
drm/sti: Fix return type of sti_{dvo,hda,hdmi}_connector_mode_valid()
drm/fsl-dcu: Fix return type of fsl_dcu_drm_connector_mode_valid()
driver core: Add __alloc_size hint to devm allocators
overflow: Introduce overflows_type() and castable_to_type()
coredump: Proactively round up to kmalloc bucket size
...
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
1) Fix NAT IPv6 flowtable hardware offload, from Qingfang DENG.
2) Add a safety check to IPVS socket option interface report a
warning if unsupported command is seen, this. From Li Qiong.
3) Document SCTP conntrack timeouts, from Sriram Yagnaraman.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: conntrack: document sctp timeouts
ipvs: add a 'default' case in do_ip_vs_set_ctl()
netfilter: flowtable: really fix NAT IPv6 offload
====================
Link: https://lore.kernel.org/r/20221213140923.154594-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Core
----
- Allow live renaming when an interface is up
- Add retpoline wrappers for tc, improving considerably the
performances of complex queue discipline configurations.
- Add inet drop monitor support.
- A few GRO performance improvements.
- Add infrastructure for atomic dev stats, addressing long standing
data races.
- De-duplicate common code between OVS and conntrack offloading
infrastructure.
- A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements.
- Netfilter: introduce packet parser for tunneled packets
- Replace IPVS timer-based estimators with kthreads to scale up
the workload with the number of available CPUs.
- Add the helper support for connection-tracking OVS offload.
BPF
---
- Support for user defined BPF objects: the use case is to allocate
own objects, build own object hierarchies and use the building
blocks to build own data structures flexibly, for example, linked
lists in BPF.
- Make cgroup local storage available to non-cgroup attached BPF
programs.
- Avoid unnecessary deadlock detection and failures wrt BPF task
storage helpers.
- A relevant bunch of BPF verifier fixes and improvements.
- Veristat tool improvements to support custom filtering, sorting,
and replay of results.
- Add LLVM disassembler as default library for dumping JITed code.
- Lots of new BPF documentation for various BPF maps.
- Add bpf_rcu_read_{,un}lock() support for sleepable programs.
- Add RCU grace period chaining to BPF to wait for the completion
of access from both sleepable and non-sleepable BPF programs.
- Add support storing struct task_struct objects as kptrs in maps.
- Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
values.
- Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions.
Protocols
---------
- TCP: implement Protective Load Balancing across switch links.
- TCP: allow dynamically disabling TCP-MD5 static key, reverting
back to fast[er]-path.
- UDP: Introduce optional per-netns hash lookup table.
- IPv6: simplify and cleanup sockets disposal.
- Netlink: support different type policies for each generic
netlink operation.
- MPTCP: add MSG_FASTOPEN and FastOpen listener side support.
- MPTCP: add netlink notification support for listener sockets
events.
- SCTP: add VRF support, allowing sctp sockets binding to VRF
devices.
- Add bridging MAC Authentication Bypass (MAB) support.
- Extensions for Ethernet VPN bridging implementation to better
support multicast scenarios.
- More work for Wi-Fi 7 support, comprising conversion of all
the existing drivers to internal TX queue usage.
- IPSec: introduce a new offload type (packet offload) allowing
complete header processing and crypto offloading.
- IPSec: extended ack support for more descriptive XFRM error
reporting.
- RXRPC: increase SACK table size and move processing into a
per-local endpoint kernel thread, reducing considerably the
required locking.
- IEEE 802154: synchronous send frame and extended filtering
support, initial support for scanning available 15.4 networks.
- Tun: bump the link speed from 10Mbps to 10Gbps.
- Tun/VirtioNet: implement UDP segmentation offload support.
Driver API
----------
- PHY/SFP: improve power level switching between standard
level 1 and the higher power levels.
- New API for netdev <-> devlink_port linkage.
- PTP: convert existing drivers to new frequency adjustment
implementation.
- DSA: add support for rx offloading.
- Autoload DSA tagging driver when dynamically changing protocol.
- Add new PCP and APPTRUST attributes to Data Center Bridging.
- Add configuration support for 800Gbps link speed.
- Add devlink port function attribute to enable/disable RoCE and
migratable.
- Extend devlink-rate to support strict prioriry and weighted fair
queuing.
- Add devlink support to directly reading from region memory.
- New device tree helper to fetch MAC address from nvmem.
- New big TCP helper to simplify temporary header stripping.
New hardware / drivers
----------------------
- Ethernet:
- Marvel Octeon CNF95N and CN10KB Ethernet Switches.
- Marvel Prestera AC5X Ethernet Switch.
- WangXun 10 Gigabit NIC.
- Motorcomm yt8521 Gigabit Ethernet.
- Microchip ksz9563 Gigabit Ethernet Switch.
- Microsoft Azure Network Adapter.
- Linux Automation 10Base-T1L adapter.
- PHY:
- Aquantia AQR112 and AQR412.
- Motorcomm YT8531S.
- PTP:
- Orolia ART-CARD.
- WiFi:
- MediaTek Wi-Fi 7 (802.11be) devices.
- RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
devices.
- Bluetooth:
- Broadcom BCM4377/4378/4387 Bluetooth chipsets.
- Realtek RTL8852BE and RTL8723DS.
- Cypress.CYW4373A0 WiFi + Bluetooth combo device.
Drivers
-------
- CAN:
- gs_usb: bus error reporting support.
- kvaser_usb: listen only and bus error reporting support.
- Ethernet NICs:
- Intel (100G):
- extend action skbedit to RX queue mapping.
- implement devlink-rate support.
- support direct read from memory.
- nVidia/Mellanox (mlx5):
- SW steering improvements, increasing rules update rate.
- Support for enhanced events compression.
- extend H/W offload packet manipulation capabilities.
- implement IPSec packet offload mode.
- nVidia/Mellanox (mlx4):
- better big TCP support.
- Netronome Ethernet NICs (nfp):
- IPsec offload support.
- add support for multicast filter.
- Broadcom:
- RSS and PTP support improvements.
- AMD/SolarFlare:
- netlink extened ack improvements.
- add basic flower matches to offload, and related stats.
- Virtual NICs:
- ibmvnic: introduce affinity hint support.
- small / embedded:
- FreeScale fec: add initial XDP support.
- Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood.
- TI am65-cpsw: add suspend/resume support.
- Mediatek MT7986: add RX wireless wthernet dispatch support.
- Realtek 8169: enable GRO software interrupt coalescing per
default.
- Ethernet high-speed switches:
- Microchip (sparx5):
- add support for Sparx5 TC/flower H/W offload via VCAP.
- Mellanox mlxsw:
- add 802.1X and MAC Authentication Bypass offload support.
- add ip6gre support.
- Embedded Ethernet switches:
- Mediatek (mtk_eth_soc):
- improve PCS implementation, add DSA untag support.
- enable flow offload support.
- Renesas:
- add rswitch R-Car Gen4 gPTP support.
- Microchip (lan966x):
- add full XDP support.
- add TC H/W offload via VCAP.
- enable PTP on bridge interfaces.
- Microchip (ksz8):
- add MTU support for KSZ8 series.
- Qualcomm 802.11ax WiFi (ath11k):
- support configuring channel dwell time during scan.
- MediaTek WiFi (mt76):
- enable Wireless Ethernet Dispatch (WED) offload support.
- add ack signal support.
- enable coredump support.
- remain_on_channel support.
- Intel WiFi (iwlwifi):
- enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities.
- 320 MHz channels support.
- RealTek WiFi (rtw89):
- new dynamic header firmware format support.
- wake-over-WLAN support.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmOYXUcSHHBhYmVuaUBy
ZWRoYXQuY29tAAoJECkkeY3MjxOk8zQP/R7BZtbJMTPiWkRnSoKHnAyupDVwrz5U
ktukLkwPsCyJuEbAjgxrxf4EEEQ9uq2FFlxNSYuKiiQMqIpFxV6KED7LCUygn4Tc
kxtkp0Q+5XiqisWlQmtfExf2OjuuPqcjV9tWCDBI6GebKUbfNwY/eI44RcMu4BSv
DzIlW5GkX/kZAPqnnuqaLsN3FudDTJHGEAD7NbA++7wJ076RWYSLXlFv0Z+SCSPS
H8/PEG0/ZK/65rIWMAFRClJ9BNIDwGVgp0GrsIvs1gqbRUOlA1hl1rDM21TqtNFf
5QPQT7sIfTcCE/nerxKJD5JE3JyP+XRlRn96PaRw3rt4MgI6I/EOj/HOKQ5tMCNc
oPiqb7N70+hkLZyr42qX+vN9eDPjp2koEQm7EO2Zs+/534/zWDs24Zfk/Aa1ps0I
Fa82oGjAgkBhGe/FZ6i5cYoLcyxqRqZV1Ws9XQMl72qRC7/BwvNbIW6beLpCRyeM
yYIU+0e9dEm+wHQEdh2niJuVtR63hy8tvmPx56lyh+6u0+pondkwbfSiC5aD3kAC
ikKsN5DyEsdXyiBAlytCEBxnaOjQy4RAz+3YXSiS0eBNacXp03UUrNGx4Pzpu/D0
QLFJhBnMFFCgy5to8/DvKnrTPgZdSURwqbIUcZdvU21f1HLR8tUTpaQnYffc/Whm
V8gnt1EL+0cc
=CbJC
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni:
"Core:
- Allow live renaming when an interface is up
- Add retpoline wrappers for tc, improving considerably the
performances of complex queue discipline configurations
- Add inet drop monitor support
- A few GRO performance improvements
- Add infrastructure for atomic dev stats, addressing long standing
data races
- De-duplicate common code between OVS and conntrack offloading
infrastructure
- A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements
- Netfilter: introduce packet parser for tunneled packets
- Replace IPVS timer-based estimators with kthreads to scale up the
workload with the number of available CPUs
- Add the helper support for connection-tracking OVS offload
BPF:
- Support for user defined BPF objects: the use case is to allocate
own objects, build own object hierarchies and use the building
blocks to build own data structures flexibly, for example, linked
lists in BPF
- Make cgroup local storage available to non-cgroup attached BPF
programs
- Avoid unnecessary deadlock detection and failures wrt BPF task
storage helpers
- A relevant bunch of BPF verifier fixes and improvements
- Veristat tool improvements to support custom filtering, sorting,
and replay of results
- Add LLVM disassembler as default library for dumping JITed code
- Lots of new BPF documentation for various BPF maps
- Add bpf_rcu_read_{,un}lock() support for sleepable programs
- Add RCU grace period chaining to BPF to wait for the completion of
access from both sleepable and non-sleepable BPF programs
- Add support storing struct task_struct objects as kptrs in maps
- Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
values
- Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions
Protocols:
- TCP: implement Protective Load Balancing across switch links
- TCP: allow dynamically disabling TCP-MD5 static key, reverting back
to fast[er]-path
- UDP: Introduce optional per-netns hash lookup table
- IPv6: simplify and cleanup sockets disposal
- Netlink: support different type policies for each generic netlink
operation
- MPTCP: add MSG_FASTOPEN and FastOpen listener side support
- MPTCP: add netlink notification support for listener sockets events
- SCTP: add VRF support, allowing sctp sockets binding to VRF devices
- Add bridging MAC Authentication Bypass (MAB) support
- Extensions for Ethernet VPN bridging implementation to better
support multicast scenarios
- More work for Wi-Fi 7 support, comprising conversion of all the
existing drivers to internal TX queue usage
- IPSec: introduce a new offload type (packet offload) allowing
complete header processing and crypto offloading
- IPSec: extended ack support for more descriptive XFRM error
reporting
- RXRPC: increase SACK table size and move processing into a
per-local endpoint kernel thread, reducing considerably the
required locking
- IEEE 802154: synchronous send frame and extended filtering support,
initial support for scanning available 15.4 networks
- Tun: bump the link speed from 10Mbps to 10Gbps
- Tun/VirtioNet: implement UDP segmentation offload support
Driver API:
- PHY/SFP: improve power level switching between standard level 1 and
the higher power levels
- New API for netdev <-> devlink_port linkage
- PTP: convert existing drivers to new frequency adjustment
implementation
- DSA: add support for rx offloading
- Autoload DSA tagging driver when dynamically changing protocol
- Add new PCP and APPTRUST attributes to Data Center Bridging
- Add configuration support for 800Gbps link speed
- Add devlink port function attribute to enable/disable RoCE and
migratable
- Extend devlink-rate to support strict prioriry and weighted fair
queuing
- Add devlink support to directly reading from region memory
- New device tree helper to fetch MAC address from nvmem
- New big TCP helper to simplify temporary header stripping
New hardware / drivers:
- Ethernet:
- Marvel Octeon CNF95N and CN10KB Ethernet Switches
- Marvel Prestera AC5X Ethernet Switch
- WangXun 10 Gigabit NIC
- Motorcomm yt8521 Gigabit Ethernet
- Microchip ksz9563 Gigabit Ethernet Switch
- Microsoft Azure Network Adapter
- Linux Automation 10Base-T1L adapter
- PHY:
- Aquantia AQR112 and AQR412
- Motorcomm YT8531S
- PTP:
- Orolia ART-CARD
- WiFi:
- MediaTek Wi-Fi 7 (802.11be) devices
- RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
devices
- Bluetooth:
- Broadcom BCM4377/4378/4387 Bluetooth chipsets
- Realtek RTL8852BE and RTL8723DS
- Cypress.CYW4373A0 WiFi + Bluetooth combo device
Drivers:
- CAN:
- gs_usb: bus error reporting support
- kvaser_usb: listen only and bus error reporting support
- Ethernet NICs:
- Intel (100G):
- extend action skbedit to RX queue mapping
- implement devlink-rate support
- support direct read from memory
- nVidia/Mellanox (mlx5):
- SW steering improvements, increasing rules update rate
- Support for enhanced events compression
- extend H/W offload packet manipulation capabilities
- implement IPSec packet offload mode
- nVidia/Mellanox (mlx4):
- better big TCP support
- Netronome Ethernet NICs (nfp):
- IPsec offload support
- add support for multicast filter
- Broadcom:
- RSS and PTP support improvements
- AMD/SolarFlare:
- netlink extened ack improvements
- add basic flower matches to offload, and related stats
- Virtual NICs:
- ibmvnic: introduce affinity hint support
- small / embedded:
- FreeScale fec: add initial XDP support
- Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood
- TI am65-cpsw: add suspend/resume support
- Mediatek MT7986: add RX wireless wthernet dispatch support
- Realtek 8169: enable GRO software interrupt coalescing per
default
- Ethernet high-speed switches:
- Microchip (sparx5):
- add support for Sparx5 TC/flower H/W offload via VCAP
- Mellanox mlxsw:
- add 802.1X and MAC Authentication Bypass offload support
- add ip6gre support
- Embedded Ethernet switches:
- Mediatek (mtk_eth_soc):
- improve PCS implementation, add DSA untag support
- enable flow offload support
- Renesas:
- add rswitch R-Car Gen4 gPTP support
- Microchip (lan966x):
- add full XDP support
- add TC H/W offload via VCAP
- enable PTP on bridge interfaces
- Microchip (ksz8):
- add MTU support for KSZ8 series
- Qualcomm 802.11ax WiFi (ath11k):
- support configuring channel dwell time during scan
- MediaTek WiFi (mt76):
- enable Wireless Ethernet Dispatch (WED) offload support
- add ack signal support
- enable coredump support
- remain_on_channel support
- Intel WiFi (iwlwifi):
- enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities
- 320 MHz channels support
- RealTek WiFi (rtw89):
- new dynamic header firmware format support
- wake-over-WLAN support"
* tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2002 commits)
ipvs: fix type warning in do_div() on 32 bit
net: lan966x: Remove a useless test in lan966x_ptp_add_trap()
net: ipa: add IPA v4.7 support
dt-bindings: net: qcom,ipa: Add SM6350 compatible
bnxt: Use generic HBH removal helper in tx path
IPv6/GRO: generic helper to remove temporary HBH/jumbo header in driver
selftests: forwarding: Add bridge MDB test
selftests: forwarding: Rename bridge_mdb test
bridge: mcast: Support replacement of MDB port group entries
bridge: mcast: Allow user space to specify MDB entry routing protocol
bridge: mcast: Allow user space to add (*, G) with a source list and filter mode
bridge: mcast: Add support for (*, G) with a source list and filter mode
bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source
bridge: mcast: Add a flag for user installed source entries
bridge: mcast: Expose __br_multicast_del_group_src()
bridge: mcast: Expose br_multicast_new_group_src()
bridge: mcast: Add a centralized error path
bridge: mcast: Place netlink policy before validation functions
bridge: mcast: Split (*, G) and (S, G) addition into different functions
bridge: mcast: Do not derive entry type from its filter mode
...
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmOXmxkUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQ6iDy2pc3iXMPXg//cxfYC8lRtVpuGNCZWDietSiHzpzu
+qFntaTplvybJMQX0HfgNee5cTBZM+W5mp1BHRcZInvV5LRhyrVtgsxDBifutE4x
LyUJAw5SkiPdRC+XLDIRLKiZCobFBLVs2zO+qibIqsyR60pFjU6WXBLbJfidXBFR
yWudDbLU0YhQJCHdNHNqnHCgqrEculxn6q3QPvm/DX0xzBwkFHSSYBkGNvHW2ZTA
lKNreEOwEk5DTLIKjP4bJ72ixp0xbshw5CXuxtwB/12/4h8QbWbJVQLlIeZrTLmp
zQXQLJ3pCqKJ2OUCgMDK+wmkvLezd80BV3Due7KX0pT0YRDygoh5QEpZ5/8k8eG7
prxToh2gJWk2htfJF6kgMpAh9Jqewcke4BysbYVM/427OPZYwQqLDZDGOzbtT6pl
FYF+adN9wwkAErnHnPlzYipUEpBWurbjtsV8KFWNERoZ4YmzfSPEisRqGIHDGRws
bTyq/7qs5FXkb1zULELj8V+S2ULsmxPqsxJ63p9di54Uo9lHK0I+0IUtajGDdfze
psAasa9DD/oH2PAbSmpQ5Xo9XyfHRXsVuz1twEmEA14ML0m4wHbNWVHaK0aaXVdG
kJKSDSjMsiV+GiwNo7ISJ4pVdUpnMI/iZSghFfV28cJslNhJDeaREHaE/Wtn1/xF
/bCVmEfS16UoJsQ=
=klFk
-----END PGP SIGNATURE-----
Merge tag 'lsm-pr-20221212' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull lsm updates from Paul Moore:
- Improve the error handling in the device cgroup such that memory
allocation failures when updating the access policy do not
potentially alter the policy.
- Some minor fixes to reiserfs to ensure that it properly releases
LSM-related xattr values.
- Update the security_socket_getpeersec_stream() LSM hook to take
sockptr_t values.
Previously the net/BPF folks updated the getsockopt code in the
network stack to leverage the sockptr_t type to make it easier to
pass both kernel and __user pointers, but unfortunately when they did
so they didn't convert the LSM hook.
While there was/is no immediate risk by not converting the LSM hook,
it seems like this is a mistake waiting to happen so this patch
proactively does the LSM hook conversion.
- Convert vfs_getxattr_alloc() to return an int instead of a ssize_t
and cleanup the callers. Internally the function was never going to
return anything larger than an int and the callers were doing some
very odd things casting the return value; this patch fixes all that
and helps bring a bit of sanity to vfs_getxattr_alloc() and its
callers.
- More verbose, and helpful, LSM debug output when the system is booted
with "lsm.debug" on the command line. There are examples in the
commit description, but the quick summary is that this patch provides
better information about which LSMs are enabled and the ordering in
which they are processed.
- General comment and kernel-doc fixes and cleanups.
* tag 'lsm-pr-20221212' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
lsm: Fix description of fs_context_parse_param
lsm: Add/fix return values in lsm_hooks.h and fix formatting
lsm: Clarify documentation of vm_enough_memory hook
reiserfs: Add missing calls to reiserfs_security_free()
lsm,fs: fix vfs_getxattr_alloc() return type and caller error paths
device_cgroup: Roll back to original exceptions after copy failure
LSM: Better reporting of actual LSMs at boot
lsm: make security_socket_getpeersec_stream() sockptr_t safe
audit: Fix some kernel-doc warnings
lsm: remove obsoleted comments for security hooks
fs: edit a comment made in bad taste
Highlights include:
Bugfixes
- Fix a NULL pointer dereference in the mount parser
- Fix a memory stomp in decode_attr_security_label
- Fix a credential leak in _nfs4_discover_trunking()
- Fix a buffer leak in rpcrdma_req_create()
- Fix a leaked socket in rpc_sockname()
- Fix a deadlock between nfs4_open_recover_helper() and delegreturn
- Fix an Oops in nfs_d_automount()
- Fix a potential race in nfs_call_unlink()
- Multiple fixes for the open context mode
- NFSv4.2 READ_PLUS fixes
- Fix a regression in which small rsize/wsize values are being forbidden
- Fail client initialisation if the NFSv4.x state manager thread can't run
- avoid spurious warning of lost lock that is being unlocked.
- Ensure the initialisation of struct nfs4_label
Features and cleanups
- Trigger the "ls -l" readdir heuristic sooner
- Clear the file access cache upon login to ensure supplementary group
info is in sync between the client and server
- pnfs: Fix up the logging of layout stateids
- NFSv4.2: Change the default KConfig value for READ_PLUS
- Use sysfs_emit() instead of scnprintf() where appropriate
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAmOYmTQACgkQZwvnipYK
APLxWw//dAFjHXc3YOLi1J94nINZbOyAvdS2WAPVLRfMh8mEln9JAxVsUREz279N
02EFbW/ZAN3nd2SGdYgbdzTIIJ154LlDS9DmIdwgqK2I/tEZauFeDTpTBDIDNwMg
eruU4twttbgrE7lgVlE0ZDGlvMNFao462BriPeEN+WaaqQWB2NqMp/qoh1i4mbcn
rQPPiovn0LiW5K5ulMT5OFz9y4I4oNaJV77sPSUew0M0iLbGtI1OdrF844+8gzb0
BeMW/ArvM5TpuwMOGmzdt4pql8MTPf4qvcnafiMX69Eg2asGijZlr/AxOpSOYXSl
LDmAXczQ0VSsXYkKI2bJz1d5xWeFL4ZJVsaR8W/M4aewgqHWtAIs/b+Gk2x/S+Zo
s1lGFrHrvzhXd/c8LxuH+LBUxLqy6nep0sVWKOK/eQ79BUo8lLwev8YbNk8Sk4Rh
N2SrcJQeHu9uYKUbWuRWt5r2L9ffYVcSQQfEk5IzCvRuz6B7sFXtwNJK8rPjz78/
LLHNokQIPHUTYUa2P+zKwozwSwOc6y6eqEhxH8CvDJlo5OfDvT+9deyNb8ZOUxiq
UXHXG2qQivU0wTaqoowLp8D8FT3MQB9KkUZkikGf9d884qXgbWipx0kGOp+Y6JJp
JesK1KxcxtKZRiywBSz6fhWW78VknYWf4RR6cESjgpUSbuvblfE=
=Mz1l
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.2-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust
"Bugfixes:
- Fix NULL pointer dereference in the mount parser
- Fix memory stomp in decode_attr_security_label
- Fix credential leak in _nfs4_discover_trunking()
- Fix buffer leak in rpcrdma_req_create()
- Fix leaked socket in rpc_sockname()
- Fix deadlock between nfs4_open_recover_helper() and delegreturn
- Fix an Oops in nfs_d_automount()
- Fix potential race in nfs_call_unlink()
- Multiple fixes for the open context mode
- NFSv4.2 READ_PLUS fixes
- Fix a regression in which small rsize/wsize values are being
forbidden
- Fail client initialisation if the NFSv4.x state manager thread
can't run
- Avoid spurious warning of lost lock that is being unlocked.
- Ensure the initialisation of struct nfs4_label
Features and cleanups:
- Trigger the "ls -l" readdir heuristic sooner
- Clear the file access cache upon login to ensure supplementary
group info is in sync between the client and server
- pnfs: Fix up the logging of layout stateids
- NFSv4.2: Change the default KConfig value for READ_PLUS
- Use sysfs_emit() instead of scnprintf() where appropriate"
* tag 'nfs-for-6.2-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (24 commits)
NFSv4.2: Change the default KConfig value for READ_PLUS
NFSv4.x: Fail client initialisation if state manager thread can't run
fs: nfs: sysfs: use sysfs_emit() to instead of scnprintf()
NFS: use sysfs_emit() to instead of scnprintf()
NFS: Allow very small rsize & wsize again
NFSv4.2: Fix up READ_PLUS alignment
NFSv4.2: Set the correct size scratch buffer for decoding READ_PLUS
SUNRPC: Fix missing release socket in rpc_sockname()
xprtrdma: Fix regbuf data not freed in rpcrdma_req_create()
NFS: avoid spurious warning of lost lock that is being unlocked.
nfs: fix possible null-ptr-deref when parsing param
NFSv4: check FMODE_EXEC from open context mode in nfs4_opendata_access()
NFS: make sure open context mode have FMODE_EXEC when file open for exec
NFS4.x/pnfs: Fix up logging of layout stateids
NFS: Fix a race in nfs_call_unlink()
NFS: Fix an Oops in nfs_d_automount()
NFSv4: Fix a deadlock between nfs4_open_recover_helper() and delegreturn
NFSv4: Fix a credential leak in _nfs4_discover_trunking()
NFS: Trigger the "ls -l" readdir heuristic sooner
NFSv4.2: Fix initialisation of struct nfs4_label
...
It is better to return the default switch case with
'-EINVAL', in case new commands are added. otherwise,
return a uninitialized value of ret.
Signed-off-by: Li Qiong <liqiong@nfschina.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
32 bit platforms without 64bit div generate the following warning:
net/netfilter/ipvs/ip_vs_est.c: In function 'ip_vs_est_calc_limits':
include/asm-generic/div64.h:222:35: warning: comparison of distinct pointer types lacks a cast
222 | (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
| ^~
net/netfilter/ipvs/ip_vs_est.c:694:17: note: in expansion of macro 'do_div'
694 | do_div(val, loops);
| ^~~~~~
include/asm-generic/div64.h:222:35: warning: comparison of distinct pointer types lacks a cast
222 | (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
| ^~
net/netfilter/ipvs/ip_vs_est.c:700:33: note: in expansion of macro 'do_div'
700 | do_div(val, min_est);
| ^~~~~~
first argument of do_div() should be unsigned. We can't just cast
as do_div() updates it as well, so we need an lval.
Make val unsigned in the first place, all paths check that the value
they assign to this variables are non-negative already.
Fixes: 705dd34440 ("ipvs: use kthreads for stats estimation")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20221213032037.844517-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Merge in the left-over fixes before the net-next pull-request.
net/mptcp/subflow.c
d3295fee3c ("mptcp: use proper req destructor for IPv6")
36b122baf6 ("mptcp: add subflow_v(4,6)_send_synack()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This release introduces support for the CB_RECALL_ANY operation.
NFSD can send this operation to request that clients return any
delegations they choose. The server uses this operation to handle
low memory scenarios or indicate to a client when that client has
reached the maximum number of delegations the server supports.
The NFSv4.2 READ_PLUS operation has been simplified temporarily
whilst support for sparse files in local filesystems and the VFS is
improved.
Two major data structure fixes appear in this release:
* The nfs4_file hash table is replaced with a resizable hash table
to reduce the latency of NFSv4 OPEN operations.
* Reference counting in the NFSD filecache has been hardened against
races.
In furtherance of removing support for NFSv2 in a subsequent kernel
release, a new Kconfig option enables server-side support for NFSv2
to be left out of a kernel build.
MAINTAINERS has been updated to indicate that changes to fs/exportfs
should go through the NFSD tree.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmOXPE4ACgkQM2qzM29m
f5faaRAAh7YT5V61afPbfgBybO5AbDzztpZSNjNjLZs78piSnFp6hP75yNtTviwQ
1o7St13/NkCmDaIdGUpr02U01zbM1BDOq2wGckImOJLNSgb7xHV5r4PqkRiFkh0t
QYSnwG+wp8fDUJeCL/nAOAu9I9EQUqHzWchxiU/h8ln2hN3rXUlIRSeo17Wy7zkD
cNIcoAjTi9fzY3dE6H4r+lZTdNCYH+AdzChmKrHdRZQwq0Xs3FWv4gAMTLbDuD4P
B6NDHz0Umn6XnFsJGptwozkwaWeMQw4GyJj/3iUiO8JF209SaoYXMPjJAyG6tYYa
fUrgv4UXGeXjigDbLBA5IYxfhX7GXjMQSaj3edhzyrl8P74q4/Cq/8fDUnAZ841m
E+TGSCPIQD0QuIjdXxLv9KLY8JNThSfcAt6jr5GBXhPZQr8xpS0BqK/Onr68fgZC
Lpull5xN68L4A1B7cf2GNPuMyvkBKxwSGXOehldh/BkvpVMjFwqd4/q5xWC+6CcQ
tbOkjTbbSS71nzJwZip0NphaYCa3qQPzKT4SZzn/I4I9W5otbwYBx734Bw46gTDE
ZPUXTuJ00VPgX07wbLRahg521Fwzr+8sk1WnVYq82PoaMh1l9FjzLNGouQWBdo3E
UzIo/KUfQKmoZce6O723L6OI4ffdK5oMtfaTpe+SiUPpV1lUAcA=
=jNlu
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"This release introduces support for the CB_RECALL_ANY operation. NFSD
can send this operation to request that clients return any delegations
they choose. The server uses this operation to handle low memory
scenarios or indicate to a client when that client has reached the
maximum number of delegations the server supports.
The NFSv4.2 READ_PLUS operation has been simplified temporarily whilst
support for sparse files in local filesystems and the VFS is improved.
Two major data structure fixes appear in this release:
- The nfs4_file hash table is replaced with a resizable hash table to
reduce the latency of NFSv4 OPEN operations.
- Reference counting in the NFSD filecache has been hardened against
races.
In furtherance of removing support for NFSv2 in a subsequent kernel
release, a new Kconfig option enables server-side support for NFSv2 to
be left out of a kernel build.
MAINTAINERS has been updated to indicate that changes to fs/exportfs
should go through the NFSD tree"
* tag 'nfsd-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (49 commits)
NFSD: Avoid clashing function prototypes
SUNRPC: Fix crasher in unwrap_integ_data()
SUNRPC: Make the svc_authenticate tracepoint conditional
NFSD: Use only RQ_DROPME to signal the need to drop a reply
SUNRPC: Clean up xdr_write_pages()
SUNRPC: Don't leak netobj memory when gss_read_proxy_verf() fails
NFSD: add CB_RECALL_ANY tracepoints
NFSD: add delegation reaper to react to low memory condition
NFSD: add support for sending CB_RECALL_ANY
NFSD: refactoring courtesy_client_reaper to a generic low memory shrinker
trace: Relocate event helper files
NFSD: pass range end to vfs_fsync_range() instead of count
lockd: fix file selection in nlmsvc_cancel_blocked
lockd: ensure we use the correct file descriptor when unlocking
lockd: set missing fl_flags field when retrieving args
NFSD: Use struct_size() helper in alloc_session()
nfsd: return error if nfs4_setacl fails
lockd: set other missing fields when unlocking files
NFSD: Add an nfsd_file_fsync tracepoint
sunrpc: svc: Remove an unused static function svc_ungetu32()
...
KCSAN reported a race between writing req->status in p9_client_cb and
accessing it in p9_client_rpc's wait_event.
Accesses to req itself is protected by the data barrier (writing req
fields, write barrier, writing status // reading status, read barrier,
reading other req fields), but status accesses themselves apparently
also must be annotated properly with WRITE_ONCE/READ_ONCE when we
access it without locks.
Follows:
- error paths writing status in various threads all can notify
p9_client_rpc, so these all also need WRITE_ONCE
- there's a similar read loop in trans_virtio for zc case that also
needs READ_ONCE
- other reads in trans_fd should be protected by the trans_fd lock and
lists state machine, as corresponding writers all are within trans_fd
and should be under the same lock. If KCSAN complains on them we likely
will have something else to fix as well, so it's better to leave them
unmarked and look again if required.
Link: https://lkml.kernel.org/r/20221205124756.426350-1-asmadeus@codewreck.org
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Suggested-by: Marco Elver <elver@google.com>
Acked-by: Marco Elver <elver@google.com>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
direction misannotations and (hopefully) preventing
more of the same for the future.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
iHQEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCY5ZzQAAKCRBZ7Krx/gZQ
65RZAP4nTkvOn0NZLVFkuGOx8pgJelXAvrteyAuecVL8V6CR4AD40qCVY51PJp8N
MzwiRTeqnGDxTTF7mgd//IB6hoatAA==
=bcvF
-----END PGP SIGNATURE-----
Merge tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
"iov_iter work; most of that is about getting rid of direction
misannotations and (hopefully) preventing more of the same for the
future"
* tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
use less confusing names for iov_iter direction initializers
iov_iter: saner checks for attempt to copy to/from iterator
[xen] fix "direction" argument of iov_iter_kvec()
[vhost] fix 'direction' argument of iov_iter_{init,bvec}()
[target] fix iov_iter_bvec() "direction" argument
[s390] memcpy_real(): WRITE is "data source", not destination...
[s390] zcore: WRITE is "data source", not destination...
[infiniband] READ is "data destination", not source...
[fsi] WRITE is "data source", not destination...
[s390] copy_oldmem_kernel() - WRITE is "data source", not destination
csum_and_copy_to_iter(): handle ITER_DISCARD
get rid of unlikely() on page_copy_sane() calls
This KUnit next update for Linux 6.2-rc1 consists of several enhancements,
fixes, clean-ups, documentation updates, improvements to logging and KTAP
compliance of KUnit test output:
- log numbers in decimal and hex
- parse KTAP compliant test output
- allow conditionally exposing static symbols to tests
when KUNIT is enabled
- make static symbols visible during kunit testing
- clean-ups to remove unused structure definition
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmOXnPYACgkQCwJExA0N
Qxwf9RAAwdBKxgPZuKZ40v69Jm8YhaO3vyKUkyYRH59/HQGFUHMA2f2ONez4krEX
iXPgBFQ+7pB63FdgQi2HSg2z/u3xY02AaGgZGXDuNJDmg2xYjNDfZ0GjN6tuavlN
Liz01DGZkjZoVVXM6oV2xT8woBg/0BbdkKNL1OBO9RBZFHzwDryRzfXmQb8cKlNr
S+tkeZTlCA/s7UW2LNj4VlTzn6wgni4Y9gSk4wbQmSGWn3OX3rHaqAb7GiZ/yPGb
1WjbMeE8FwyydLU40aOZZ8V6AJRiw5VGPJyFzWJyWZ21xOgN9Z95b+I36z8RXraA
i/wnazO/FJsrhzvKL83rQkrSW6bpmVY+jGvk+L6deFM6Ro/vEWHJ4DgyKsIdMiJy
gUM1Q69szptq+ZRHGrZWPlVONBkBXMOL+fePbCbGcMzlaEAS/zsFYW9IBKcvLzwP
uHzzMS/cMmSUq52ZIyl9jhHQFVSoErCpJwQjAaZBQpYXPmE7yLcZItxnCaSUQTay
bRwyps5ph5md0oJTTFJKZ4Zx5FJ2ItjbC4y9BIexb9gYRDdRq723ivDoVENZl/Zk
DFIV95AY+mSxadS5vFagwWwX0ZN0KFKxeM8Tw7VTimal/0Sbglqp+oflsuKFD6JQ
b5HUixYifKMbWxkH5xrUb8NdjmBj561TYa8U4N+j3oOiaPYu5Ss=
=UQNn
-----END PGP SIGNATURE-----
Merge tag 'linux-kselftest-kunit-next-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull KUnit updates from Shuah Khan:
"Several enhancements, fixes, clean-ups, documentation updates,
improvements to logging and KTAP compliance of KUnit test output:
- log numbers in decimal and hex
- parse KTAP compliant test output
- allow conditionally exposing static symbols to tests when KUNIT is
enabled
- make static symbols visible during kunit testing
- clean-ups to remove unused structure definition"
* tag 'linux-kselftest-kunit-next-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (29 commits)
Documentation: dev-tools: Clarify requirements for result description
apparmor: test: make static symbols visible during kunit testing
kunit: add macro to allow conditionally exposing static symbols to tests
kunit: tool: make parser preserve whitespace when printing test log
Documentation: kunit: Fix "How Do I Use This" / "Next Steps" sections
kunit: tool: don't include KTAP headers and the like in the test log
kunit: improve KTAP compliance of KUnit test output
kunit: tool: parse KTAP compliant test output
mm: slub: test: Use the kunit_get_current_test() function
kunit: Use the static key when retrieving the current test
kunit: Provide a static key to check if KUnit is actively running tests
kunit: tool: make --json do nothing if --raw_ouput is set
kunit: tool: tweak error message when no KTAP found
kunit: remove KUNIT_INIT_MEM_ASSERTION macro
Documentation: kunit: Remove redundant 'tips.rst' page
Documentation: KUnit: reword description of assertions
Documentation: KUnit: make usage.rst a superset of tips.rst, remove duplication
kunit: eliminate KUNIT_INIT_*_ASSERT_STRUCT macros
kunit: tool: remove redundant file.close() call in unit test
kunit: tool: unit tests all check parser errors, standardize formatting a bit
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmOU+U8ACgkQSfxwEqXe
A67NnQ//Y5DltmvibyPd7r1TFT2gUYv+Rx3sUV9ZE1NYptd/SWhhcL8c5FZ70Fuw
bSKCa1uiWjOxosjXT1kGrWq3de7q7oUpAPSOGxgxzoaNURIt58N/ajItCX/4Au8I
RlGAScHy5e5t41/26a498kB6qJ441fBEqCYKQpPLINMBAhe8TQ+NVp0rlpUwNHFX
WrUGg4oKWxdBIW3HkDirQjJWDkkAiklRTifQh/Al4b6QDbOnRUGGCeckNOhixsvS
waHWTld+Td8jRrA4b82tUb2uVZ2/b8dEvj/A8CuTv4yC0lywoyMgBWmJAGOC+UmT
ZVNdGW02Jc2T+Iap8ZdsEmeLHNqbli4+IcbY5xNlov+tHJ2oz41H9TZoYKbudlr6
/ReAUPSn7i50PhbQlEruj3eg+M2gjOeh8OF8UKwwRK8PghvyWQ1ScW0l3kUhPIhI
PdIG6j4+D2mJc1FIj2rTVB+Bg933x6S+qx4zDxGlNp62AARUFYf6EgyD6aXFQVuX
RxcKb6cjRuFkzFiKc8zkqg5edZH+IJcPNuIBmABqTGBOxbZWURXzIQvK/iULqZa4
CdGAFIs6FuOh8pFHLI3R4YoHBopbHup/xKDEeAO9KZGyeVIuOSERDxxo5f/ITzcq
APvT77DFOEuyvanr8RMqqh0yUjzcddXqw9+ieufsAyDwjD9DTuE=
=QRhK
-----END PGP SIGNATURE-----
Merge tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random
Pull random number generator updates from Jason Donenfeld:
- Replace prandom_u32_max() and various open-coded variants of it,
there is now a new family of functions that uses fast rejection
sampling to choose properly uniformly random numbers within an
interval:
get_random_u32_below(ceil) - [0, ceil)
get_random_u32_above(floor) - (floor, U32_MAX]
get_random_u32_inclusive(floor, ceil) - [floor, ceil]
Coccinelle was used to convert all current users of
prandom_u32_max(), as well as many open-coded patterns, resulting in
improvements throughout the tree.
I'll have a "late" 6.1-rc1 pull for you that removes the now unused
prandom_u32_max() function, just in case any other trees add a new
use case of it that needs to converted. According to linux-next,
there may be two trivial cases of prandom_u32_max() reintroductions
that are fixable with a 's/.../.../'. So I'll have for you a final
conversion patch doing that alongside the removal patch during the
second week.
This is a treewide change that touches many files throughout.
- More consistent use of get_random_canary().
- Updates to comments, documentation, tests, headers, and
simplification in configuration.
- The arch_get_random*_early() abstraction was only used by arm64 and
wasn't entirely useful, so this has been replaced by code that works
in all relevant contexts.
- The kernel will use and manage random seeds in non-volatile EFI
variables, refreshing a variable with a fresh seed when the RNG is
initialized. The RNG GUID namespace is then hidden from efivarfs to
prevent accidental leakage.
These changes are split into random.c infrastructure code used in the
EFI subsystem, in this pull request, and related support inside of
EFISTUB, in Ard's EFI tree. These are co-dependent for full
functionality, but the order of merging doesn't matter.
- Part of the infrastructure added for the EFI support is also used for
an improvement to the way vsprintf initializes its siphash key,
replacing an sleep loop wart.
- The hardware RNG framework now always calls its correct random.c
input function, add_hwgenerator_randomness(), rather than sometimes
going through helpers better suited for other cases.
- The add_latent_entropy() function has long been called from the fork
handler, but is a no-op when the latent entropy gcc plugin isn't
used, which is fine for the purposes of latent entropy.
But it was missing out on the cycle counter that was also being mixed
in beside the latent entropy variable. So now, if the latent entropy
gcc plugin isn't enabled, add_latent_entropy() will expand to a call
to add_device_randomness(NULL, 0), which adds a cycle counter,
without the absent latent entropy variable.
- The RNG is now reseeded from a delayed worker, rather than on demand
when used. Always running from a worker allows it to make use of the
CPU RNG on platforms like S390x, whose instructions are too slow to
do so from interrupts. It also has the effect of adding in new inputs
more frequently with more regularity, amounting to a long term
transcript of random values. Plus, it helps a bit with the upcoming
vDSO implementation (which isn't yet ready for 6.2).
- The jitter entropy algorithm now tries to execute on many different
CPUs, round-robining, in hopes of hitting even more memory latencies
and other unpredictable effects. It also will mix in a cycle counter
when the entropy timer fires, in addition to being mixed in from the
main loop, to account more explicitly for fluctuations in that timer
firing. And the state it touches is now kept within the same cache
line, so that it's assured that the different execution contexts will
cause latencies.
* tag 'random-6.2-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (23 commits)
random: include <linux/once.h> in the right header
random: align entropy_timer_state to cache line
random: mix in cycle counter when jitter timer fires
random: spread out jitter callback to different CPUs
random: remove extraneous period and add a missing one in comments
efi: random: refresh non-volatile random seed when RNG is initialized
vsprintf: initialize siphash key using notifier
random: add back async readiness notifier
random: reseed in delayed work rather than on-demand
random: always mix cycle counter in add_latent_entropy()
hw_random: use add_hwgenerator_randomness() for early entropy
random: modernize documentation comment on get_random_bytes()
random: adjust comment to account for removed function
random: remove early archrandom abstraction
random: use random.trust_{bootloader,cpu} command line option only
stackprotector: actually use get_random_canary()
stackprotector: move get_random_canary() into stackprotector.h
treewide: use get_random_u32_inclusive() when possible
treewide: use get_random_u32_{above,below}() instead of manual loop
treewide: use get_random_u32_below() instead of deprecated function
...
IPv6/TCP and GRO stacks can build big TCP packets with an added
temporary Hop By Hop header.
Is GSO is not involved, then the temporary header needs to be removed in
the driver. This patch provides a generic helper for drivers that need
to modify their headers in place.
Tested:
Compiled and ran with ethtool -K eth1 tso off
Could send Big TCP packets
Signed-off-by: Coco Li <lixiaoyan@google.com>
Link: https://lore.kernel.org/r/20221210041646.3587757-1-lixiaoyan@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Now that user space can specify additional attributes of port group
entries such as filter mode and source list, it makes sense to allow
user space to atomically modify these attributes by replacing entries
instead of forcing user space to delete the entries and add them back.
Replace MDB port group entries when the 'NLM_F_REPLACE' flag is
specified in the netlink message header.
When a (*, G) entry is replaced, update the following attributes: Source
list, state, filter mode, protocol and flags. If the entry is temporary
and in EXCLUDE mode, reset the group timer to the group membership
interval. If the entry is temporary and in INCLUDE mode, reset the
source timers of associated sources to the group membership interval.
Examples:
# bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent source_list 192.0.2.1,192.0.2.2 filter_mode include
# bridge -d -s mdb show
dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.2 permanent filter_mode include proto static 0.00
dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto static 0.00
dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode include source_list 192.0.2.2/0.00,192.0.2.1/0.00 proto static 0.00
# bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 permanent source_list 192.0.2.1,192.0.2.3 filter_mode exclude proto zebra
# bridge -d -s mdb show
dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 permanent filter_mode include proto zebra blocked 0.00
dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto zebra blocked 0.00
dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude source_list 192.0.2.3/0.00,192.0.2.1/0.00 proto zebra 0.00
# bridge mdb replace dev br0 port dummy10 grp 239.1.1.1 temp source_list 192.0.2.4,192.0.2.3 filter_mode include proto bgp
# bridge -d -s mdb show
dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.4 temp filter_mode include proto bgp 0.00
dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.3 temp filter_mode include proto bgp 0.00
dev br0 port dummy10 grp 239.1.1.1 temp filter_mode include source_list 192.0.2.4/259.44,192.0.2.3/259.44 proto bgp 0.00
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add the 'MDBE_ATTR_RTPORT' attribute to allow user space to specify the
routing protocol of the MDB port group entry. Enforce a minimum value of
'RTPROT_STATIC' to prevent user space from using protocol values that
should only be set by the kernel (e.g., 'RTPROT_KERNEL'). Maintain
backward compatibility by defaulting to 'RTPROT_STATIC'.
The protocol is already visible to user space in RTM_NEWMDB responses
and notifications via the 'MDBA_MDB_EATTR_RTPROT' attribute.
The routing protocol allows a routing daemon to distinguish between
entries configured by it and those configured by the administrator. Once
MDB flush is supported, the protocol can be used as a criterion
according to which the flush is performed.
Examples:
# bridge mdb add dev br0 port dummy10 grp 239.1.1.1 permanent proto kernel
Error: integer out of range.
# bridge mdb add dev br0 port dummy10 grp 239.1.1.1 permanent proto static
# bridge mdb add dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent proto zebra
# bridge mdb add dev br0 port dummy10 grp 239.1.1.2 permanent source_list 198.51.100.1,198.51.100.2 filter_mode include proto 250
# bridge -d mdb show
dev br0 port dummy10 grp 239.1.1.2 src 198.51.100.2 permanent filter_mode include proto 250
dev br0 port dummy10 grp 239.1.1.2 src 198.51.100.1 permanent filter_mode include proto 250
dev br0 port dummy10 grp 239.1.1.2 permanent filter_mode include source_list 198.51.100.2/0.00,198.51.100.1/0.00 proto 250
dev br0 port dummy10 grp 239.1.1.1 src 192.0.2.1 permanent filter_mode include proto zebra
dev br0 port dummy10 grp 239.1.1.1 permanent filter_mode exclude proto static
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add new netlink attributes to the RTM_NEWMDB request that allow user
space to add (*, G) with a source list and filter mode.
The RTM_NEWMDB message can already dump such entries (created by the
kernel) so there is no need to add dump support. However, the message
contains a different set of attributes depending if it is a request or a
response. The naming and structure of the new attributes try to follow
the existing ones used in the response.
Request:
[ struct nlmsghdr ]
[ struct br_port_msg ]
[ MDBA_SET_ENTRY ]
struct br_mdb_entry
[ MDBA_SET_ENTRY_ATTRS ]
[ MDBE_ATTR_SOURCE ]
struct in_addr / struct in6_addr
[ MDBE_ATTR_SRC_LIST ] // new
[ MDBE_SRC_LIST_ENTRY ]
[ MDBE_SRCATTR_ADDRESS ]
struct in_addr / struct in6_addr
[ ...]
[ MDBE_ATTR_GROUP_MODE ] // new
u8
Response:
[ struct nlmsghdr ]
[ struct br_port_msg ]
[ MDBA_MDB ]
[ MDBA_MDB_ENTRY ]
[ MDBA_MDB_ENTRY_INFO ]
struct br_mdb_entry
[ MDBA_MDB_EATTR_TIMER ]
u32
[ MDBA_MDB_EATTR_SOURCE ]
struct in_addr / struct in6_addr
[ MDBA_MDB_EATTR_RTPROT ]
u8
[ MDBA_MDB_EATTR_SRC_LIST ]
[ MDBA_MDB_SRCLIST_ENTRY ]
[ MDBA_MDB_SRCATTR_ADDRESS ]
struct in_addr / struct in6_addr
[ MDBA_MDB_SRCATTR_TIMER ]
u8
[...]
[ MDBA_MDB_EATTR_GROUP_MODE ]
u8
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In preparation for allowing user space to add (*, G) entries with a
source list and associated filter mode, add the necessary plumbing to
handle such requests.
Extend the MDB configuration structure with a currently empty source
array and filter mode that is currently hard coded to EXCLUDE.
Add the source entries and the corresponding (S, G) entries before
making the new (*, G) port group entry visible to the data path.
Handle the creation of each source entry in a similar fashion to how it
is created from the data path in response to received Membership
Reports: Create the source entry, arm the source timer (if needed), add
a corresponding (S, G) forwarding entry and finally mark the source
entry as installed (by user space).
Add the (S, G) entry by populating an MDB configuration structure and
calling br_mdb_add_group_sg() as if a new entry is created by user
space, with the sole difference that the 'src_entry' field is set to
make sure that the group timer of such entries is never armed.
Note that it is not currently possible to add more than 32 source
entries to a port group entry. If this proves to be a problem we can
either increase 'PG_SRC_ENT_LIMIT' or avoid forcing a limit on entries
created by user space.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
User space will soon be able to install a (*, G) with a source list,
prompting the creation of a (S, G) entry for each source.
In this case, the group timer of the (S, G) entry should never be set.
Solve this by adding a new field to the MDB configuration structure that
denotes whether the (S, G) corresponds to a source or not.
The field will be set in a subsequent patch where br_mdb_add_group_sg()
is called in order to create a (S, G) entry for each user provided
source.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
There are a few places where the bridge driver differentiates between
(S, G) entries installed by the kernel (in response to Membership
Reports) and those installed by user space. One of them is when deleting
an (S, G) entry corresponding to a source entry that is being deleted.
While user space cannot currently add a source entry to a (*, G), it can
add an (S, G) entry that later corresponds to a source entry created by
the reception of a Membership Report. If this source entry is later
deleted because its source timer expired or because the (*, G) entry is
being deleted, the bridge driver will not delete the corresponding (S,
G) entry if it was added by user space as permanent.
This is going to be a problem when the ability to install a (*, G) with
a source list is exposed to user space. In this case, when user space
installs the (*, G) as permanent, then all the (S, G) entries
corresponding to its source list will also be installed as permanent.
When user space deletes the (*, G), all the source entries will be
deleted and the expectation is that the corresponding (S, G) entries
will be deleted as well.
Solve this by introducing a new source entry flag denoting that the
entry was installed by user space. When the entry is deleted, delete the
corresponding (S, G) entry even if it was installed by user space as
permanent, as the flag tells us that it was installed in response to the
source entry being created.
The flag will be set in a subsequent patch where source entries are
created in response to user requests.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Expose __br_multicast_del_group_src() which is symmetric to
br_multicast_new_group_src() and does not remove the installed {S, G}
forwarding entry, unlike br_multicast_del_group_src().
The function will be used in the error path when user space was able to
add a new source entry, but failed to install a corresponding forwarding
entry.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, new group source entries are only created in response to
received Membership Reports. Subsequent patches are going to allow user
space to install (*, G) entries with a source list.
As a preparatory step, expose br_multicast_new_group_src() so that it
could later be invoked from the MDB code (i.e., br_mdb.c) that handles
RTM_NEWMDB messages.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Subsequent patches will add memory allocations in br_mdb_config_init()
as the MDB configuration structure will include a linked list of source
entries. This memory will need to be freed regardless if br_mdb_add()
succeeded or failed.
As a preparation for this change, add a centralized error path where the
memory will be freed.
Note that br_mdb_del() already has one error path and therefore does not
require any changes.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Subsequent patches are going to add additional validation functions and
netlink policies. Some of these functions will need to perform parsing
using nla_parse_nested() and the new policies.
In order to keep all the policies next to each other, move the current
policy to before the validation functions.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the bridge is using IGMP version 3 or MLD version 2, it handles the
addition of (*, G) and (S, G) entries differently.
When a new (S, G) port group entry is added, all the (*, G) EXCLUDE
ports need to be added to the port group of the new entry. Similarly,
when a new (*, G) EXCLUDE port group entry is added, the port needs to
be added to the port group of all the matching (S, G) entries.
Subsequent patches will create more differences between both entry
types. Namely, filter mode and source list can only be specified for (*,
G) entries.
Given the current and future differences between both entry types,
handle the addition of each entry type in a different function, thereby
avoiding the creation of one complex function.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, the filter mode (i.e., INCLUDE / EXCLUDE) of MDB entries
cannot be set from user space. Instead, it is set by the kernel
according to the entry type: (*, G) entries are treated as EXCLUDE and
(S, G) entries are treated as INCLUDE. This allows the kernel to derive
the entry type from its filter mode.
Subsequent patches will allow user space to set the filter mode of (*,
G) entries, making the current assumption incorrect.
As a preparation, remove the current assumption and instead determine
the entry type from its key, which is a more direct way.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If dsa_tag_8021q_setup() fails, for example due to the inability of the
device to install a VLAN, the tag_8021q context of the switch will leak.
Make sure it is freed on the error path.
Fixes: 328621f613 ("net: dsa: tag_8021q: absorb dsa_8021q_setup into dsa_tag_8021q_{,un}register")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20221209235242.480344-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Similar to Bonding and Team, to prevent ipv6 addrconf with
IFF_NO_ADDRCONF in slave_dev->priv_flags for slave ports
is also needed in net failover.
Note that dev_open(slave_dev) is called in .slave_register,
which is called after the IFF_NO_ADDRCONF flag is set in
failover_slave_register().
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, in bonding it reused the IFF_SLAVE flag and checked it
in ipv6 addrconf to prevent ipv6 addrconf.
However, it is not a proper flag to use for no ipv6 addrconf, for
bonding it has to move IFF_SLAVE flag setting ahead of dev_open()
in bond_enslave(). Also, IFF_MASTER/SLAVE are historical flags
used in bonding and eql, as Jiri mentioned, the new devices like
Team, Failover do not use this flag.
So as Jiri suggested, this patch adds IFF_NO_ADDRCONF in priv_flags
of the device to indicate no ipv6 addconf, and uses it in bonding
and moves IFF_SLAVE flag setting back to its original place.
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tso_count_descs() is a small function doing simple calculation,
and tso_count_descs() is used in fast path, so inline it to
reduce the overhead of calls.
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Link: https://lore.kernel.org/r/20221212032426.16050-1-linyunsheng@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ptp_classify_raw() is not exactly cheap, since it invokes a BPF program
for every skb in the receive path. For switches which do not provide
ds->ops->port_rxtstamp(), running ptp_classify_raw() provides precisely
nothing, so check for the presence of the function pointer first, since
that is much cheaper.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Link: https://lore.kernel.org/r/20221209175840.390707-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- Add a new VID/PID 0489/e0f2 for MT7922
- Add Realtek RTL8852BE support ID 0x0cb8:0xc559
- Add a new PID/VID 13d3/3549 for RTL8822CU
- Add support for broadcom BCM43430A0 & BCM43430A1
- Add CONFIG_BT_HCIBTUSB_POLL_SYNC
- Add CONFIG_BT_LE_L2CAP_ECRED
- Add support for CYW4373A0
- Add support for RTL8723DS
- Add more device IDs for WCN6855
- Add Broadcom BCM4377 family PCIe Bluetooth
-----BEGIN PGP SIGNATURE-----
iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmOXqQYZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKa4LEACMpjtWb7IhfSt42FC+p7sf
gXLZ1GOEZnP3x+rJUXfHRdVfRvkpABixnEFuqbzkCZLQMy6W01abtBJ0xC7mADIu
CKvef5IjUdEQJWbbAE5ZwFPxdWbWjoBnO39crl9W1hlpxatLy0QEFfC0SqZRpiiZ
Jo2BZ8g6eFP5l1GARhW+B1idjxaXI7JIg41s43PFFHOK/TkfpQxCVCeg4bv36wTD
Pp14oixAmJKXsxQoozWJcY9zLy6/6KEedfBzj0mKrOG5/v0O8YyXKoe2T0yjx1d8
t/VSV+XquYiyJ6FA9FsW8spS5GFxGjuEvKagGn11t+in9GIb8t1ve4oAL+y43nC8
br/Sx8FVbFRWu5UA0Xv4issEMp/s/1Zty7U15CHiaFwv6VNZuyXDDLZVzvgUULLl
3ikBrM1JG/AKdoHiIjwmv4qLpteZ7WWOqb3qT3cPMdnUaGBd9fMJ17qCRYs8roW7
9QoGcHUfsPUUAiM6OyBz5L2HbfZuZTdVWvPVqoeaQ7LEmGeqKe4KnGeep0fVpVe8
viKGtLbpi4+S/2IKQvEVvulmsXOved2uJef/x+leH8HShPDceQU9c+TG8RgxOygy
f8dhAgn5wV+AgaKzLDJj2xT3xb+0j9jctTWX5SDjvHUPmu23bcNcsA0jv+dknMFh
VhOZTE+Qze08ZVGfnRT4uw==
=Oj2B
-----END PGP SIGNATURE-----
Merge tag 'for-net-next-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Luiz Augusto von Dentz says:
====================
bluetooth-next pull request for net-next:
- Add a new VID/PID 0489/e0f2 for MT7922
- Add Realtek RTL8852BE support ID 0x0cb8:0xc559
- Add a new PID/VID 13d3/3549 for RTL8822CU
- Add support for broadcom BCM43430A0 & BCM43430A1
- Add CONFIG_BT_HCIBTUSB_POLL_SYNC
- Add CONFIG_BT_LE_L2CAP_ECRED
- Add support for CYW4373A0
- Add support for RTL8723DS
- Add more device IDs for WCN6855
- Add Broadcom BCM4377 family PCIe Bluetooth
* tag 'for-net-next-2022-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (51 commits)
Bluetooth: Wait for HCI_OP_WRITE_AUTH_PAYLOAD_TO to complete
Bluetooth: ISO: Avoid circular locking dependency
Bluetooth: RFCOMM: don't call kfree_skb() under spin_lock_irqsave()
Bluetooth: hci_core: don't call kfree_skb() under spin_lock_irqsave()
Bluetooth: hci_bcsp: don't call kfree_skb() under spin_lock_irqsave()
Bluetooth: hci_h5: don't call kfree_skb() under spin_lock_irqsave()
Bluetooth: hci_ll: don't call kfree_skb() under spin_lock_irqsave()
Bluetooth: hci_qca: don't call kfree_skb() under spin_lock_irqsave()
Bluetooth: btusb: don't call kfree_skb() under spin_lock_irqsave()
Bluetooth: btintel: Fix missing free skb in btintel_setup_combined()
Bluetooth: hci_conn: Fix crash on hci_create_cis_sync
Bluetooth: btintel: Fix existing sparce warnings
Bluetooth: btusb: Fix existing sparce warning
Bluetooth: btusb: Fix new sparce warnings
Bluetooth: btusb: Add a new PID/VID 13d3/3549 for RTL8822CU
Bluetooth: btusb: Add Realtek RTL8852BE support ID 0x0cb8:0xc559
dt-bindings: net: realtek-bluetooth: Add RTL8723DS
Bluetooth: btusb: Add a new VID/PID 0489/e0f2 for MT7922
dt-bindings: bluetooth: broadcom: add BCM43430A0 & BCM43430A1
Bluetooth: hci_bcm4377: Fix missing pci_disable_device() on error in bcm4377_probe()
...
====================
Link: https://lore.kernel.org/r/20221212222322.1690780-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
1) Incorrect error check in nft_expr_inner_parse(), from Dan Carpenter.
2) Add DATA_SENT state to SCTP connection tracking helper, from
Sriram Yagnaraman.
3) Consolidate nf_confirm for ipv4 and ipv6, from Florian Westphal.
4) Add bitmask support for ipset, from Vishwanath Pai.
5) Handle icmpv6 redirects as RELATED, from Florian Westphal.
6) Add WARN_ON_ONCE() to impossible case in flowtable datapath,
from Li Qiong.
7) A large batch of IPVS updates to replace timer-based estimators by
kthreads to scale up wrt. CPUs and workload (millions of estimators).
Julian Anastasov says:
This patchset implements stats estimation in kthread context.
It replaces the code that runs on single CPU in timer context every 2
seconds and causing latency splats as shown in reports [1], [2], [3].
The solution targets setups with thousands of IPVS services,
destinations and multi-CPU boxes.
Spread the estimation on multiple (configured) CPUs and multiple
time slots (timer ticks) by using multiple chains organized under RCU
rules. When stats are not needed, it is recommended to use
run_estimation=0 as already implemented before this change.
RCU Locking:
- As stats are now RCU-locked, tot_stats, svc and dest which
hold estimator structures are now always freed from RCU
callback. This ensures RCU grace period after the
ip_vs_stop_estimator() call.
Kthread data:
- every kthread works over its own data structure and all
such structures are attached to array. For now we limit
kthreads depending on the number of CPUs.
- even while there can be a kthread structure, its task
may not be running, eg. before first service is added or
while the sysctl var is set to an empty cpulist or
when run_estimation is set to 0 to disable the estimation.
- the allocated kthread context may grow from 1 to 50
allocated structures for timer ticks which saves memory for
setups with small number of estimators
- a task and its structure may be released if all
estimators are unlinked from its chains, leaving the
slot in the array empty
- every kthread data structure allows limited number
of estimators. Kthread 0 is also used to initially
calculate the max number of estimators to allow in every
chain considering a sub-100 microsecond cond_resched
rate. This number can be from 1 to hundreds.
- kthread 0 has an additional job of optimizing the
adding of estimators: they are first added in
temp list (est_temp_list) and later kthread 0
distributes them to other kthreads. The optimization
is based on the fact that newly added estimator
should be estimated after 2 seconds, so we have the
time to offload the adding to chain from controlling
process to kthread 0.
- to add new estimators we use the last added kthread
context (est_add_ktid). The new estimators are linked to
the chains just before the estimated one, based on add_row.
This ensures their estimation will start after 2 seconds.
If estimators are added in bursts, common case if all
services and dests are initially configured, we may
spread the estimators to more chains and as result,
reducing the initial delay below 2 seconds.
Many thanks to Jiri Wiesner for his valuable comments
and for spending a lot of time reviewing and testing
the changes on different platforms with 48-256 CPUs and
1-8 NUMA nodes under different cpufreq governors.
The new IPVS estimators do not use workqueue infrastructure
because:
- The estimation can take long time when using multiple IPVS rules (eg.
millions estimator structures) and especially when box has multiple
CPUs due to the for_each_possible_cpu usage that expects packets from
any CPU. With est_nice sysctl we have more control how to prioritize the
estimation kthreads compared to other processes/kthreads that have
latency requirements (such as servers). As a benefit, we can see these
kthreads in top and decide if we will need some further control to limit
their CPU usage (max number of structure to estimate per kthread).
- with kthreads we run code that is read-mostly, no write/lock
operations to process the estimators in 2-second intervals.
- work items are one-shot: as estimators are processed every
2 seconds, they need to be re-added every time. This again
loads the timers (add_timer) if we use delayed works, as there are
no kthreads to do the timings.
[1] Report from Yunhong Jiang:
https://lore.kernel.org/netdev/D25792C1-1B89-45DE-9F10-EC350DC04ADC@gmail.com/
[2] https://marc.info/?l=linux-virtual-server&m=159679809118027&w=2
[3] Report from Dust:
https://archive.linuxvirtualserver.org/html/lvs-devel/2020-12/msg00000.html
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
ipvs: run_estimation should control the kthread tasks
ipvs: add est_cpulist and est_nice sysctl vars
ipvs: use kthreads for stats estimation
ipvs: use u64_stats_t for the per-cpu counters
ipvs: use common functions for stats allocation
ipvs: add rcu protection to stats
netfilter: flowtable: add a 'default' case to flowtable datapath
netfilter: conntrack: set icmpv6 redirects as RELATED
netfilter: ipset: Add support for new bitmask parameter
netfilter: conntrack: merge ipv4+ipv6 confirm functions
netfilter: conntrack: add sctp DATA_SENT state
netfilter: nft_inner: fix IS_ERR() vs NULL check
====================
Link: https://lore.kernel.org/r/20221211101204.1751-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This make sure HCI_OP_WRITE_AUTH_PAYLOAD_TO completes before notifying
the encryption change just as is done with HCI_OP_READ_ENC_KEY_SIZE.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
It is not allowed to call kfree_skb() from hardware interrupt
context or with interrupts being disabled. So replace kfree_skb()
with dev_kfree_skb_irq() under spin_lock_irqsave().
Fixes: 81be03e026 ("Bluetooth: RFCOMM: Replace use of memcpy_from_msg with bt_skb_sendmmsg")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
It is not allowed to call kfree_skb() from hardware interrupt
context or with interrupts being disabled. So replace kfree_skb()
with dev_kfree_skb_irq() under spin_lock_irqsave().
Fixes: 9238f36a5a ("Bluetooth: Add request cmd_complete and cmd_status functions")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When attempting to connect multiple ISO sockets without using
DEFER_SETUP may result in the following crash:
BUG: KASAN: null-ptr-deref in hci_create_cis_sync+0x18b/0x2b0
Read of size 2 at addr 0000000000000036 by task kworker/u3:1/50
CPU: 0 PID: 50 Comm: kworker/u3:1 Not tainted
6.0.0-rc7-02243-gb84a13ff4eda #4373
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS 1.16.0-1.fc36 04/01/2014
Workqueue: hci0 hci_cmd_sync_work
Call Trace:
<TASK>
dump_stack_lvl+0x19/0x27
kasan_report+0xbc/0xf0
? hci_create_cis_sync+0x18b/0x2b0
hci_create_cis_sync+0x18b/0x2b0
? get_link_mode+0xd0/0xd0
? __ww_mutex_lock_slowpath+0x10/0x10
? mutex_lock+0xe0/0xe0
? get_link_mode+0xd0/0xd0
hci_cmd_sync_work+0x111/0x190
process_one_work+0x427/0x650
worker_thread+0x87/0x750
? process_one_work+0x650/0x650
kthread+0x14e/0x180
? kthread_exit+0x50/0x50
ret_from_fork+0x22/0x30
</TASK>
Fixes: 26afbd826e ("Bluetooth: Add initial implementation of CIS connections")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Broadcom controllers present on Apple Silicon devices use the upper
8 bits of the event type in the LE Extended Advertising Report for
the channel on which the frame has been received.
These bits are reserved according to the Bluetooth spec anyway such that
we can just drop them to ensure that the advertising results are parsed
correctly.
The following excerpt from a btmon trace shows a report received on
channel 37 by these controllers:
> HCI Event: LE Meta Event (0x3e) plen 55
LE Extended Advertising Report (0x0d)
Num reports: 1
Entry 0
Event type: 0x2513
Props: 0x0013
Connectable
Scannable
Use legacy advertising PDUs
Data status: Complete
Reserved (0x2500)
Legacy PDU Type: Reserved (0x2513)
Address type: Public (0x00)
Address: XX:XX:XX:XX:XX:XX (Shenzhen Jingxun Software [...])
Primary PHY: LE 1M
Secondary PHY: No packets
SID: no ADI field (0xff)
TX power: 127 dBm
RSSI: -76 dBm (0xb4)
Periodic advertising interval: 0.00 msec (0x0000)
Direct address type: Public (0x00)
Direct address: 00:00:00:00:00:00 (OUI 00-00-00)
Data length: 0x1d
[...]
Flags: 0x18
Simultaneous LE and BR/EDR (Controller)
Simultaneous LE and BR/EDR (Host)
Company: Harman International Industries, Inc. (87)
Data: [...]
Service Data (UUID 0xfddf):
Name (complete): JBL Flip 5
Signed-off-by: Sven Peter <sven@svenpeter.dev>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Replace kmalloc+memset by kzalloc
for better readability and simplicity.
This addresses the cocci warning below:
WARNING: kzalloc should be used for d, instead of kmalloc/memset
Signed-off-by: Kang Minchul <tegongkang@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
'err' is known to be <0 at this point.
So, some cases can not be reached because of a missing "-".
Add it.
Fixes: ca2045e059 ("Bluetooth: Add bt_status")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This adds CONFIG_BT_LE_L2CAP_ECRED which can be used to enable L2CAP
Enhanced Credit Flow Control Mode by default, previously it was only
possible to set it via module parameter (e.g. bluetooth.enable_ecred=1).
Since L2CAP ECRED mode is required by the likes of EATT which is
recommended for LE Audio this enables it by default.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Tested-By: Tedd Ho-Jeong An <tedd.an@intel.com>
When validating the parameter length for MGMT_OP_ADD_EXT_ADV_PARAMS
command, use the correct op code in error status report:
was MGMT_OP_ADD_ADVERTISING, changed to MGMT_OP_ADD_EXT_ADV_PARAMS.
Fixes: 1241057283 ("Bluetooth: Break add adv into two mgmt commands")
Signed-off-by: Inga Stotland <inga.stotland@intel.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
If hci_register_suspend_notifier() returns error, the hdev and rfkill
are leaked. We could disregard the error and print a warning message
instead to avoid leaks, as it just means we won't be handing suspend
requests.
Fixes: 9952d90ea2 ("Bluetooth: Handle PM_SUSPEND_PREPARE and PM_POST_SUSPEND")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Use kzalloc rather than duplicating its implementation, which makes code
simple and easy to understand.
./net/bluetooth/hci_conn.c:2038:6-13: WARNING: kzalloc should be used for cp, instead of kmalloc/memset.
Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2406
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
For ISO BIS related functions in hci_conn.c, make dst_type values be HCI
address type values, not ISO socket address type values. This makes it
consistent with CIS functions.
Signed-off-by: Pauli Virtanen <pav@iki.fi>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
If a command is already sent, we take care of freeing it, but we
also need to cancel the timeout as well.
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@google.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
force_static_address shall be writable while hdev is initing but is not
considered powered yet since the static address is written only when
powered.
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Tested-by: Brian Gix <brian.gix@intel.com>