Abort path release flow rule object, however, commit path does not.
Update code to destroy these objects before releasing the transaction.
Fixes: c9626a2cbd ("netfilter: nf_tables: add hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Release the list of new hooks that are pending to be registered in case
that unsupported flowtable flags are provided.
Fixes: 78d9f48f7f ("netfilter: nf_tables: add devices to existing flowtable")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The hook list is used if nft_trans_flowtable_update(trans) == true. However,
initialize this list for other cases for safety reasons.
Fixes: 78d9f48f7f ("netfilter: nf_tables: add devices to existing flowtable")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Remove inactive bool field in nft_hook object that was introduced in
abadb2f865 ("netfilter: nf_tables: delete devices from flowtable").
Move stale flowtable hooks to transaction list instead.
Deleting twice the same device does not result in ENOENT.
Fixes: abadb2f865 ("netfilter: nf_tables: delete devices from flowtable")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Use kfree_rcu(ptr, rcu) variant instead as described by ae089831ff
("netfilter: nf_tables: prefer kfree_rcu(ptr, rcu) variant").
Fixes: f9a43007d3 ("netfilter: nf_tables: double hook unregistration in netns path")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When no l3 address is given, priv->family is set to NFPROTO_INET and
the evaluation function isn't called.
Call it too so l4-only rewrite can work.
Also add a test case for this.
Fixes: a33f387ecd ("netfilter: nft_nat: allow to specify layer 4 protocol NAT only")
Reported-by: Yi Chen <yiche@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Laurent reported the enclosed report [1]
This bug triggers with following coditions:
0) Kernel built with CONFIG_DEBUG_PREEMPT=y
1) A new passive FastOpen TCP socket is created.
This FO socket waits for an ACK coming from client to be a complete
ESTABLISHED one.
2) A socket operation on this socket goes through lock_sock()
release_sock() dance.
3) While the socket is owned by the user in step 2),
a retransmit of the SYN is received and stored in socket backlog.
4) At release_sock() time, the socket backlog is processed while
in process context.
5) A SYNACK packet is cooked in response of the SYN retransmit.
6) -> tcp_rtx_synack() is called in process context.
Before blamed commit, tcp_rtx_synack() was always called from BH handler,
from a timer handler.
Fix this by using TCP_INC_STATS() & NET_INC_STATS()
which do not assume caller is in non preemptible context.
[1]
BUG: using __this_cpu_add() in preemptible [00000000] code: epollpep/2180
caller is tcp_rtx_synack.part.0+0x36/0xc0
CPU: 10 PID: 2180 Comm: epollpep Tainted: G OE 5.16.0-0.bpo.4-amd64 #1 Debian 5.16.12-1~bpo11+1
Hardware name: Supermicro SYS-5039MC-H8TRF/X11SCD-F, BIOS 1.7 11/23/2021
Call Trace:
<TASK>
dump_stack_lvl+0x48/0x5e
check_preemption_disabled+0xde/0xe0
tcp_rtx_synack.part.0+0x36/0xc0
tcp_rtx_synack+0x8d/0xa0
? kmem_cache_alloc+0x2e0/0x3e0
? apparmor_file_alloc_security+0x3b/0x1f0
inet_rtx_syn_ack+0x16/0x30
tcp_check_req+0x367/0x610
tcp_rcv_state_process+0x91/0xf60
? get_nohz_timer_target+0x18/0x1a0
? lock_timer_base+0x61/0x80
? preempt_count_add+0x68/0xa0
tcp_v4_do_rcv+0xbd/0x270
__release_sock+0x6d/0xb0
release_sock+0x2b/0x90
sock_setsockopt+0x138/0x1140
? __sys_getsockname+0x7e/0xc0
? aa_sk_perm+0x3e/0x1a0
__sys_setsockopt+0x198/0x1e0
__x64_sys_setsockopt+0x21/0x30
do_syscall_64+0x38/0xc0
entry_SYSCALL_64_after_hwframe+0x44/0xae
Fixes: 168a8f5805 ("tcp: TCP Fast Open Server - main code path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Laurent Fasnacht <laurent.fasnacht@proton.ch>
Acked-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20220530213713.601888-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Missing proper sanitization for nft_set_desc_concat_parse().
2) Missing mutex in nf_tables pre_exit path.
3) Possible double hook unregistration from clean_net path.
4) Missing FLOWI_FLAG_ANYSRC flag in flowtable route lookup.
Fix incorrect source and destination address in case of NAT.
Patch from wenxu.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: flowtable: fix nft_flow_route source address for nat case
netfilter: flowtable: fix missing FLOWI_FLAG_ANYSRC flag
netfilter: nf_tables: double hook unregistration in netns path
netfilter: nf_tables: hold mutex on netns pre_exit path
netfilter: nf_tables: sanitize nft_set_desc_concat_parse()
====================
Link: https://lore.kernel.org/r/20220531215839.84765-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In qdisc_run_end(), the spin_unlock() only has store-release semantic,
which guarantees all earlier memory access are visible before it. But
the subsequent test_bit() has no barrier semantics so may be reordered
ahead of the spin_unlock(). The store-load reordering may cause a packet
stuck problem.
The concurrent operations can be described as below,
CPU 0 | CPU 1
qdisc_run_end() | qdisc_run_begin()
. | .
----> /* may be reorderd here */ | .
| . | .
| spin_unlock() | set_bit()
| . | smp_mb__after_atomic()
---- test_bit() | spin_trylock()
. | .
Consider the following sequence of events:
CPU 0 reorder test_bit() ahead and see MISSED = 0
CPU 1 calls set_bit()
CPU 1 calls spin_trylock() and return fail
CPU 0 executes spin_unlock()
At the end of the sequence, CPU 0 calls spin_unlock() and does nothing
because it see MISSED = 0. The skb on CPU 1 has beed enqueued but no one
take it, until the next cpu pushing to the qdisc (if ever ...) will
notice and dequeue it.
This patch fix this by adding one explicit barrier. As spin_unlock() and
test_bit() ordering is a store-load ordering, a full memory barrier
smp_mb() is needed here.
Fixes: a90c57f2ce ("net: sched: fix packet stuck problem for lockless qdisc")
Signed-off-by: Guoju Fang <gjfang@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220528101628.120193-1-gjfang@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
For snat and dnat cases, the saddr should be taken from reverse tuple.
Fixes: 3412e16418 (netfilter: flowtable: nft_flow_route use more data for reverse route)
Signed-off-by: wenxu <wenxu@chinatelecom.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The nf_flow_table gets route through ip_route_output_key. If the saddr
is not local one, then FLOWI_FLAG_ANYSRC flag should be set. Without
this flag, the route lookup for other_dst will fail.
Fixes: 3412e16418 (netfilter: flowtable: nft_flow_route use more data for reverse route)
Signed-off-by: wenxu <wenxu@chinatelecom.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
__nft_release_hooks() is called from pre_netns exit path which
unregisters the hooks, then the NETDEV_UNREGISTER event is triggered
which unregisters the hooks again.
[ 565.221461] WARNING: CPU: 18 PID: 193 at net/netfilter/core.c:495 __nf_unregister_net_hook+0x247/0x270
[...]
[ 565.246890] CPU: 18 PID: 193 Comm: kworker/u64:1 Tainted: G E 5.18.0-rc7+ #27
[ 565.253682] Workqueue: netns cleanup_net
[ 565.257059] RIP: 0010:__nf_unregister_net_hook+0x247/0x270
[...]
[ 565.297120] Call Trace:
[ 565.300900] <TASK>
[ 565.304683] nf_tables_flowtable_event+0x16a/0x220 [nf_tables]
[ 565.308518] raw_notifier_call_chain+0x63/0x80
[ 565.312386] unregister_netdevice_many+0x54f/0xb50
Unregister and destroy netdev hook from netns pre_exit via kfree_rcu
so the NETDEV_UNREGISTER path see unregistered hooks.
Fixes: 767d1216bf ("netfilter: nftables: fix possible UAF over chains from packet path in netns")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
clean_net() runs in workqueue while walking over the lists, grab mutex.
Fixes: 767d1216bf ("netfilter: nftables: fix possible UAF over chains from packet path in netns")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Add several sanity checks for nft_set_desc_concat_parse():
- validate desc->field_count not larger than desc->field_len array.
- field length cannot be larger than desc->field_len (ie. U8_MAX)
- total length of the concatenation cannot be larger than register array.
Joint work with Florian Westphal.
Fixes: f3a2181e16 ("netfilter: nf_tables: Support for sets with multiple ranged fields")
Reported-by: <zhangziming.zzm@antgroup.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Commit 6fac592cca ("xen: update ring.h") missed to fix one use case
of RING_HAS_UNCONSUMED_REQUESTS().
Reported-by: Jan Beulich <jbeulich@suse.com>
Fixes: 6fac592cca ("xen: update ring.h")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Link: https://lore.kernel.org/r/20220530113459.20124-1-jgross@suse.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
RFC 9131 changes default behaviour of handling RX of NA messages when the
corresponding entry is absent in the neighbour cache. The current
implementation is limited to accept just unsolicited NAs. However, the
RFC is more generic where it also accepts solicited NAs. Both types
should result in adding a STALE entry for this case.
Expand accept_untracked_na behaviour to also accept solicited NAs to
be compliant with the RFC and rename the sysctl knob to
accept_untracked_na.
Fixes: f9a2fb7331 ("net/ipv6: Introduce accept_unsolicited_na knob to implement router-side changes for RFC9131")
Signed-off-by: Arun Ajith S <aajith@arista.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220530101414.65439-1-aajith@arista.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When adding bond new parameter ns_targets. I forgot to print this
in bond master proc info. After updating, the bond master info will look
like:
ARP IP target/s (n.n.n.n form): 192.168.1.254
NS IPv6 target/s (XX::XX form): 2022::1, 2022::2
Fixes: 4e24be018e ("bonding: add new parameter ns_targets")
Reported-by: Li Liang <liali@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20220530062639.37179-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Before 7beecaf7d5 ("net: phy: at803x: improve the WOL feature") patch
"at803x_get_wol" implementation used AT803X_INTR_ENABLE_WOL value to set
WAKE_MAGIC flag, and now AT803X_WOL_EN value is used for the same purpose.
The problem here is that the values of these two bits are different after
hardware reset: AT803X_INTR_ENABLE_WOL=0 after hardware reset, but
AT803X_WOL_EN=1. So now, if called right after boot, "at803x_get_wol" will
set WAKE_MAGIC flag, even if WOL function is not enabled by calling
"at803x_set_wol" function. The patch disables WOL function on probe thus
the behavior is consistent.
Fixes: 7beecaf7d5 ("net: phy: at803x: improve the WOL feature")
Signed-off-by: Viorel Suman <viorel.suman@nxp.com>
Link: https://lore.kernel.org/r/20220527084935.235274-1-viorel.suman@oss.nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Fix the following build warning when CONFIG_IPV6 is not set:
In function ‘fortify_memcpy_chk’,
inlined from ‘tcp_md5_do_add’ at net/ipv4/tcp_ipv4.c:1210:2:
./include/linux/fortify-string.h:328:4: error: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror=attribute-warning]
328 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: huhai <huhai@kylinos.cn>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220526101213.2392980-1-zhanggenjian@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Íñigo Huguet says:
====================
sfc: fix some efx_separate_tx_channels errors
Trying to load sfc driver with modparam efx_separate_tx_channels=1
resulted in errors during initialization and not being able to use the
NIC. This patches fix a few bugs and make it work again.
v2:
* added Martin's patch instead of a previous mine. Mine one solved some
of the initialization errors, but Martin's solves them also in all
possible cases.
* removed whitespaces cleanup, as requested by Jakub
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
tx_channel_offset is calculated in efx_allocate_msix_channels, but it is
also calculated again in efx_set_channels because it was originally done
there, and when efx_allocate_msix_channels was introduced it was
forgotten to be removed from efx_set_channels.
Moreover, the old calculation is wrong when using
efx_separate_tx_channels because now we can have XDP channels after the
TX channels, so n_channels - n_tx_channels doesn't point to the first TX
channel.
Remove the old calculation from efx_set_channels, and add the
initialization of this variable if MSI or legacy interrupts are used,
next to the initialization of the rest of the related variables, where
it was missing.
Fixes: 3990a8fffb ("sfc: allocate channels for XDP tx queues")
Reported-by: Tianhao Zhao <tizhao@redhat.com>
Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Normally, all channels have RX and TX queues, but this is not true if
modparam efx_separate_tx_channels=1 is used. In that cases, some
channels only have RX queues and others only TX queues (or more
preciselly, they have them allocated, but not initialized).
Fix efx_channel_has_tx_queues to return the correct value for this case
too.
Messages shown at probe time before the fix:
sfc 0000:03:00.0 ens6f0np0: MC command 0x82 inlen 544 failed rc=-22 (raw=0) arg=0
------------[ cut here ]------------
netdevice: ens6f0np0: failed to initialise TXQ -1
WARNING: CPU: 1 PID: 626 at drivers/net/ethernet/sfc/ef10.c:2393 efx_ef10_tx_init+0x201/0x300 [sfc]
[...] stripped
RIP: 0010:efx_ef10_tx_init+0x201/0x300 [sfc]
[...] stripped
Call Trace:
efx_init_tx_queue+0xaa/0xf0 [sfc]
efx_start_channels+0x49/0x120 [sfc]
efx_start_all+0x1f8/0x430 [sfc]
efx_net_open+0x5a/0xe0 [sfc]
__dev_open+0xd0/0x190
__dev_change_flags+0x1b3/0x220
dev_change_flags+0x21/0x60
[...] stripped
Messages shown at remove time before the fix:
sfc 0000:03:00.0 ens6f0np0: failed to flush 10 queues
sfc 0000:03:00.0 ens6f0np0: failed to flush queues
Fixes: 8700aff089 ("sfc: fix channel allocation with brute force")
Reported-by: Tianhao Zhao <tizhao@redhat.com>
Signed-off-by: Martin Habets <habetsm.xilinx@gmail.com>
Tested-by: Íñigo Huguet <ihuguet@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some resources are allocated using pci_request_region().
It is more straightforward to release them with pci_release_region().
Fixes: 231ece36f5 ("enetc: Add mdio bus driver for the PCIe MDIO endpoint")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When setting bond NS target, we use bond_is_ip6_target_ok() to check
if the address valid. The link local address was wrongly rejected in
bond_changelink(), as most time the user just set the ARP/NS target to
gateway, while the IPv6 gateway is always a link local address when user
set up interface via SLAAC.
So remove the link local addr check when setting bond NS target.
Fixes: 129e3c1bab ("bonding: add new option ns_ip6_target")
Reported-by: Li Liang <liali@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Jonathan Toppins <jtoppins@redhat.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use ida_alloc()/ida_free() instead of deprecated
ida_simple_get()/ida_simple_remove() .
Signed-off-by: keliu <liuke94@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only report pause frame configuration for physical device. Logical
port of both PCI PF and PCI VF do not support it.
Fixes: 9fdc5d85a8 ("nfp: update ethtool reporting of pauseframe control")
Signed-off-by: Yu Xiao <yu.xiao@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This converts these files to use SPDX idenfifiers instead of license
text.
Signed-off-by: Sean Anderson <sean.anderson@seco.com>
Reviewed-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use ida_alloc()/ida_free() instead of deprecated
ida_simple_get()/ida_simple_remove().
Signed-off-by: Ke Liu <liuke94@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"struct smc_cdc_tx_pend **" can not directly convert
to "struct smc_wr_tx_pend_priv *".
Fixes: 2bced6aefa ("net/smc: put slot when connection is killed")
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder says:
====================
net: ipa: fix page free in two spots
When a receive buffer is not wrapped in an SKB and passed to the
network stack, the (compound) page gets freed within the IPA driver.
This is currently quite rare.
The pages are freed using __free_pages(), but they should instead be
freed using page_put(). This series fixes this, in two spots.
These patches work for the current linus/master branch, but won't
apply cleanly to earlier stable branches. (Nevertheless, the fix is
a trivial substitution everwhere __free_pages() is called.)
====================
Link: https://lore.kernel.org/r/20220526152314.1405629-1-elder@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently the (possibly compound) pages used for receive buffers are
freed using __free_pages(). But according to this comment above the
definition of that function, that's wrong:
If you want to use the page's reference count to decide
when to free the allocation, you should allocate a compound
page, and use put_page() instead of __free_pages().
Convert the call to __free_pages() in ipa_endpoint_replenish_one()
to use put_page() instead.
Fixes: 6a606b9015 ("net: ipa: allocate transaction in replenish loop")
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently the (possibly compound) page used for receive buffers are
freed using __free_pages(). But according to this comment above the
definition of that function, that's wrong:
If you want to use the page's reference count to decide when
to free the allocation, you should allocate a compound page,
and use put_page() instead of __free_pages().
Convert the call to __free_pages() in ipa_endpoint_trans_release()
to use put_page() instead.
Fixes: ed23f02680 ("net: ipa: define per-endpoint receive buffer size")
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf 2022-05-28
We've added 2 non-merge commits during the last 1 day(s) which contain
a total of 2 files changed, 6 insertions(+), 10 deletions(-).
The main changes are:
1) Fix ldx_probe_mem instruction in interpreter by properly zero-extending
the bpf_probe_read_kernel() read content, from Menglong Dong.
2) Fix stacktrace_build_id BPF selftest given urandom_read has been renamed
into urandom_read_iter in random driver, from Song Liu.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Fix probe read error in ___bpf_prog_run()
selftests/bpf: fix stacktrace_build_id with missing kprobe/urandom_read
====================
Link: https://lore.kernel.org/r/20220527235042.8526-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
I think there is something wrong with BPF_PROBE_MEM in ___bpf_prog_run()
in big-endian machine. Let's make a test and see what will happen if we
want to load a 'u16' with BPF_PROBE_MEM.
Let's make the src value '0x0001', the value of dest register will become
0x0001000000000000, as the value will be loaded to the first 2 byte of
DST with following code:
bpf_probe_read_kernel(&DST, SIZE, (const void *)(long) (SRC + insn->off));
Obviously, the value in DST is not correct. In fact, we can compare
BPF_PROBE_MEM with LDX_MEM_H:
DST = *(SIZE *)(unsigned long) (SRC + insn->off);
If the memory load is done by LDX_MEM_H, the value in DST will be 0x1 now.
And I think this error results in the test case 'test_bpf_sk_storage_map'
failing:
test_bpf_sk_storage_map:PASS:bpf_iter_bpf_sk_storage_map__open_and_load 0 nsec
test_bpf_sk_storage_map:PASS:socket 0 nsec
test_bpf_sk_storage_map:PASS:map_update 0 nsec
test_bpf_sk_storage_map:PASS:socket 0 nsec
test_bpf_sk_storage_map:PASS:map_update 0 nsec
test_bpf_sk_storage_map:PASS:socket 0 nsec
test_bpf_sk_storage_map:PASS:map_update 0 nsec
test_bpf_sk_storage_map:PASS:attach_iter 0 nsec
test_bpf_sk_storage_map:PASS:create_iter 0 nsec
test_bpf_sk_storage_map:PASS:read 0 nsec
test_bpf_sk_storage_map:FAIL:ipv6_sk_count got 0 expected 3
$10/26 bpf_iter/bpf_sk_storage_map:FAIL
The code of the test case is simply, it will load sk->sk_family to the
register with BPF_PROBE_MEM and check if it is AF_INET6. With this patch,
now the test case 'bpf_iter' can pass:
$10 bpf_iter:OK
Fixes: 2a02759ef5 ("bpf: Add support for BTF pointers to interpreter")
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/bpf/20220524021228.533216-1-imagedong@tencent.com
Kernel function urandom_read is replaced with urandom_read_iter.
Therefore, kprobe on urandom_read is not working any more:
[root@eth50-1 bpf]# ./test_progs -n 161
test_stacktrace_build_id:PASS:skel_open_and_load 0 nsec
libbpf: kprobe perf_event_open() failed: No such file or directory
libbpf: prog 'oncpu': failed to create kprobe 'urandom_read+0x0' \
perf event: No such file or directory
libbpf: prog 'oncpu': failed to auto-attach: -2
test_stacktrace_build_id:FAIL:attach_tp err -2
161 stacktrace_build_id:FAIL
Fix this by replacing urandom_read with urandom_read_iter in the test.
Fixes: 1b388e7765 ("random: convert to using fops->read_iter()")
Reported-by: Mykola Lysenko <mykolal@fb.com>
Signed-off-by: Song Liu <song@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20220526191608.2364049-1-song@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Add support for Telit LN910Cx 0x1250 composition
0x1250: rmnet, tty, tty, tty, tty
Signed-off-by: Carlo Lobrano <c.lobrano@gmail.com>
Acked-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the MDIO interface declarations to reflect what is currently supported by
the PCI11010 / PCI11414 devices (C22 for RGMII and C22_C45 for SGMII)
Signed-off-by: Raju Lakkaraju <Raju.Lakkaraju@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following contain more Netfilter fixes for net:
1) syzbot warning in nfnetlink bind, from Florian.
2) Refetch conntrack after __nf_conntrack_confirm(), from Florian Westphal.
3) Move struct nf_ct_timeout back at the bottom of the ctnl_time, to
where it before recent update, also from Florian.
4) Add NL_SET_BAD_ATTR() to nf_tables netlink for proper set element
commands error reporting.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
syzbot reports:
BUG: KASAN: slab-out-of-bounds in __list_del_entry_valid+0xcc/0xf0 lib/list_debug.c:42
[..]
list_del include/linux/list.h:148 [inline]
cttimeout_net_exit+0x211/0x540 net/netfilter/nfnetlink_cttimeout.c:617
No reproducer so far. Looking at recent changes in this area
its clear that the free_head must not be at the end of the
structure because nf_ct_timeout structure has variable size.
Reported-by: <syzbot+92968395eedbdbd3617d@syzkaller.appspotmail.com>
Fixes: 78222bacfc ("netfilter: cttimeout: decouple unlink and free on netns destruction")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In case the conntrack is clashing, insertion can free skb->_nfct and
set skb->_nfct to the already-confirmed entry.
This wasn't found before because the conntrack entry and the extension
space used to free'd after an rcu grace period, plus the race needs
events enabled to trigger.
Reported-by: <syzbot+793a590957d9c1b96620@syzkaller.appspotmail.com>
Fixes: 71d8c47fc6 ("netfilter: conntrack: introduce clash resolution on insertion race")
Fixes: 2ad9d7747c ("netfilter: conntrack: free extension area immediately")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
syzbot reports following warn:
WARNING: CPU: 0 PID: 3600 at net/netfilter/nfnetlink.c:703 nfnetlink_unbind+0x357/0x3b0 net/netfilter/nfnetlink.c:694
The syzbot generated program does this:
socket(AF_NETLINK, SOCK_RAW, NETLINK_NETFILTER) = 3
setsockopt(3, SOL_NETLINK, NETLINK_DROP_MEMBERSHIP, [1], 4) = 0
... which triggers 'WARN_ON_ONCE(nfnlnet->ctnetlink_listeners == 0)' check.
Instead of counting, just enable reporting for every bind request
and check if we still have listeners on unbind.
While at it, also add the needed bounds check on nfnl_group2type[]
access.
Reported-by: <syzbot+4903218f7fba0a2d6226@syzkaller.appspotmail.com>
Reported-by: <syzbot+afd2d80e495f96049571@syzkaller.appspotmail.com>
Fixes: 2794cdb0b9 ("netfilter: nfnetlink: allow to detect if ctnetlink listeners exist")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
of_get_child_by_name() returns a node pointer with refcount
incremented, we should use of_node_put() on it when done.
mv88e6xxx_mdio_register() pass the device node to of_mdiobus_register().
We don't need the device node after it.
Add missing of_node_put() to avoid refcount leak.
Fixes: a3c53be55c ("net: dsa: mv88e6xxx: Support multiple MDIO busses")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Reviewed-by: Marek Behún <kabel@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
of_get_child_by_name() returns a node pointer with refcount
incremented, we should use of_node_put() on it when not need anymore.
am65_cpsw_init_cpts() and am65_cpsw_nuss_probe() don't release
the refcount in error case.
Add missing of_node_put() to avoid refcount leak.
Fixes: b1f66a5bee ("net: ethernet: ti: am65-cpsw-nuss: enable packet timestamping support")
Fixes: 93a7653031 ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The "fsp->location" variable comes from user via ethtool_get_rxnfc().
Check that it is valid to prevent an out of bounds read.
Fixes: 7aab747e55 ("net: ethernet: mediatek: add ethtool functions to configure RX flows of HW LRO")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In qdisc_run_begin(), smp_mb__before_atomic() used before test_bit()
does not provide any ordering guarantee as test_bit() is not an atomic
operation. This, added to the fact that the spin_trylock() call at
the beginning of qdisc_run_begin() does not guarantee acquire
semantics if it does not grab the lock, makes it possible for the
following statement :
if (test_bit(__QDISC_STATE_MISSED, &qdisc->state))
to be executed before an enqueue operation called before
qdisc_run_begin().
As a result the following race can happen :
CPU 1 CPU 2
qdisc_run_begin() qdisc_run_begin() /* true */
set(MISSED) .
/* returns false */ .
. /* sees MISSED = 1 */
. /* so qdisc not empty */
. __qdisc_run()
. .
. pfifo_fast_dequeue()
----> /* may be done here */ .
| . clear(MISSED)
| . .
| . smp_mb __after_atomic();
| . .
| . /* recheck the queue */
| . /* nothing => exit */
| enqueue(skb1)
| .
| qdisc_run_begin()
| .
| spin_trylock() /* fail */
| .
| smp_mb__before_atomic() /* not enough */
| .
---- if (test_bit(MISSED))
return false; /* exit */
In the above scenario, CPU 1 and CPU 2 both try to grab the
qdisc->seqlock at the same time. Only CPU 2 succeeds and enters the
bypass code path, where it emits its skb then calls __qdisc_run().
CPU1 fails, sets MISSED and goes down the traditionnal enqueue() +
dequeue() code path. But when executing qdisc_run_begin() for the
second time, after enqueuing its skbuff, it sees the MISSED bit still
set (by itself) and consequently chooses to exit early without setting
it again nor trying to grab the spinlock again.
Meanwhile CPU2 has seen MISSED = 1, cleared it, checked the queue
and found it empty, so it returned.
At the end of the sequence, we end up with skb1 enqueued in the
backlog, both CPUs out of __dev_xmit_skb(), the MISSED bit not set,
and no __netif_schedule() called made. skb1 will now linger in the
qdisc until somebody later performs a full __qdisc_run(). Associated
to the bypass capacity of the qdisc, and the ability of the TCP layer
to avoid resending packets which it knows are still in the qdisc, this
can lead to serious traffic "holes" in a TCP connection.
We fix this by replacing the smp_mb__before_atomic() / test_bit() /
set_bit() / smp_mb__after_atomic() sequence inside qdisc_run_begin()
by a single test_and_set_bit() call, which is more concise and
enforces the needed memory barriers.
Fixes: 89837eb4b2 ("net: sched: add barrier to ensure correct ordering for lockless qdisc")
Signed-off-by: Vincent Ray <vray@kalrayinc.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220526001746.2437669-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
At the moment, if devm_of_phy_get() returns an error the serdes
simply isn't set. While it is bad to ignore an error in general, there
is a particular bug that network isn't working if the serdes driver is
compiled as a module. In that case, devm_of_phy_get() returns
-EDEFER_PROBE and the error is silently ignored.
The serdes is optional, it is not there if the port is using RGMII, in
which case devm_of_phy_get() returns -ENODEV. Rearrange the error
handling so that -ENODEV will be handled but other error codes will
abort the probing.
Fixes: d28d6d2e37 ("net: lan966x: add port module support")
Signed-off-by: Michael Walle <michael@walle.cc>
Link: https://lore.kernel.org/r/20220525231239.1307298-1-michael@walle.cc
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
1) Fix UAF when creating non-stateful expression in set.
2) Set limit cost when cloning expression accordingly, from Phil Sutter.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nft_limit: Clone packet limits' cost value
netfilter: nf_tables: disallow non-stateful expression in sets earlier
====================
Link: https://lore.kernel.org/r/20220526205411.315136-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>