Commit Graph

1185244 Commits

Author SHA1 Message Date
John Fastabend
901546fd8f bpf, sockmap: Handle fin correctly
The sockmap code is returning EAGAIN after a FIN packet is received and no
more data is on the receive queue. Correct behavior is to return 0 to the
user and the user can then close the socket. The EAGAIN causes many apps
to retry which masks the problem. Eventually the socket is evicted from
the sockmap because its released from sockmap sock free handling. The
issue creates a delay and can cause some errors on application side.

To fix this check on sk_msg_recvmsg side if length is zero and FIN flag
is set then set return to zero. A selftest will be added to check this
condition.

Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-6-john.fastabend@gmail.com
2023-05-23 16:10:18 +02:00
John Fastabend
405df89dd5 bpf, sockmap: Improved check for empty queue
We noticed some rare sk_buffs were stepping past the queue when system was
under memory pressure. The general theory is to skip enqueueing
sk_buffs when its not necessary which is the normal case with a system
that is properly provisioned for the task, no memory pressure and enough
cpu assigned.

But, if we can't allocate memory due to an ENOMEM error when enqueueing
the sk_buff into the sockmap receive queue we push it onto a delayed
workqueue to retry later. When a new sk_buff is received we then check
if that queue is empty. However, there is a problem with simply checking
the queue length. When a sk_buff is being processed from the ingress queue
but not yet on the sockmap msg receive queue its possible to also recv
a sk_buff through normal path. It will check the ingress queue which is
zero and then skip ahead of the pkt being processed.

Previously we used sock lock from both contexts which made the problem
harder to hit, but not impossible.

To fix instead of popping the skb from the queue entirely we peek the
skb from the queue and do the copy there. This ensures checks to the
queue length are non-zero while skb is being processed. Then finally
when the entire skb has been copied to user space queue or another
socket we pop it off the queue. This way the queue length check allows
bypassing the queue only after the list has been completely processed.

To reproduce issue we run NGINX compliance test with sockmap running and
observe some flakes in our testing that we attributed to this issue.

Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-5-john.fastabend@gmail.com
2023-05-23 16:10:11 +02:00
John Fastabend
bce22552f9 bpf, sockmap: Reschedule is now done through backlog
Now that the backlog manages the reschedule() logic correctly we can drop
the partial fix to reschedule from recvmsg hook.

Rescheduling on recvmsg hook was added to address a corner case where we
still had data in the backlog state but had nothing to kick it and
reschedule the backlog worker to run and finish copying data out of the
state. This had a couple limitations, first it required user space to
kick it introducing an unnecessary EBUSY and retry. Second it only
handled the ingress case and egress redirects would still be hung.

With the correct fix, pushing the reschedule logic down to where the
enomem error occurs we can drop this fix.

Fixes: bec217197b ("skmsg: Schedule psock work if the cached skb exists on the psock")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-4-john.fastabend@gmail.com
2023-05-23 16:10:04 +02:00
John Fastabend
29173d07f7 bpf, sockmap: Convert schedule_work into delayed_work
Sk_buffs are fed into sockmap verdict programs either from a strparser
(when the user might want to decide how framing of skb is done by attaching
another parser program) or directly through tcp_read_sock. The
tcp_read_sock is the preferred method for performance when the BPF logic is
a stream parser.

The flow for Cilium's common use case with a stream parser is,

 tcp_read_sock()
  sk_psock_verdict_recv
    ret = bpf_prog_run_pin_on_cpu()
    sk_psock_verdict_apply(sock, skb, ret)
     // if system is under memory pressure or app is slow we may
     // need to queue skb. Do this queuing through ingress_skb and
     // then kick timer to wake up handler
     skb_queue_tail(ingress_skb, skb)
     schedule_work(work);

The work queue is wired up to sk_psock_backlog(). This will then walk the
ingress_skb skb list that holds our sk_buffs that could not be handled,
but should be OK to run at some later point. However, its possible that
the workqueue doing this work still hits an error when sending the skb.
When this happens the skbuff is requeued on a temporary 'state' struct
kept with the workqueue. This is necessary because its possible to
partially send an skbuff before hitting an error and we need to know how
and where to restart when the workqueue runs next.

Now for the trouble, we don't rekick the workqueue. This can cause a
stall where the skbuff we just cached on the state variable might never
be sent. This happens when its the last packet in a flow and no further
packets come along that would cause the system to kick the workqueue from
that side.

To fix we could do simple schedule_work(), but while under memory pressure
it makes sense to back off some instead of continue to retry repeatedly. So
instead to fix convert schedule_work to schedule_delayed_work and add
backoff logic to reschedule from backlog queue on errors. Its not obvious
though what a good backoff is so use '1'.

To test we observed some flakes whil running NGINX compliance test with
sockmap we attributed these failed test to this bug and subsequent issue.

>From on list discussion. This commit

 bec217197b41("skmsg: Schedule psock work if the cached skb exists on the psock")

was intended to address similar race, but had a couple cases it missed.
Most obvious it only accounted for receiving traffic on the local socket
so if redirecting into another socket we could still get an sk_buff stuck
here. Next it missed the case where copied=0 in the recv() handler and
then we wouldn't kick the scheduler. Also its sub-optimal to require
userspace to kick the internal mechanisms of sockmap to wake it up and
copy data to user. It results in an extra syscall and requires the app
to actual handle the EAGAIN correctly.

Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-3-john.fastabend@gmail.com
2023-05-23 16:09:56 +02:00
John Fastabend
78fa0d61d9 bpf, sockmap: Pass skb ownership through read_skb
The read_skb hook calls consume_skb() now, but this means that if the
recv_actor program wants to use the skb it needs to inc the ref cnt
so that the consume_skb() doesn't kfree the sk_buff.

This is problematic because in some error cases under memory pressure
we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue().
Then we get this,

 skb_linearize()
   __pskb_pull_tail()
     pskb_expand_head()
       BUG_ON(skb_shared(skb))

Because we incremented users refcnt from sk_psock_verdict_recv() we
hit the bug on with refcnt > 1 and trip it.

To fix lets simply pass ownership of the sk_buff through the skb_read
call. Then we can drop the consume from read_skb handlers and assume
the verdict recv does any required kfree.

Bug found while testing in our CI which runs in VMs that hit memory
constraints rather regularly. William tested TCP read_skb handlers.

[  106.536188] ------------[ cut here ]------------
[  106.536197] kernel BUG at net/core/skbuff.c:1693!
[  106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[  106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1
[  106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014
[  106.537467] RIP: 0010:pskb_expand_head+0x269/0x330
[  106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202
[  106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20
[  106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8
[  106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000
[  106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8
[  106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8
[  106.540568] FS:  00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
[  106.540954] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0
[  106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  106.542255] Call Trace:
[  106.542383]  <IRQ>
[  106.542487]  __pskb_pull_tail+0x4b/0x3e0
[  106.542681]  skb_ensure_writable+0x85/0xa0
[  106.542882]  sk_skb_pull_data+0x18/0x20
[  106.543084]  bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9
[  106.543536]  ? migrate_disable+0x66/0x80
[  106.543871]  sk_psock_verdict_recv+0xe2/0x310
[  106.544258]  ? sk_psock_write_space+0x1f0/0x1f0
[  106.544561]  tcp_read_skb+0x7b/0x120
[  106.544740]  tcp_data_queue+0x904/0xee0
[  106.544931]  tcp_rcv_established+0x212/0x7c0
[  106.545142]  tcp_v4_do_rcv+0x174/0x2a0
[  106.545326]  tcp_v4_rcv+0xe70/0xf60
[  106.545500]  ip_protocol_deliver_rcu+0x48/0x290
[  106.545744]  ip_local_deliver_finish+0xa7/0x150

Fixes: 04919bed94 ("tcp: Introduce tcp_read_skb()")
Reported-by: William Findlay <will@isovalent.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: William Findlay <will@isovalent.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230523025618.113937-2-john.fastabend@gmail.com
2023-05-23 16:09:47 +02:00
Anton Protopopov
b34ffb0c6d bpf: fix a memory leak in the LRU and LRU_PERCPU hash maps
The LRU and LRU_PERCPU maps allocate a new element on update before locking the
target hash table bucket. Right after that the maps try to lock the bucket.
If this fails, then maps return -EBUSY to the caller without releasing the
allocated element. This makes the element untracked: it doesn't belong to
either of free lists, and it doesn't belong to the hash table, so can't be
re-used; this eventually leads to the permanent -ENOMEM on LRU map updates,
which is unexpected. Fix this by returning the element to the local free list
if bucket locking fails.

Fixes: 20b6cc34ea ("bpf: Avoid hashtab deadlock with map_locked")
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230522154558.2166815-1-aspsk@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-05-22 10:26:39 -07:00
Will Deacon
0613d8ca9a bpf: Fix mask generation for 32-bit narrow loads of 64-bit fields
A narrow load from a 64-bit context field results in a 64-bit load
followed potentially by a 64-bit right-shift and then a bitwise AND
operation to extract the relevant data.

In the case of a 32-bit access, an immediate mask of 0xffffffff is used
to construct a 64-bit BPP_AND operation which then sign-extends the mask
value and effectively acts as a glorified no-op. For example:

0:	61 10 00 00 00 00 00 00	r0 = *(u32 *)(r1 + 0)

results in the following code generation for a 64-bit field:

	ldr	x7, [x7]	// 64-bit load
	mov	x10, #0xffffffffffffffff
	and	x7, x7, x10

Fix the mask generation so that narrow loads always perform a 32-bit AND
operation:

	ldr	x7, [x7]	// 64-bit load
	mov	w10, #0xffffffff
	and	w7, w7, w10

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Krzesimir Nowak <krzesimir@kinvolk.io>
Cc: Andrey Ignatov <rdna@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Fixes: 31fd85816d ("bpf: permits narrower load from bpf program context fields")
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230518102528.1341-1-will@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-19 09:58:37 -07:00
Andrii Nakryiko
a820ca1a73 samples/bpf: Drop unnecessary fallthrough
__fallthrough is now not supported. Instead of renaming it to
now-canonical ([0]) fallthrough pseudo-keyword, just get rid of it and
equate 'h' case to default case, as both emit usage information and
succeed.

  [0] https://www.kernel.org/doc/html/latest/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20230516001718.317177-1-andrii@kernel.org
2023-05-16 19:44:05 +02:00
Jakub Kicinski
e1505c1cc8 bpf: netdev: init the offload table earlier
Some netdevices may get unregistered before late_initcall(),
we have to move the hashtable init earlier.

Fixes: f1fc43d039 ("bpf: Move offload initialization into late_initcall")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217399
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230505215836.491485-1-kuba@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-15 07:07:41 -07:00
Jeremy Sowden
5f5486b620 selftests/bpf: Fix pkg-config call building sign-file
When building sign-file, the call to get the CFLAGS for libcrypto is
missing white-space between `pkg-config` and `--cflags`:

  $(shell $(HOSTPKG_CONFIG)--cflags libcrypto 2> /dev/null)

Removing the redirection of stderr, we see:

  $ make -C tools/testing/selftests/bpf sign-file
  make: Entering directory '[...]/tools/testing/selftests/bpf'
  make: pkg-config--cflags: No such file or directory
    SIGN-FILE sign-file
  make: Leaving directory '[...]/tools/testing/selftests/bpf'

Add the missing space.

Fixes: fc97590668 ("selftests/bpf: Add test for bpf_verify_pkcs7_signature() kfunc")
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com>
Link: https://lore.kernel.org/bpf/20230426215032.415792-1-jeremy@azazel.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-05-14 14:59:44 -07:00
David S. Miller
b41caaded0 Merge branch 'hns3-fixes'
Hao Lan says:

====================
net: hns3: fix some bug for hns3

There are some bugfixes for the HNS3 ethernet driver. patch#1 fix miss
checking for rx packet. patch#2 fixes VF promisc mode not update
when mac table full bug, and patch#3 fixes a nterrupts not
initialization in VF FLR bug.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13 17:12:24 +01:00
Jijie Shao
6b45d5ff8c net: hns3: fix reset timeout when enable full VF
The timeout of the cmdq reset command has been increased to
resolve the reset timeout issue in the full VF scenario.
The timeout of other cmdq commands remains unchanged.

Fixes: 8d307f8e8c ("net: hns3: create new set of unified hclge_comm_cmd_send APIs")
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Hao Lan <lanhao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13 17:12:23 +01:00
Jie Wang
814d0c7860 net: hns3: fix reset delay time to avoid configuration timeout
Currently the hns3 vf function reset delays 5000ms before vf rebuild
process. In product applications, this delay is too long for application
configurations and causes configuration timeout.

According to the tests, 500ms delay is enough for reset process except PF
FLR. So this patch modifies delay to 500ms in these scenarios.

Fixes: 6988eb2a9b ("net: hns3: Add support to reset the enet/ring mgmt layer")
Signed-off-by: Jie Wang <wangjie125@huawei.com>
Signed-off-by: Hao Lan <lanhao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13 17:12:23 +01:00
Jijie Shao
f14db07064 net: hns3: fix sending pfc frames after reset issue
To prevent the system from abnormally sending PFC frames after an
abnormal reset. The hns3 driver notifies the firmware to disable pfc
before reset.

Fixes: 35d93a3004 ("net: hns3: adjust the process of PF reset")
Signed-off-by: Jijie Shao <shaojijie@huawei.com>
Signed-off-by: Hao Lan <lanhao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13 17:12:23 +01:00
Jie Wang
89f6bfb071 net: hns3: fix output information incomplete for dumping tx queue info with debugfs
In function hns3_dump_tx_queue_info, The print buffer is not enough when
the tx BD number is configured to 32760. As a result several BD
information wouldn't be displayed.

So fix it by increasing the tx queue print buffer length.

Fixes: 630a6738da ("net: hns3: adjust string spaces of some parameters of tx bd info in debugfs")
Signed-off-by: Jie Wang <wangjie125@huawei.com>
Signed-off-by: Hao Lan <lanhao@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13 17:12:23 +01:00
David S. Miller
843eb679d8 Merge branch 'dsa-rzn1-a5psw-stp'
Alexis Lothoré says:

====================
net: dsa: rzn1-a5psw: fix STP states handling

This small series fixes STP support and while adding a new function to
enable/disable learning, use that to disable learning on standalone ports
at switch setup as reported by Vladimir Oltean.

This series was initially submitted on net-next by Clement Leger, but some
career evolutions has made him hand me over those topics.
Also, this new revision is submitted on net instead of net-next for V1
based on Vladimir Oltean's suggestion

Changes since v2:
- fix commit split by moving A5PSW_MGMT_CFG_ENABLE in relevant commit
- fix reverse christmas tree ordering in a5psw_port_stp_state_set

Changes since v1:
- fix typos in commit messages and doc
- re-split STP states handling commit
- add Fixes: tag and new Signed-off-by
- submit series as fix on net instead of net-next
- split learning and blocking setting functions
- remove unused define A5PSW_PORT_ENA_TX_SHIFT
- add boolean for tx/rx enabled for clarity
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13 17:06:38 +01:00
Clément Léger
ec52b69c04 net: dsa: rzn1-a5psw: disable learning for standalone ports
When ports are in standalone mode, they should have learning disabled to
avoid adding new entries in the MAC lookup table which might be used by
other bridge ports to forward packets. While adding that, also make sure
learning is enabled for CPU port.

Fixes: 888cdb892b ("net: dsa: rzn1-a5psw: add Renesas RZ/N1 advanced 5 port switch driver")
Signed-off-by: Clément Léger <clement.leger@bootlin.com>
Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com>
Reviewed-by: Piotr Raczynski <piotr.raczynski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13 17:06:38 +01:00
Alexis Lothoré
ebe9bc5095 net: dsa: rzn1-a5psw: fix STP states handling
stp_set_state() should actually allow receiving BPDU while in LEARNING
mode which is not the case. Additionally, the BLOCKEN bit does not
actually forbid sending forwarded frames from that port. To fix this, add
a5psw_port_tx_enable() function which allows to disable TX. However, while
its name suggest that TX is totally disabled, it is not and can still
allow to send BPDUs even if disabled. This can be done by using forced
forwarding with the switch tagging mechanism but keeping "filtering"
disabled (which is already the case in the rzn1-a5sw tag driver). With
these fixes, STP support is now functional.

Fixes: 888cdb892b ("net: dsa: rzn1-a5psw: add Renesas RZ/N1 advanced 5 port switch driver")
Signed-off-by: Clément Léger <clement.leger@bootlin.com>
Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13 17:06:38 +01:00
Clément Léger
9e4b45f20c net: dsa: rzn1-a5psw: enable management frames for CPU port
Currently, management frame were discarded before reaching the CPU port due
to a misconfiguration of the MGMT_CONFIG register. Enable them by setting
the correct value in this register in order to correctly receive management
frame and handle STP.

Fixes: 888cdb892b ("net: dsa: rzn1-a5psw: add Renesas RZ/N1 advanced 5 port switch driver")
Signed-off-by: Clément Léger <clement.leger@bootlin.com>
Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com>
Reviewed-by: Piotr Raczynski <piotr.raczynski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13 17:06:38 +01:00
Xin Long
d80fc101d2 erspan: get the proto with the md version for collect_md
In commit 20704bd163 ("erspan: build the header with the right proto
according to erspan_ver"), it gets the proto with t->parms.erspan_ver,
but t->parms.erspan_ver is not used by collect_md branch, and instead
it should get the proto with md->version for collect_md.

Thanks to Kevin for pointing this out.

Fixes: 20704bd163 ("erspan: build the header with the right proto according to erspan_ver")
Fixes: 94d7d8f292 ("ip6_gre: add erspan v2 support")
Reported-by: Kevin Traynor <ktraynor@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13 16:58:58 +01:00
Eric Dumazet
1e306ec49a tcp: fix possible sk_priority leak in tcp_v4_send_reset()
When tcp_v4_send_reset() is called with @sk == NULL,
we do not change ctl_sk->sk_priority, which could have been
set from a prior invocation.

Change tcp_v4_send_reset() to set sk_priority and sk_mark
fields before calling ip_send_unicast_reply().

This means tcp_v4_send_reset() and tcp_v4_send_ack()
no longer have to clear ctl_sk->sk_mark after
their call to ip_send_unicast_reply().

Fixes: f6c0f5d209 ("tcp: honor SO_PRIORITY in TIME_WAIT state")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12 10:05:50 +01:00
Zhuang Shengen
6d4486efe9 vsock: avoid to close connected socket after the timeout
When client and server establish a connection through vsock,
the client send a request to the server to initiate the connection,
then start a timer to wait for the server's response. When the server's
RESPONSE message arrives, the timer also times out and exits. The
server's RESPONSE message is processed first, and the connection is
established. However, the client's timer also times out, the original
processing logic of the client is to directly set the state of this vsock
to CLOSE and return ETIMEDOUT. It will not notify the server when the port
is released, causing the server port remain.
when client's vsock_connect timeout,it should check sk state is
ESTABLISHED or not. if sk state is ESTABLISHED, it means the connection
is established, the client should not set the sk state to CLOSE

Note: I encountered this issue on kernel-4.18, which can be fixed by
this patch. Then I checked the latest code in the community
and found similar issue.

Fixes: d021c34405 ("VSOCK: Introduce VM Sockets")
Signed-off-by: Zhuang Shengen <zhuangshengen@huawei.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12 10:04:10 +01:00
Pieter Jansen van Vuuren
134120b066 sfc: disable RXFCS and RXALL features by default
By default we would not want RXFCS and RXALL features enabled as they are
mainly intended for debugging purposes. This does not stop users from
enabling them later on as needed.

Fixes: 8e57daf706 ("sfc_ef100: RX path for EF100")
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com>
Co-developed-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Martin Habets <habetsm.xilinx@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12 10:00:48 +01:00
Jan Sokolowski
9113302bb4 ice: Fix undersized tx_flags variable
As not all ICE_TX_FLAGS_* fit in current 16-bit limited
tx_flags field that was introduced in the Fixes commit,
VLAN-related information would be discarded completely.
As such, creating a vlan and trying to run ping through
would result in no traffic passing.

Fix that by refactoring tx_flags variable into flags only and
a separate variable that holds VLAN ID. As there is some space left,
type variable can fit between those two. Pahole reports no size
change to ice_tx_buf struct.

Fixes: aa1d3faf71 ("ice: Robustify cleaning/completing XDP Tx buffers")
Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12 09:27:44 +01:00
Jakub Kicinski
47af429171 MAINTAINERS: exclude wireless drivers from netdev
It seems that we mostly get netdev CCed on wireless patches
which are written by people who don't know any better and
CC everything that get_maintainers spits out. Rather than
patches which indeed could benefit from general networking
review.

Marking them down in patchwork as Awaiting Upstream is
a bit tedious.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Kalle Valo <kvalo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12 08:58:43 +01:00
Huayu Chen
de9c1a23ad nfp: fix NFP_NET_MAX_DSCP definition error
The patch corrects the NFP_NET_MAX_DSCP definition in the main.h file.

The incorrect definition result DSCP bits not being mapped properly when
DCB is set. When NFP_NET_MAX_DSCP was defined as 4, the next 60 DSCP
bits failed to be set.

Fixes: 9b7fe8046d ("nfp: add DCB IEEE support")
Cc: stable@vger.kernel.org
Signed-off-by: Huayu Chen <huayu.chen@corigine.com>
Acked-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12 08:55:47 +01:00
Jakub Kicinski
01e8f6cd10 MAINTAINERS: don't CC docs@ for netlink spec changes
Documentation/netlink/ contains machine-readable protocol
specs in YAML. Those are much like device tree bindings,
no point CCing docs@ for the changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
2023-05-12 08:53:17 +01:00
Marcelo Ricardo Leitner
d03a2f1762 MAINTAINERS: sctp: move Neil to CREDITS
Neil moved away from SCTP related duties.
Move him to CREDITS then and while at it, update SCTP
project website.

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12 08:51:32 +01:00
Grygorii Strashko
0b01db2740 net: phy: dp83867: add w/a for packet errors seen with short cables
Introduce the W/A for packet errors seen with short cables (<1m) between
two DP83867 PHYs.

The W/A recommended by DM requires FFE Equalizer Configuration tuning by
writing value 0x0E81 to DSP_FFE_CFG register (0x012C), surrounded by hard
and soft resets as follows:

write_reg(0x001F, 0x8000); //hard reset
write_reg(DSP_FFE_CFG, 0x0E81);
write_reg(0x001F, 0x4000); //soft reset

Since  DP83867 PHY DM says "Changing this register to 0x0E81, will not
affect Long Cable performance.", enable the W/A by default.

Fixes: 2a10154abc ("net: phy: dp83867: Add TI dp83867 phy")
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12 08:44:43 +01:00
Uwe Kleine-König
f816b9829b net: fec: Better handle pm_runtime_get() failing in .remove()
In the (unlikely) event that pm_runtime_get() (disguised as
pm_runtime_resume_and_get()) fails, the remove callback returned an
error early. The problem with this is that the driver core ignores the
error value and continues removing the device. This results in a
resource leak. Worse the devm allocated resources are freed and so if a
callback of the driver is called later the register mapping is already
gone which probably results in a crash.

Fixes: a31eda65ba ("net: fec: fix clock count mis-match")
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20230510200020.1534610-1-u.kleine-koenig@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-11 18:07:14 -07:00
Eric Dumazet
ef1148d448 ipv6: remove nexthop_fib6_nh_bh()
After blamed commit, nexthop_fib6_nh_bh() and nexthop_fib6_nh()
are the same.

Delete nexthop_fib6_nh_bh(), and convert /proc/net/ipv6_route
to standard rcu to avoid this splat:

[ 5723.180080] WARNING: suspicious RCU usage
[ 5723.180083] -----------------------------
[ 5723.180084] include/net/nexthop.h:516 suspicious rcu_dereference_check() usage!
[ 5723.180086]
other info that might help us debug this:

[ 5723.180087]
rcu_scheduler_active = 2, debug_locks = 1
[ 5723.180089] 2 locks held by cat/55856:
[ 5723.180091] #0: ffff9440a582afa8 (&p->lock){+.+.}-{3:3}, at: seq_read_iter (fs/seq_file.c:188)
[ 5723.180100] #1: ffffffffaac07040 (rcu_read_lock_bh){....}-{1:2}, at: rcu_lock_acquire (include/linux/rcupdate.h:326)
[ 5723.180109]
stack backtrace:
[ 5723.180111] CPU: 14 PID: 55856 Comm: cat Tainted: G S        I        6.3.0-dbx-DEV #528
[ 5723.180115] Call Trace:
[ 5723.180117]  <TASK>
[ 5723.180119] dump_stack_lvl (lib/dump_stack.c:107)
[ 5723.180124] dump_stack (lib/dump_stack.c:114)
[ 5723.180126] lockdep_rcu_suspicious (include/linux/context_tracking.h:122)
[ 5723.180132] ipv6_route_seq_show (include/net/nexthop.h:?)
[ 5723.180135] ? ipv6_route_seq_next (net/ipv6/ip6_fib.c:2605)
[ 5723.180140] seq_read_iter (fs/seq_file.c:272)
[ 5723.180145] seq_read (fs/seq_file.c:163)
[ 5723.180151] proc_reg_read (fs/proc/inode.c:316 fs/proc/inode.c:328)
[ 5723.180155] vfs_read (fs/read_write.c:468)
[ 5723.180160] ? up_read (kernel/locking/rwsem.c:1617)
[ 5723.180164] ksys_read (fs/read_write.c:613)
[ 5723.180168] __x64_sys_read (fs/read_write.c:621)
[ 5723.180170] do_syscall_64 (arch/x86/entry/common.c:?)
[ 5723.180174] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
[ 5723.180177] RIP: 0033:0x7fa455677d2a

Fixes: 09eed1192c ("neighbour: switch to standard rcu, instead of rcu_bh")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230510154646.370659-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-11 18:07:05 -07:00
Jiri Pirko
e93c9378e3 devlink: change per-devlink netdev notifier to static one
The commit 565b4824c3 ("devlink: change port event netdev notifier
from per-net to global") changed original per-net notifier to be
per-devlink instance. That fixed the issue of non-receiving events
of netdev uninit if that moved to a different namespace.
That worked fine in -net tree.

However, later on when commit ee75f1fc44 ("net/mlx5e: Create
separate devlink instance for ethernet auxiliary device") and
commit 72ed5d5624 ("net/mlx5: Suspend auxiliary devices only in
case of PCI device suspend") were merged, a deadlock was introduced
when removing a namespace with devlink instance with another nested
instance.

Here there is the bad flow example resulting in deadlock with mlx5:
net_cleanup_work -> cleanup_net (takes down_read(&pernet_ops_rwsem) ->
devlink_pernet_pre_exit() -> devlink_reload() ->
mlx5_devlink_reload_down() -> mlx5_unload_one_devl_locked() ->
mlx5_detach_device() -> del_adev() -> mlx5e_remove() ->
mlx5e_destroy_devlink() -> devlink_free() ->
unregister_netdevice_notifier() (takes down_write(&pernet_ops_rwsem)

Steps to reproduce:
$ modprobe mlx5_core
$ ip netns add ns1
$ devlink dev reload pci/0000:08:00.0 netns ns1
$ ip netns del ns1

Resolve this by converting the notifier from per-devlink instance to
a static one registered during init phase and leaving it registered
forever. Use this notifier for all devlink port instances created
later on.

Note what a tree needs this fix only in case all of the cited fixes
commits are present.

Reported-by: Moshe Shemesh <moshe@nvidia.com>
Fixes: 565b4824c3 ("devlink: change port event netdev notifier from per-net to global")
Fixes: ee75f1fc44 ("net/mlx5e: Create separate devlink instance for ethernet auxiliary device")
Fixes: 72ed5d5624 ("net/mlx5: Suspend auxiliary devices only in case of PCI device suspend")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Link: https://lore.kernel.org/r/20230510144621.932017-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-11 18:06:50 -07:00
Jakub Kicinski
7ce93d6f91 Merge branch 'selftests-seg6-make-srv6_end_dt4_l3vpn_test-more-robust'
Andrea Mayer says:

====================
selftests: seg6: make srv6_end_dt4_l3vpn_test more robust

This pachset aims to improve and make more robust the selftests performed to
check whether SRv6 End.DT4 beahvior works as expected under different system
configurations.
Some Linux distributions enable Deduplication Address Detection and Reverse
Path Filtering mechanisms by default which can interfere with SRv6 End.DT4
behavior and cause selftests to fail.

The following patches improve selftests for End.DT4 by taking these two
mechanisms into account. Specifically:
 - patch 1/2: selftests: seg6: disable DAD on IPv6 router cfg for
              srv6_end_dt4_l3vpn_test
 - patch 2/2: selftets: seg6: disable rp_filter by default in
              srv6_end_dt4_l3vpn_test
====================

Link: https://lore.kernel.org/r/20230510111638.12408-1-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-11 18:01:40 -07:00
Andrea Mayer
f97b8401e0 selftets: seg6: disable rp_filter by default in srv6_end_dt4_l3vpn_test
On some distributions, the rp_filter is automatically set (=1) by
default on a netdev basis (also on VRFs).
In an SRv6 End.DT4 behavior, decapsulated IPv4 packets are routed using
the table associated with the VRF bound to that tunnel. During lookup
operations, the rp_filter can lead to packet loss when activated on the
VRF.
Therefore, we chose to make this selftest more robust by explicitly
disabling the rp_filter during tests (as it is automatically set by some
Linux distributions).

Fixes: 2195444e09 ("selftests: add selftest for the SRv6 End.DT4 behavior")
Reported-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Tested-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-11 18:01:38 -07:00
Andrea Mayer
21a933c79a selftests: seg6: disable DAD on IPv6 router cfg for srv6_end_dt4_l3vpn_test
The srv6_end_dt4_l3vpn_test instantiates a virtual network consisting of
several routers (rt-1, rt-2) and hosts.
When the IPv6 addresses of rt-{1,2} routers are configured, the Deduplicate
Address Detection (DAD) kicks in when enabled in the Linux distros running
the selftests. DAD is used to check whether an IPv6 address is already
assigned in a network. Such a mechanism consists of sending an ICMPv6 Echo
Request and waiting for a reply.
As the DAD process could take too long to complete, it may cause the
failing of some tests carried out by the srv6_end_dt4_l3vpn_test script.

To make the srv6_end_dt4_l3vpn_test more robust, we disable DAD on routers
since we configure the virtual network manually and do not need any address
deduplication mechanism at all.

Fixes: 2195444e09 ("selftests: add selftest for the SRv6 End.DT4 behavior")
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-11 18:01:38 -07:00
Linus Torvalds
6e27831b91 Networking fixes for 6.4-rc2, including fixes from netfilter
Current release - regressions:
 
   - mtk_eth_soc: fix NULL pointer dereference
 
 Previous releases - regressions:
 
   - core:
     - skb_partial_csum_set() fix against transport header magic value
     - fix load-tearing on sk->sk_stamp in sock_recv_cmsgs().
     - annotate sk->sk_err write from do_recvmmsg()
     - add vlan_get_protocol_and_depth() helper
 
   - netlink: annotate accesses to nlk->cb_running
 
   - netfilter: always release netdev hooks from notifier
 
 Previous releases - always broken:
 
   - core: deal with most data-races in sk_wait_event()
 
   - netfilter: fix possible bug_on with enable_hooks=1
 
   - eth: bonding: fix send_peer_notif overflow
 
   - eth: xpcs: fix incorrect number of interfaces
 
   - eth: ipvlan: fix out-of-bounds caused by unclear skb->cb
 
   - eth: stmmac: Initialize MAC_ONEUS_TIC_COUNTER register
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmRcxawSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkv6wQAJgfOBlDAkZNKHzwtMuFiLxECeEMWY9h
 wJCyiq0qXnz9p5ZqjdmTmA8B+jUp9VkpgN5Z3lid5hXDfzDrvXL1KGZW4pc4ooz9
 GUzrp0EUzO5UsyrlZRS9vJ9mbCGN5M1ZWtWH93g8OzGJPRnLs0Q/Tr4IFTBVKzVb
 GmJPy/ZYWYDjnvx3BgewRDuYeH3Rt9lsIt4Pxq/E+D8W3ypvVM0m3GvrO5eEzMeu
 EfeilAdmJGJUufeoGguKt0hheqILS3kNCjQO25XS2Lq1OqetnR/wqTwXaaVxL2du
 Eb2ca7wKkihDpl2l8bQ3ss6vqM0HEpZ63Y2PJaNBS8ASdLsMq4n2L6j2JMfT8hWY
 RG3nJS7F2UFLyYmCJjNL1/I+Z9XeMyFKnHORzHK1dAkMlhd+8NauKWAxdjlxMbxX
 p1msyTl54bG0g6FrU/zAirCWNAAZYCPdZG/XvA/2Jj9mdy64OlGlv/QdJvfjcx+C
 L6nkwZfwXU7QUwKeeTfP8abte2SLrXIxkJrnNEAntPnFOSmd16+/yvQ8JVlbWTMd
 JugJrSAIxjOglIr/1fsnUuV+Ab+JDYQv/wkoyzvtcY2tjhTAHzgmTwwSfeYiCTJE
 rEbjyVvVgMcLTUIk/R9QC5/k6nX/7/KRDHxPOMBX4boOsuA0ARVjzt8uKRvv/7cS
 dRV98RwvCKvD
 =MoPD
 -----END PGP SIGNATURE-----

Merge tag 'net-6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from netfilter.

  Current release - regressions:

   - mtk_eth_soc: fix NULL pointer dereference

  Previous releases - regressions:

   - core:
      - skb_partial_csum_set() fix against transport header magic value
      - fix load-tearing on sk->sk_stamp in sock_recv_cmsgs().
      - annotate sk->sk_err write from do_recvmmsg()
      - add vlan_get_protocol_and_depth() helper

   - netlink: annotate accesses to nlk->cb_running

   - netfilter: always release netdev hooks from notifier

  Previous releases - always broken:

   - core: deal with most data-races in sk_wait_event()

   - netfilter: fix possible bug_on with enable_hooks=1

   - eth: bonding: fix send_peer_notif overflow

   - eth: xpcs: fix incorrect number of interfaces

   - eth: ipvlan: fix out-of-bounds caused by unclear skb->cb

   - eth: stmmac: Initialize MAC_ONEUS_TIC_COUNTER register"

* tag 'net-6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (31 commits)
  af_unix: Fix data races around sk->sk_shutdown.
  af_unix: Fix a data race of sk->sk_receive_queue->qlen.
  net: datagram: fix data-races in datagram_poll()
  net: mscc: ocelot: fix stat counter register values
  ipvlan:Fix out-of-bounds caused by unclear skb->cb
  docs: networking: fix x25-iface.rst heading & index order
  gve: Remove the code of clearing PBA bit
  tcp: add annotations around sk->sk_shutdown accesses
  net: add vlan_get_protocol_and_depth() helper
  net: pcs: xpcs: fix incorrect number of interfaces
  net: deal with most data-races in sk_wait_event()
  net: annotate sk->sk_err write from do_recvmmsg()
  netlink: annotate accesses to nlk->cb_running
  kselftest: bonding: add num_grat_arp test
  selftests: forwarding: lib: add netns support for tc rule handle stats get
  Documentation: bonding: fix the doc of peer_notif_delay
  bonding: fix send_peer_notif overflow
  net: ethernet: mtk_eth_soc: fix NULL pointer dereference
  selftests: nft_flowtable.sh: check ingress/egress chain too
  selftests: nft_flowtable.sh: monitor result file sizes
  ...
2023-05-11 08:42:47 -05:00
Linus Torvalds
691e1eee1b media fixes for v6.4-rc2
-----BEGIN PGP SIGNATURE-----
 
 iQIyBAABCAAdFiEE+QmuaPwR3wnBdVwACF8+vY7k4RUFAmRcg4gACgkQCF8+vY7k
 4RWPGA/0CbjmuSvikL7h2Cjlusp1HAvjhcHyXElKuob87ZlMq248eEQjwmeYkefG
 Oc1q3dHJR52BHwMH/9RhRfwKxFcSjKtb3Ed8R492OxZgIg9urrpb8YA4Z1r4qMaU
 Ehd3AS1a8Fe8pLWNxG4cxcs7Nl5giMTnBORMsxcaLx4miRwkfoNWBCY7L50yjOGA
 l2OYht6sqZMI+CPc4I+C7cgAVCflJBnIGGBoKZI+HSRdMpN1Ndown/O2vc5bZbGh
 mw2MKgMr3+QUxjMm1kebHFeYIgLtg2p95/kjlQlMEqgQiIdhGIMpv3KqZjeRqZZW
 4WAawUMwZEmPXV01OgQtQQT79LYZYVLTxJewm8nRMBLsBbbYNCLLExzVdfY2cxxt
 aXHYnmLsL2wIicQIdNtq9NuptlVo6+zJpBpFQR5BfXT2VhBT/JF7iB7y5m6B269u
 Ah5Sv1ygB6JDrkxxPzdsRSavwxT4tMx5bzIUUvcg9D9AyfXgF/jxxK6MYuALWOv0
 YUZD1wH4XqoRKITqTfts8kvu9+jDkh9ruBHBwDM/i+v9CNyIAQaU32Yolit8IGgx
 ArrDiGigzbD2Z9e8NVZFJy+15PQgpCUBOUagU5w/eszhBo25n7N2a8EjNSHQEzYb
 7IHHVzjZzizhMLdJSWNAFolgmiL+DcZnuhC79rep+EdG7FIp1Q==
 =XxmV
 -----END PGP SIGNATURE-----

Merge tag 'media/v6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media fixes from Mauro Carvalho Chehab:

 - fix some unused-variable warning in mtk-mdp3

 - ignore unused suspend operations in nxp

 - some driver fixes in rcar-vin

* tag 'media/v6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  media: platform: mtk-mdp3: work around unused-variable warning
  media: nxp: ignore unused suspend operations
  media: rcar-vin: Select correct interrupt mode for V4L2_FIELD_ALTERNATE
  media: rcar-vin: Fix NV12 size alignment
  media: rcar-vin: Gen3 can not scale NV12
2023-05-11 08:35:52 -05:00
Jakub Kicinski
cceac92678 netfilter pull request 23-05-10
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmRbUnQACgkQ1V2XiooU
 IOQNVw//eoCiid3J4TuNdzHaBHXUlZvln1n5Z6K5fIz5ytrY1T7oKC8Y9StkzzWR
 29ToFLOJt1iRAcgxsghWRIwUzNwpuUdqgd9cUZMHMQxT0BJItp6FXUql2+1LkF/I
 b/gnnb90zyE7lBS/VSRyOiqMiJlP+Som22d7Nn5k2KfTYEdXKwfzjsWAu3W3Sb0s
 Lv/MA9DE42qcwiZubmFmDtOtAunPJFZm3HgkcAVeXoNkBDrSfkvxLIMYG6VfFNhQ
 AkKMyzX293wpwVxfOuQfJr4QVlxAgOQUko+FqajoWMBtfA3yldZjJ8RC7c9Af9uI
 ciOP11vHBCG84KrTabC5kdqOcvadreDiM/oIvk57ztQhCr3e+po+vIz6Cv4p9I30
 m5GXfgbtMRl6hM2S5lrRc5fNRkYJHE4aNvesFTGaLpK3LogusH1E9mH5jdjnbU42
 TwkIe250qJJelNn9ZxS5Jt0BgyogNfeiA9lOXmaQBpYmwIahrjDf0g8z2I85nPhF
 PDukjSXsxi4uHwpSF5wFrlqkAPiEX4vC95uSUTVbbBgZrHhNJdGgu4FJSHRM4mVo
 6Awxk2O2bvcDpXEfTBdD4EJF71bZ/aH2i8ddU1oupIl88O01TWTtkO4SYHqMX+tC
 fPnpGzBN8KJvHgeFxO4p0v1oelv2RtiITv1gi7YJOnq/+Vd2yy0=
 =HmfP
 -----END PGP SIGNATURE-----

Merge tag 'nf-23-05-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter updates for net

The following patchset contains Netfilter fixes for net:

1) Fix UAF when releasing netnamespace, from Florian Westphal.

2) Fix possible BUG_ON when nf_conntrack is enabled with enable_hooks,
   from Florian Westphal.

3) Fixes for nft_flowtable.sh selftest, from Boris Sukholitko.

4) Extend nft_flowtable.sh selftest to cover integration with
   ingress/egress hooks, from Florian Westphal.

* tag 'nf-23-05-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  selftests: nft_flowtable.sh: check ingress/egress chain too
  selftests: nft_flowtable.sh: monitor result file sizes
  selftests: nft_flowtable.sh: wait for specific nc pids
  selftests: nft_flowtable.sh: no need for ps -x option
  selftests: nft_flowtable.sh: use /proc for pid checking
  netfilter: conntrack: fix possible bug_on with enable_hooks=1
  netfilter: nf_tables: always release netdev hooks from notifier
====================

Link: https://lore.kernel.org/r/20230510083313.152961-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-10 19:08:58 -07:00
Jakub Kicinski
33dcee99e0 Merge branch 'af_unix-fix-two-data-races-reported-by-kcsan'
Kuniyuki Iwashima says:

====================
af_unix: Fix two data races reported by KCSAN.

KCSAN reported data races around these two fields for AF_UNIX sockets.

  * sk->sk_receive_queue->qlen
  * sk->sk_shutdown

Let's annotate them properly.
====================

Link: https://lore.kernel.org/r/20230510003456.42357-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-10 19:06:55 -07:00
Kuniyuki Iwashima
e1d09c2c2f af_unix: Fix data races around sk->sk_shutdown.
KCSAN found a data race around sk->sk_shutdown where unix_release_sock()
and unix_shutdown() update it under unix_state_lock(), OTOH unix_poll()
and unix_dgram_poll() read it locklessly.

We need to annotate the writes and reads with WRITE_ONCE() and READ_ONCE().

BUG: KCSAN: data-race in unix_poll / unix_release_sock

write to 0xffff88800d0f8aec of 1 bytes by task 264 on cpu 0:
 unix_release_sock+0x75c/0x910 net/unix/af_unix.c:631
 unix_release+0x59/0x80 net/unix/af_unix.c:1042
 __sock_release+0x7d/0x170 net/socket.c:653
 sock_close+0x19/0x30 net/socket.c:1397
 __fput+0x179/0x5e0 fs/file_table.c:321
 ____fput+0x15/0x20 fs/file_table.c:349
 task_work_run+0x116/0x1a0 kernel/task_work.c:179
 resume_user_mode_work include/linux/resume_user_mode.h:49 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
 exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204
 __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
 syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297
 do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

read to 0xffff88800d0f8aec of 1 bytes by task 222 on cpu 1:
 unix_poll+0xa3/0x2a0 net/unix/af_unix.c:3170
 sock_poll+0xcf/0x2b0 net/socket.c:1385
 vfs_poll include/linux/poll.h:88 [inline]
 ep_item_poll.isra.0+0x78/0xc0 fs/eventpoll.c:855
 ep_send_events fs/eventpoll.c:1694 [inline]
 ep_poll fs/eventpoll.c:1823 [inline]
 do_epoll_wait+0x6c4/0xea0 fs/eventpoll.c:2258
 __do_sys_epoll_wait fs/eventpoll.c:2270 [inline]
 __se_sys_epoll_wait fs/eventpoll.c:2265 [inline]
 __x64_sys_epoll_wait+0xcc/0x190 fs/eventpoll.c:2265
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

value changed: 0x00 -> 0x03

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 222 Comm: dbus-broker Not tainted 6.3.0-rc7-02330-gca6270c12e20 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014

Fixes: 3c73419c09 ("af_unix: fix 'poll for write'/ connected DGRAM sockets")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-10 19:06:53 -07:00
Kuniyuki Iwashima
679ed006d4 af_unix: Fix a data race of sk->sk_receive_queue->qlen.
KCSAN found a data race of sk->sk_receive_queue->qlen where recvmsg()
updates qlen under the queue lock and sendmsg() checks qlen under
unix_state_sock(), not the queue lock, so the reader side needs
READ_ONCE().

BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_wait_for_peer

write (marked) to 0xffff888019fe7c68 of 4 bytes by task 49792 on cpu 0:
 __skb_unlink include/linux/skbuff.h:2347 [inline]
 __skb_try_recv_from_queue+0x3de/0x470 net/core/datagram.c:197
 __skb_try_recv_datagram+0xf7/0x390 net/core/datagram.c:263
 __unix_dgram_recvmsg+0x109/0x8a0 net/unix/af_unix.c:2452
 unix_dgram_recvmsg+0x94/0xa0 net/unix/af_unix.c:2549
 sock_recvmsg_nosec net/socket.c:1019 [inline]
 ____sys_recvmsg+0x3a3/0x3b0 net/socket.c:2720
 ___sys_recvmsg+0xc8/0x150 net/socket.c:2764
 do_recvmmsg+0x182/0x560 net/socket.c:2858
 __sys_recvmmsg net/socket.c:2937 [inline]
 __do_sys_recvmmsg net/socket.c:2960 [inline]
 __se_sys_recvmmsg net/socket.c:2953 [inline]
 __x64_sys_recvmmsg+0x153/0x170 net/socket.c:2953
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

read to 0xffff888019fe7c68 of 4 bytes by task 49793 on cpu 1:
 skb_queue_len include/linux/skbuff.h:2127 [inline]
 unix_recvq_full net/unix/af_unix.c:229 [inline]
 unix_wait_for_peer+0x154/0x1a0 net/unix/af_unix.c:1445
 unix_dgram_sendmsg+0x13bc/0x14b0 net/unix/af_unix.c:2048
 sock_sendmsg_nosec net/socket.c:724 [inline]
 sock_sendmsg+0x148/0x160 net/socket.c:747
 ____sys_sendmsg+0x20e/0x620 net/socket.c:2503
 ___sys_sendmsg+0xc6/0x140 net/socket.c:2557
 __sys_sendmmsg+0x11d/0x370 net/socket.c:2643
 __do_sys_sendmmsg net/socket.c:2672 [inline]
 __se_sys_sendmmsg net/socket.c:2669 [inline]
 __x64_sys_sendmmsg+0x58/0x70 net/socket.c:2669
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

value changed: 0x0000000b -> 0x00000001

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 49793 Comm: syz-executor.0 Not tainted 6.3.0-rc7-02330-gca6270c12e20 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-10 19:06:53 -07:00
Eric Dumazet
5bca1d081f net: datagram: fix data-races in datagram_poll()
datagram_poll() runs locklessly, we should add READ_ONCE()
annotations while reading sk->sk_err, sk->sk_shutdown and sk->sk_state.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20230509173131.3263780-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-10 19:06:49 -07:00
Linus Torvalds
80e62bc848 MAINTAINERS: re-sort all entries and fields
It's been a few years since we've sorted this thing, and the end result
is that we've added MAINTAINERS entries in the wrong order, and a number
of entries have their fields in non-canonical order too.

So roll this boulder up the hill one more time by re-running

   ./scripts/parse-maintainers.pl --order

on it.

This file ends up being fairly painful for merge conflicts even
normally, since unlike almost all other kernel files it's one of those
"everybody touches the same thing", and re-ordering all entries is only
going to make that worse.  But the alternative is to never do it at all,
and just let it all rot..

The rc2 week is likely the quietest and least painful time to do this.

Requested-by: Randy Dunlap <rdunlap@infradead.org>
Requested-by: Joe Perches <joe@perches.com>	# "Please use --order"
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-05-10 20:04:04 -05:00
Linus Torvalds
d295b66a7b \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmRcFLAACgkQnJ2qBz9k
 QNmSuQf+P+hFOHVpCb83S4TrNsfPC9M7Pv/M1Rw926p1gESNLez1ElAx2B35OIBZ
 y4ijqrSHgXbCsC3PvbxR9YfBPJZwIhtmKwjPC+9KlhAFB536Bvt/nt34fjRCWJiS
 nABnexegkirk7EQufAGVZQ1gtm1p0qSCA1B/l5AFl4KeLIin31U/08F2eE1OAo3K
 TmxhXzlC+JRXjf/X/X8mzJ8nz/hY5039BDvs7EpVdOsZcLFRQcCUygUCrOdV94uI
 1wyzMoGNaiIsZHpHPf3NtphUoE9IFpZQJGXMTR4ehBgFYkkOHYxsjW+wn8YOmaks
 qzRRXb9w0WdgAXChDGu1xYXV6Wpjiw==
 =kqSe
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull inotify fix from Jan Kara:
 "A fix for possibly reporting invalid watch descriptor with inotify
  event"

* tag 'fsnotify_for_v6.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  inotify: Avoid reporting event with invalid wd
2023-05-10 17:07:42 -05:00
Linus Torvalds
2a78769da3 gfs2 fix
- Fix a NULL pointer dereference when mounting corrupted filesystems.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmRbtSYUHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTrQSg/6A7xD25DpU1CSL18CEKMxnViXPNLP
 IYiTtkByCsRkgQKR4KMtVgPD/lEyWbhpuZJzSF4mZkslpnngjL/o+Rj4Sds9GvPL
 +66CSJAqjpJUjgGDvAoiJcZYOejRyqZLCHzKF54h0oJ3k9JCd3xN86EX8vKGoc2e
 F/As+E1Wnf0PP4jiERDyFhhg95MNxFuNVJ/JJ7mwtWpumx+60SpGzzUp575AHcoe
 4zHkVHlj6FAz2G8l8NADYAc0wMuhQ05g19psHzRDr2bB9zHBy9kWzXN0MYrhRvuY
 HuQ3CNDbXCJPXFjrNEmLV0f7Y8yCk/YNrYPbo4zV+AW0l7NzesqzrwkP/qgNfySD
 pnQJ/mQ76P/h728fln0pYvO8GnFUFoKKoCR27jH4F/Mntr4hgR4FZNDhvK6BMQNc
 PpipJcXWT8AUY45Sg3TLKPEhjp37K+abiMKfyTEFQRvLT83O6WyCUx2HYpZI7l87
 /aTIsidlYkb2ZSeLeXEfBplRXwF7NQNvvRJCGQKHAN7UbTB1o2RVpC5bn6UIghb1
 eQQZObs4AUfmzHthF2OaeJy08AiFidbOVLmbpZuj2iOWbJzmjJea8YT/GdntxhMT
 swmkHJqdmZEYoSKKvDmemT6l7qtOwijg7sTQ/d2D77lk8QPPIHdUw6g1OVvMqoNE
 oESpIA4O2fshlpU=
 =/s3O
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v6.3-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fix from Andreas Gruenbacher:

 - Fix a NULL pointer dereference when mounting corrupted filesystems

* tag 'gfs2-v6.3-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Don't deref jdesc in evict
2023-05-10 16:48:35 -05:00
Bob Peterson
504a10d9e4 gfs2: Don't deref jdesc in evict
On corrupt gfs2 file systems the evict code can try to reference the
journal descriptor structure, jdesc, after it has been freed and set to
NULL. The sequence of events is:

init_journal()
...
fail_jindex:
   gfs2_jindex_free(sdp); <------frees journals, sets jdesc = NULL
      if (gfs2_holder_initialized(&ji_gh))
         gfs2_glock_dq_uninit(&ji_gh);
fail:
   iput(sdp->sd_jindex); <--references jdesc in evict_linked_inode
      evict()
         gfs2_evict_inode()
            evict_linked_inode()
               ret = gfs2_trans_begin(sdp, 0, sdp->sd_jdesc->jd_blocks);
<------references the now freed/zeroed sd_jdesc pointer.

The call to gfs2_trans_begin is done because the truncate_inode_pages
call can cause gfs2 events that require a transaction, such as removing
journaled data (jdata) blocks from the journal.

This patch fixes the problem by adding a check for sdp->sd_jdesc to
function gfs2_evict_inode. In theory, this should only happen to corrupt
gfs2 file systems, when gfs2 detects the problem, reports it, then tries
to evict all the system inodes it has read in up to that point.

Reported-by: Yang Lan <lanyang0908@gmail.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2023-05-10 17:15:18 +02:00
Linus Torvalds
ad2fd53a78 platform-drivers-x86 for v6.4-2
Highlights:
  -  thinkpad_acpi: Fix profile (performance/bal/low-power) regression on T490
  -  misc. other small fixes / hw-id additions
 
 The following is an automated git shortlog grouped by driver:
 
 hp-wmi:
  -  add micmute to hp_wmi_keymap struct
 
 intel_scu_pcidrv:
  -  Add back PCI ID for Medfield
 
 platform/mellanox:
  -  fix potential race in mlxbf-tmfifo driver
 
 platform/x86/intel-uncore-freq:
  -  Return error on write frequency
 
 thinkpad_acpi:
  -  Add profile force ability
  -  Fix platform profiles on T490
 
 touchscreen_dmi:
  -  Add info for the Dexp Ursus KX210i
  -  Add upside-down quirk for GDIX1002 ts on the Juno Tablet
 -----BEGIN PGP SIGNATURE-----
 
 iQFIBAABCAAyFiEEuvA7XScYQRpenhd+kuxHeUQDJ9wFAmRbcQ4UHGhkZWdvZWRl
 QHJlZGhhdC5jb20ACgkQkuxHeUQDJ9yFnQf9G83wo2B5SfjN7V7viYHEGB4InfQL
 +MbhwP+Efe++BkxwAzVMcVLSuYOsbqcrRBCiQkBVCyAF14LVVH3ymnn/upoqz9l4
 K4JUZorGIEPgl0SpGP+41j+P93XYfcfMYPRetZA8ErBpitSfHVf9oAG8vTgaemda
 Qxo8DWs4R1WGdg8KZ7icZ7shXiyzIeGBnQB/5EzpBrv6v6EmiJsUZh5ZZASOidrg
 IXTp8LgLhkFbIQBgHoOyHoM7puRI7MRbxu5YEJ+wKumFi4KmuTET5iDlZFpmqvoH
 eAU/JUWIsQ3I1QwTKFSLFYkJZCZrIIJHew1Lvs9F+/x1c2vs1QjOG8RoEg==
 =SEuW
 -----END PGP SIGNATURE-----

Merge tag 'platform-drivers-x86-v6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86

Pull x86 platform driver fixes from Hans de Goede:
 "Nothing special to report just various small fixes:

   - thinkpad_acpi: Fix profile (performance/bal/low-power) regression
     on T490

   - misc other small fixes / hw-id additions"

* tag 'platform-drivers-x86-v6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
  platform/mellanox: fix potential race in mlxbf-tmfifo driver
  platform/x86: touchscreen_dmi: Add info for the Dexp Ursus KX210i
  platform/x86: touchscreen_dmi: Add upside-down quirk for GDIX1002 ts on the Juno Tablet
  platform/x86: thinkpad_acpi: Add profile force ability
  platform/x86: thinkpad_acpi: Fix platform profiles on T490
  platform/x86: hp-wmi: add micmute to hp_wmi_keymap struct
  platform/x86/intel-uncore-freq: Return error on write frequency
  platform/x86: intel_scu_pcidrv: Add back PCI ID for Medfield
2023-05-10 09:36:42 -05:00
Colin Foster
cdc2e28e21 net: mscc: ocelot: fix stat counter register values
Commit d4c3676507 ("net: mscc: ocelot: keep ocelot_stat_layout by reg
address, not offset") organized the stats counters for Ocelot chips, namely
the VSC7512 and VSC7514. A few of the counter offsets were incorrect, and
were caught by this warning:

WARNING: CPU: 0 PID: 24 at drivers/net/ethernet/mscc/ocelot_stats.c:909
ocelot_stats_init+0x1fc/0x2d8
reg 0x5000078 had address 0x220 but reg 0x5000079 has address 0x214,
bulking broken!

Fix these register offsets.

Fixes: d4c3676507 ("net: mscc: ocelot: keep ocelot_stat_layout by reg address, not offset")
Signed-off-by: Colin Foster <colin.foster@in-advantage.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-10 12:11:18 +01:00
t.feng
90cbed5247 ipvlan:Fix out-of-bounds caused by unclear skb->cb
If skb enqueue the qdisc, fq_skb_cb(skb)->time_to_send is changed which
is actually skb->cb, and IPCB(skb_in)->opt will be used in
__ip_options_echo. It is possible that memcpy is out of bounds and lead
to stack overflow.
We should clear skb->cb before ip_local_out or ip6_local_out.

v2:
1. clean the stack info
2. use IPCB/IP6CB instead of skb->cb

crash on stable-5.10(reproduce in kasan kernel).
Stack info:
[ 2203.651571] BUG: KASAN: stack-out-of-bounds in
__ip_options_echo+0x589/0x800
[ 2203.653327] Write of size 4 at addr ffff88811a388f27 by task
swapper/3/0
[ 2203.655460] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Not tainted
5.10.0-60.18.0.50.h856.kasan.eulerosv2r11.x86_64 #1
[ 2203.655466] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.10.2-0-g5f4c7b1-20181220_000000-szxrtosci10000 04/01/2014
[ 2203.655475] Call Trace:
[ 2203.655481]  <IRQ>
[ 2203.655501]  dump_stack+0x9c/0xd3
[ 2203.655514]  print_address_description.constprop.0+0x19/0x170
[ 2203.655530]  __kasan_report.cold+0x6c/0x84
[ 2203.655586]  kasan_report+0x3a/0x50
[ 2203.655594]  check_memory_region+0xfd/0x1f0
[ 2203.655601]  memcpy+0x39/0x60
[ 2203.655608]  __ip_options_echo+0x589/0x800
[ 2203.655654]  __icmp_send+0x59a/0x960
[ 2203.655755]  nf_send_unreach+0x129/0x3d0 [nf_reject_ipv4]
[ 2203.655763]  reject_tg+0x77/0x1bf [ipt_REJECT]
[ 2203.655772]  ipt_do_table+0x691/0xa40 [ip_tables]
[ 2203.655821]  nf_hook_slow+0x69/0x100
[ 2203.655828]  __ip_local_out+0x21e/0x2b0
[ 2203.655857]  ip_local_out+0x28/0x90
[ 2203.655868]  ipvlan_process_v4_outbound+0x21e/0x260 [ipvlan]
[ 2203.655931]  ipvlan_xmit_mode_l3+0x3bd/0x400 [ipvlan]
[ 2203.655967]  ipvlan_queue_xmit+0xb3/0x190 [ipvlan]
[ 2203.655977]  ipvlan_start_xmit+0x2e/0xb0 [ipvlan]
[ 2203.655984]  xmit_one.constprop.0+0xe1/0x280
[ 2203.655992]  dev_hard_start_xmit+0x62/0x100
[ 2203.656000]  sch_direct_xmit+0x215/0x640
[ 2203.656028]  __qdisc_run+0x153/0x1f0
[ 2203.656069]  __dev_queue_xmit+0x77f/0x1030
[ 2203.656173]  ip_finish_output2+0x59b/0xc20
[ 2203.656244]  __ip_finish_output.part.0+0x318/0x3d0
[ 2203.656312]  ip_finish_output+0x168/0x190
[ 2203.656320]  ip_output+0x12d/0x220
[ 2203.656357]  __ip_queue_xmit+0x392/0x880
[ 2203.656380]  __tcp_transmit_skb+0x1088/0x11c0
[ 2203.656436]  __tcp_retransmit_skb+0x475/0xa30
[ 2203.656505]  tcp_retransmit_skb+0x2d/0x190
[ 2203.656512]  tcp_retransmit_timer+0x3af/0x9a0
[ 2203.656519]  tcp_write_timer_handler+0x3ba/0x510
[ 2203.656529]  tcp_write_timer+0x55/0x180
[ 2203.656542]  call_timer_fn+0x3f/0x1d0
[ 2203.656555]  expire_timers+0x160/0x200
[ 2203.656562]  run_timer_softirq+0x1f4/0x480
[ 2203.656606]  __do_softirq+0xfd/0x402
[ 2203.656613]  asm_call_irq_on_stack+0x12/0x20
[ 2203.656617]  </IRQ>
[ 2203.656623]  do_softirq_own_stack+0x37/0x50
[ 2203.656631]  irq_exit_rcu+0x134/0x1a0
[ 2203.656639]  sysvec_apic_timer_interrupt+0x36/0x80
[ 2203.656646]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 2203.656654] RIP: 0010:default_idle+0x13/0x20
[ 2203.656663] Code: 89 f0 5d 41 5c 41 5d 41 5e c3 cc cc cc cc cc cc cc
cc cc cc cc cc cc 0f 1f 44 00 00 0f 1f 44 00 00 0f 00 2d 9f 32 57 00 fb
f4 <c3> cc cc cc cc 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 be 08
[ 2203.656668] RSP: 0018:ffff88810036fe78 EFLAGS: 00000256
[ 2203.656676] RAX: ffffffffaf2a87f0 RBX: ffff888100360000 RCX:
ffffffffaf290191
[ 2203.656681] RDX: 0000000000098b5e RSI: 0000000000000004 RDI:
ffff88811a3c4f60
[ 2203.656686] RBP: 0000000000000000 R08: 0000000000000001 R09:
ffff88811a3c4f63
[ 2203.656690] R10: ffffed10234789ec R11: 0000000000000001 R12:
0000000000000003
[ 2203.656695] R13: ffff888100360000 R14: 0000000000000000 R15:
0000000000000000
[ 2203.656729]  default_idle_call+0x5a/0x150
[ 2203.656735]  cpuidle_idle_call+0x1c6/0x220
[ 2203.656780]  do_idle+0xab/0x100
[ 2203.656786]  cpu_startup_entry+0x19/0x20
[ 2203.656793]  secondary_startup_64_no_verify+0xc2/0xcb

[ 2203.657409] The buggy address belongs to the page:
[ 2203.658648] page:0000000027a9842f refcount:1 mapcount:0
mapping:0000000000000000 index:0x0 pfn:0x11a388
[ 2203.658665] flags:
0x17ffffc0001000(reserved|node=0|zone=2|lastcpupid=0x1fffff)
[ 2203.658675] raw: 0017ffffc0001000 ffffea000468e208 ffffea000468e208
0000000000000000
[ 2203.658682] raw: 0000000000000000 0000000000000000 00000001ffffffff
0000000000000000
[ 2203.658686] page dumped because: kasan: bad access detected

To reproduce(ipvlan with IPVLAN_MODE_L3):
Env setting:
=======================================================
modprobe ipvlan ipvlan_default_mode=1
sysctl net.ipv4.conf.eth0.forwarding=1
iptables -t nat -A POSTROUTING -s 20.0.0.0/255.255.255.0 -o eth0 -j
MASQUERADE
ip link add gw link eth0 type ipvlan
ip -4 addr add 20.0.0.254/24 dev gw
ip netns add net1
ip link add ipv1 link eth0 type ipvlan
ip link set ipv1 netns net1
ip netns exec net1 ip link set ipv1 up
ip netns exec net1 ip -4 addr add 20.0.0.4/24 dev ipv1
ip netns exec net1 route add default gw 20.0.0.254
ip netns exec net1 tc qdisc add dev ipv1 root netem loss 10%
ifconfig gw up
iptables -t filter -A OUTPUT -p tcp --dport 8888 -j REJECT --reject-with
icmp-port-unreachable
=======================================================
And then excute the shell(curl any address of eth0 can reach):

for((i=1;i<=100000;i++))
do
        ip netns exec net1 curl x.x.x.x:8888
done
=======================================================

Fixes: 2ad7bf3638 ("ipvlan: Initial check-in of the IPVLAN driver.")
Signed-off-by: "t.feng" <fengtao40@huawei.com>
Suggested-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-10 10:33:06 +01:00
Randy Dunlap
77c964dad9 docs: networking: fix x25-iface.rst heading & index order
Fix the chapter heading for "X.25 Device Driver Interface" so that it
does not contain a trailing '-' character, which makes Sphinx
omit this heading from the contents.

Reverse the order of the x25.rst and x25-iface.rst files in the index
so that the project introduction (x25.rst) comes first.

Fixes: 883780af72 ("docs: networking: convert x25-iface.txt to ReST")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: linux-doc@vger.kernel.org
Cc: Martin Schiller <ms@dev.tdt.de>
Cc: linux-x25@vger.kernel.org
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-10 10:31:46 +01:00