Commit Graph

1280477 Commits

Author SHA1 Message Date
Mykyta Yatsenko
08ac454e25 libbpf: Auto-attach struct_ops BPF maps in BPF skeleton
Similarly to `bpf_program`, support `bpf_map` automatic attachment in
`bpf_object__attach_skeleton`. Currently only struct_ops maps could be
attached.

On bpftool side, code-generate links in skeleton struct for struct_ops maps.
Similarly to `bpf_program_skeleton`, set links in `bpf_map_skeleton`.

On libbpf side, extend `bpf_map` with new `autoattach` field to support
enabling or disabling autoattach functionality, introducing
getter/setter for this field.

`bpf_object__(attach|detach)_skeleton` is extended with
attaching/detaching struct_ops maps logic.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240605175135.117127-1-yatsenko@meta.com
2024-06-06 10:06:05 -07:00
Linus Torvalds
d30d0e49da Including fixes from BPF and big collection of fixes for WiFi core
and drivers.
 
 Current release - regressions:
 
  - vxlan: fix regression when dropping packets due to invalid src addresses
 
  - bpf: fix a potential use-after-free in bpf_link_free()
 
  - xdp: revert support for redirect to any xsk socket bound to the same
    UMEM as it can result in a corruption
 
  - virtio_net:
    - add missing lock protection when reading return code from control_buf
    - fix false-positive lockdep splat in DIM
    - Revert "wifi: wilc1000: convert list management to RCU"
 
  - wifi: ath11k: fix error path in ath11k_pcic_ext_irq_config
 
 Previous releases - regressions:
 
  - rtnetlink: make the "split" NLM_DONE handling generic, restore the old
    behavior for two cases where we started coalescing those messages with
    normal messages, breaking sloppily-coded userspace
 
  - wifi:
    - cfg80211: validate HE operation element parsing
    - cfg80211: fix 6 GHz scan request building
    - mt76: mt7615: add missing chanctx ops
    - ath11k: move power type check to ASSOC stage, fix connecting
      to 6 GHz AP
    - ath11k: fix WCN6750 firmware crash caused by 17 num_vdevs
    - rtlwifi: ignore IEEE80211_CONF_CHANGE_RETRY_LIMITS
    - iwlwifi: mvm: fix a crash on 7265
 
 Previous releases - always broken:
 
  - ncsi: prevent multi-threaded channel probing, a spec violation
 
  - vmxnet3: disable rx data ring on dma allocation failure
 
  - ethtool: init tsinfo stats if requested, prevent unintentionally
    reporting all-zero stats on devices which don't implement any
 
  - dst_cache: fix possible races in less common IPv6 features
 
  - tcp: auth: don't consider TCP_CLOSE to be in TCP_AO_ESTABLISHED
 
  - ax25: fix two refcounting bugs
 
  - eth: ionic: fix kernel panic in XDP_TX action
 
 Misc:
 
  - tcp: count CLOSE-WAIT sockets for TCP_MIB_CURRESTAB
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmZh3mUACgkQMUZtbf5S
 IrvPwRAApv8X0ZIbPD5PuVEkiYuSkSE6QVou5GaVO7DzF4gj07zPNtCe6B/ZZdBu
 RLdlppxjAmVwdCRmUo0plxSydYZcqFpQqV6lRH/rbWmktWIp0pGIOAcOG7ISRPCC
 FAYJ4udSt4+wrq0hXTsE1KO1JZ0p7zE2bXxNC8uR8wgM9yonUjqhYdAUZhrl3yCY
 zOCD/+kvWFLYtehDcmyNK0ANS3yNveTNkRhXDc1UrpOGMtza60lf5u3bWK+sU5VS
 NGPe9cU60WKMQi6QnWFBZKIcp4Vgy2MukOLdNn9e8BRjFLh2dbY86LAmE4HWPA7I
 ONZagOfEjeOcRSCMdFHxui/PUDZLBZNhrnqQ6x8uC2yKwwIMr+CgEt5sCmVFwH6n
 3HTlWSjL38yuiVuYuhxGchmVnZfC4bLi2qAFF1oxhlDGViBDhAwi36MSCnjDpN8k
 Jo0x6crQLS/uvwVXPKWAUcQhy7OE69A3FwwA1PtkxRX5EQPn1if2Z7yq7YfYb9aD
 bChvCarlfuVDm+CBItphXg0ajVZc+im7+JK62Zn50A1cTbEK0lnYCOcmqzqiqrXI
 Vr3XXt6gVVnvwY374JDO1vmB5ft2IYBn7sWnLcIvR2UlggqEfqMdKSSwm7pOprG9
 YJ/LDAXVmG0kLN7rZUYUBLItnpuHAhYDrBOsV5HaFeksWauc1oY=
 =mwEJ
 -----END PGP SIGNATURE-----

Merge tag 'net-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from BPF and big collection of fixes for WiFi core and
  drivers.

  Current release - regressions:

   - vxlan: fix regression when dropping packets due to invalid src
     addresses

   - bpf: fix a potential use-after-free in bpf_link_free()

   - xdp: revert support for redirect to any xsk socket bound to the
     same UMEM as it can result in a corruption

   - virtio_net:
      - add missing lock protection when reading return code from
        control_buf
      - fix false-positive lockdep splat in DIM
      - Revert "wifi: wilc1000: convert list management to RCU"

   - wifi: ath11k: fix error path in ath11k_pcic_ext_irq_config

  Previous releases - regressions:

   - rtnetlink: make the "split" NLM_DONE handling generic, restore the
     old behavior for two cases where we started coalescing those
     messages with normal messages, breaking sloppily-coded userspace

   - wifi:
      - cfg80211: validate HE operation element parsing
      - cfg80211: fix 6 GHz scan request building
      - mt76: mt7615: add missing chanctx ops
      - ath11k: move power type check to ASSOC stage, fix connecting to
        6 GHz AP
      - ath11k: fix WCN6750 firmware crash caused by 17 num_vdevs
      - rtlwifi: ignore IEEE80211_CONF_CHANGE_RETRY_LIMITS
      - iwlwifi: mvm: fix a crash on 7265

  Previous releases - always broken:

   - ncsi: prevent multi-threaded channel probing, a spec violation

   - vmxnet3: disable rx data ring on dma allocation failure

   - ethtool: init tsinfo stats if requested, prevent unintentionally
     reporting all-zero stats on devices which don't implement any

   - dst_cache: fix possible races in less common IPv6 features

   - tcp: auth: don't consider TCP_CLOSE to be in TCP_AO_ESTABLISHED

   - ax25: fix two refcounting bugs

   - eth: ionic: fix kernel panic in XDP_TX action

  Misc:

   - tcp: count CLOSE-WAIT sockets for TCP_MIB_CURRESTAB"

* tag 'net-6.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (107 commits)
  selftests: net: lib: set 'i' as local
  selftests: net: lib: avoid error removing empty netns name
  selftests: net: lib: support errexit with busywait
  net: ethtool: fix the error condition in ethtool_get_phy_stats_ethtool()
  ipv6: fix possible race in __fib6_drop_pcpu_from()
  af_unix: Annotate data-race of sk->sk_shutdown in sk_diag_fill().
  af_unix: Use skb_queue_len_lockless() in sk_diag_show_rqlen().
  af_unix: Use skb_queue_empty_lockless() in unix_release_sock().
  af_unix: Use unix_recvq_full_lockless() in unix_stream_connect().
  af_unix: Annotate data-race of net->unx.sysctl_max_dgram_qlen.
  af_unix: Annotate data-races around sk->sk_sndbuf.
  af_unix: Annotate data-races around sk->sk_state in UNIX_DIAG.
  af_unix: Annotate data-race of sk->sk_state in unix_stream_read_skb().
  af_unix: Annotate data-races around sk->sk_state in sendmsg() and recvmsg().
  af_unix: Annotate data-race of sk->sk_state in unix_accept().
  af_unix: Annotate data-race of sk->sk_state in unix_stream_connect().
  af_unix: Annotate data-races around sk->sk_state in unix_write_space() and poll().
  af_unix: Annotate data-race of sk->sk_state in unix_inq_len().
  af_unix: Annodate data-races around sk->sk_state for writers.
  af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
  ...
2024-06-06 09:55:27 -07:00
Linus Torvalds
2faf6332c5 Single patch, no behavior changes.
Tetsuo Handa (1):
   tomoyo: update project links
 
  Documentation/admin-guide/LSM/tomoyo.rst |   35 +++++++++----------------------
  MAINTAINERS                              |    2 -
  security/tomoyo/Kconfig                  |    2 -
  security/tomoyo/common.c                 |    2 -
  4 files changed, 14 insertions(+), 27 deletions(-)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJmYchgAAoJEEJfEo0MZPUqqXYP/ROdUgeoGCYo4Fv7PKoQtiwm
 cCf53gQD0ozv2pVpYQH6TnF4MUfnqxEjskYgL9sJahwSQ8pdyj8SO08uVACBgwuJ
 1cXAGrSBFJEYTZY9/V3JTSbdTvqQsVTwSii3hj/VABfYTTQtnLdqPFmsslStIstx
 sGNwIZPvwX5xCTG6YkZCXBPtZGxtAhVvUueRF46525qZvmsgV7ziGfqNecNdfHyi
 6wiw9HXZJlaKcj+RNQrcc10JeX0g3we3gpVIa8FJ5+wnpOvVuQOtq9lm9Idzw6xo
 AsKyg3jTjDaJjIv125lv7++DIXyipvDK8TPZJOwiC8WYsChLveb+fZV3YMNLpz2N
 Qepgzejf1O7rLT55zJID4KQGwCkCTg7TJILLA57wFAa+7VGLspkvIXsNzOjpe9P7
 9ufclnrAkM1RbBIUqSOj1OcTm6dSBkNG32MI869NZ6M8UH3gDbmCLTsMNv7JT2Qe
 ax7E8zRqDTJBzH4dcAIKJ1pFF4lIj6H7dhbDJf0TPB89UGJdBdil4b+JIaJyZXEn
 0M/RFdPiiw/vGsaFn1m6RCkV0WuuLhUHCOhq+0ukzsVfs9XqXWs/Yfngt07I3ldH
 ALB+dE7sddFI0dvyrJub/MTd3KRHZfB6TF1mKeHQe7Y4lR1TNctQxUuqClDJVXaT
 a38bb4G+qgIOcVMHeSaL
 =cwIu
 -----END PGP SIGNATURE-----

Merge tag 'tomoyo-pr-20240606' of git://git.code.sf.net/p/tomoyo/tomoyo

Pull tomoyo fixlet from Tetsuo Handa:
 "Single patch to update project links, no behavior changes"

* tag 'tomoyo-pr-20240606' of git://git.code.sf.net/p/tomoyo/tomoyo:
  tomoyo: update project links
2024-06-06 09:48:57 -07:00
Linus Torvalds
a34adf6010 EFI fixes for v6.10 #2
- Ensure that .discard sections are really discarded in the EFI zboot
   image build
 - Return proper error numbers from efi-pstore
 - Add __nocfi annotations to EFI runtime wrappers
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQQm/3uucuRGn1Dmh0wbglWLn0tXAUCZmAmwwAKCRAwbglWLn0t
 XNbNAQDsnOTRK4Azr0rqHUvOoB2g+0XlIL9yR+r5MwV8lAdL+QD9GJpX7p7pzT4q
 aT4zzzoS1h9FFUNTDtE7by18bDBElgI=
 =RxkM
 -----END PGP SIGNATURE-----

Merge tag 'efi-fixes-for-v6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi

Pull EFI fixes from Ard Biesheuvel:

 - Ensure that .discard sections are really discarded in the EFI zboot
   image build

 - Return proper error numbers from efi-pstore

 - Add __nocfi annotations to EFI runtime wrappers

* tag 'efi-fixes-for-v6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
  efi: Add missing __nocfi annotations to runtime wrappers
  efi: pstore: Return proper errors on UEFI failures
  efi/libstub: zboot.lds: Discard .discard sections
2024-06-06 09:39:36 -07:00
Jakub Kicinski
27bc865408 Merge branch 'selftests-net-lib-small-fixes'
Matthieu Baerts says:

====================
selftests: net: lib: small fixes

While looking at using 'lib.sh' for the MPTCP selftests [1], we found
some small issues with 'lib.sh'. Here they are:

- Patch 1: fix 'errexit' (set -e) support with busywait. 'errexit' is
  supported in some functions, not all. A fix for v6.8+.

- Patch 2: avoid confusing error messages linked to the cleaning part
  when the netns setup fails. A fix for v6.8+.

- Patch 3: set a variable as local to avoid accidentally changing the
  value of a another one with the same name on the caller side. A fix
  for v6.10-rc1+.

Link: https://lore.kernel.org/mptcp/5f4615c3-0621-43c5-ad25-55747a4350ce@kernel.org/T/ [1]
====================

Link: https://lore.kernel.org/r/20240605-upstream-net-20240605-selftests-net-lib-fixes-v1-0-b3afadd368c9@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-06 08:29:07 -07:00
Matthieu Baerts (NGI0)
84a8bc3ec2 selftests: net: lib: set 'i' as local
Without this, the 'i' variable declared before could be overridden by
accident, e.g.

  for i in "${@}"; do
      __ksft_status_merge "${i}"  ## 'i' has been modified
      foo "${i}"                  ## using 'i' with an unexpected value
  done

After a quick look, it looks like 'i' is currently not used after having
been modified in __ksft_status_merge(), but still, better be safe than
sorry. I saw this while modifying the same file, not because I suspected
an issue somewhere.

Fixes: 596c8819cb ("selftests: forwarding: Have RET track kselftest framework constants")
Acked-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20240605-upstream-net-20240605-selftests-net-lib-fixes-v1-3-b3afadd368c9@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-06 08:29:07 -07:00
Matthieu Baerts (NGI0)
79322174bc selftests: net: lib: avoid error removing empty netns name
If there is an error to create the first netns with 'setup_ns()',
'cleanup_ns()' will be called with an empty string as first parameter.

The consequences is that 'cleanup_ns()' will try to delete an invalid
netns, and wait 20 seconds if the netns list is empty.

Instead of just checking if the name is not empty, convert the string
separated by spaces to an array. Manipulating the array is cleaner, and
calling 'cleanup_ns()' with an empty array will be a no-op.

Fixes: 25ae948b44 ("selftests/net: add lib.sh")
Cc: stable@vger.kernel.org
Acked-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20240605-upstream-net-20240605-selftests-net-lib-fixes-v1-2-b3afadd368c9@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-06 08:29:07 -07:00
Matthieu Baerts (NGI0)
41b02ea4c0 selftests: net: lib: support errexit with busywait
If errexit is enabled ('set -e'), loopy_wait -- or busywait and others
using it -- will stop after the first failure.

Note that if the returned status of loopy_wait is checked, and even if
errexit is enabled, Bash will not stop at the first error.

Fixes: 25ae948b44 ("selftests/net: add lib.sh")
Cc: stable@vger.kernel.org
Acked-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20240605-upstream-net-20240605-selftests-net-lib-fixes-v1-1-b3afadd368c9@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-06 08:29:07 -07:00
Alan Maguire
b24862bac7 selftests/bpf: Add btf_field_iter selftests
The added selftests verify that for every BTF kind we iterate correctly
over consituent strings and ids.

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240605153314.3727466-1-alan.maguire@oracle.com
2024-06-06 15:56:30 +02:00
Yonghong Song
7015843afc selftests/bpf: Fix send_signal test with nested CONFIG_PARAVIRT
Alexei reported that send_signal test may fail with nested CONFIG_PARAVIRT
configs. In this particular case, the base VM is AMD with 166 cpus, and I
run selftests with regular qemu on top of that and indeed send_signal test
failed. I also tried with an Intel box with 80 cpus and there is no issue.

The main qemu command line includes:

  -enable-kvm -smp 16 -cpu host

The failure log looks like:

  $ ./test_progs -t send_signal
  [   48.501588] watchdog: BUG: soft lockup - CPU#9 stuck for 26s! [test_progs:2225]
  [   48.503622] Modules linked in: bpf_testmod(O)
  [   48.503622] CPU: 9 PID: 2225 Comm: test_progs Tainted: G           O       6.9.0-08561-g2c1713a8f1c9-dirty #69
  [   48.507629] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
  [   48.511635] RIP: 0010:handle_softirqs+0x71/0x290
  [   48.511635] Code: [...] 10 0a 00 00 00 31 c0 65 66 89 05 d5 f4 fa 7e fb bb ff ff ff ff <49> c7 c2 cb
  [   48.518527] RSP: 0018:ffffc90000310fa0 EFLAGS: 00000246
  [   48.519579] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: 00000000000006e0
  [   48.522526] RDX: 0000000000000006 RSI: ffff88810791ae80 RDI: 0000000000000000
  [   48.523587] RBP: ffffc90000fabc88 R08: 00000005a0af4f7f R09: 0000000000000000
  [   48.525525] R10: 0000000561d2f29c R11: 0000000000006534 R12: 0000000000000280
  [   48.528525] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
  [   48.528525] FS:  00007f2f2885cd00(0000) GS:ffff888237c40000(0000) knlGS:0000000000000000
  [   48.531600] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [   48.535520] CR2: 00007f2f287059f0 CR3: 0000000106a28002 CR4: 00000000003706f0
  [   48.537538] Call Trace:
  [   48.537538]  <IRQ>
  [   48.537538]  ? watchdog_timer_fn+0x1cd/0x250
  [   48.539590]  ? lockup_detector_update_enable+0x50/0x50
  [   48.539590]  ? __hrtimer_run_queues+0xff/0x280
  [   48.542520]  ? hrtimer_interrupt+0x103/0x230
  [   48.544524]  ? __sysvec_apic_timer_interrupt+0x4f/0x140
  [   48.545522]  ? sysvec_apic_timer_interrupt+0x3a/0x90
  [   48.547612]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
  [   48.547612]  ? handle_softirqs+0x71/0x290
  [   48.547612]  irq_exit_rcu+0x63/0x80
  [   48.551585]  sysvec_apic_timer_interrupt+0x75/0x90
  [   48.552521]  </IRQ>
  [   48.553529]  <TASK>
  [   48.553529]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
  [   48.555609] RIP: 0010:finish_task_switch.isra.0+0x90/0x260
  [   48.556526] Code: [...] 9f 58 0a 00 00 48 85 db 0f 85 89 01 00 00 4c 89 ff e8 53 d9 bd 00 fb 66 90 <4d> 85 ed 74
  [   48.562524] RSP: 0018:ffffc90000fabd38 EFLAGS: 00000282
  [   48.563589] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff83385620
  [   48.563589] RDX: ffff888237c73ae4 RSI: 0000000000000000 RDI: ffff888237c6fd00
  [   48.568521] RBP: ffffc90000fabd68 R08: 0000000000000000 R09: 0000000000000000
  [   48.569528] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8881009d0000
  [   48.573525] R13: ffff8881024e5400 R14: ffff88810791ae80 R15: ffff888237c6fd00
  [   48.575614]  ? finish_task_switch.isra.0+0x8d/0x260
  [   48.576523]  __schedule+0x364/0xac0
  [   48.577535]  schedule+0x2e/0x110
  [   48.578555]  pipe_read+0x301/0x400
  [   48.579589]  ? destroy_sched_domains_rcu+0x30/0x30
  [   48.579589]  vfs_read+0x2b3/0x2f0
  [   48.579589]  ksys_read+0x8b/0xc0
  [   48.583590]  do_syscall_64+0x3d/0xc0
  [   48.583590]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
  [   48.586525] RIP: 0033:0x7f2f28703fa1
  [   48.587592] Code: [...] 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 80 3d c5 23 14 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0
  [   48.593534] RSP: 002b:00007ffd90f8cf88 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
  [   48.595589] RAX: ffffffffffffffda RBX: 00007ffd90f8d5e8 RCX: 00007f2f28703fa1
  [   48.595589] RDX: 0000000000000001 RSI: 00007ffd90f8cfb0 RDI: 0000000000000006
  [   48.599592] RBP: 00007ffd90f8d2f0 R08: 0000000000000064 R09: 0000000000000000
  [   48.602527] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  [   48.603589] R13: 00007ffd90f8d608 R14: 00007f2f288d8000 R15: 0000000000f6bdb0
  [   48.605527]  </TASK>

In the test, two processes are communicating through pipe. Further debugging
with strace found that the above splat is triggered as read() syscall could
not receive the data even if the corresponding write() syscall in another
process successfully wrote data into the pipe.

The failed subtest is "send_signal_perf". The corresponding perf event has
sample_period 1 and config PERF_COUNT_SW_CPU_CLOCK. sample_period 1 means every
overflow event will trigger a call to the BPF program. So I suspect this may
overwhelm the system. So I increased the sample_period to 100,000 and the test
passed. The sample_period 10,000 still has the test failed.

In other parts of selftest, e.g., [1], sample_freq is used instead. So I
decided to use sample_freq = 1,000 since the test can pass as well.

  [1] https://lore.kernel.org/bpf/20240604070700.3032142-1-song@kernel.org/

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20240605201203.2603846-1-yonghong.song@linux.dev
2024-06-06 15:49:13 +02:00
Paolo Abeni
7493328144 Merge branch 'tcp-small-code-reorg'
Eric Dumazet says:

====================
tcp: small code reorg

Replace a WARN_ON_ONCE() that never triggered
to DEBUG_NET_WARN_ON_ONCE() in reqsk_free().

Move inet_reqsk_alloc() and reqsk_alloc()
to inet_connection_sock.c, to unclutter
net/ipv4/tcp_input.c and include/net/request_sock.h
====================

Link: https://lore.kernel.org/r/20240605071553.1365557-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 15:18:06 +02:00
Eric Dumazet
6971d21672 tcp: move reqsk_alloc() to inet_connection_sock.c
reqsk_alloc() has a single caller, no need to expose it
in include/net/request_sock.h.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 15:18:04 +02:00
Eric Dumazet
adbe695a97 tcp: move inet_reqsk_alloc() close to inet_reqsk_clone()
inet_reqsk_alloc() does not belong to tcp_input.c,
move it to inet_connection_sock.c instead.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 15:18:04 +02:00
Eric Dumazet
c34506406d tcp: small changes in reqsk_put() and reqsk_free()
In reqsk_free(), use DEBUG_NET_WARN_ON_ONCE()
instead of WARN_ON_ONCE() for a condition which never fired.

In reqsk_put() directly call __reqsk_free(), there is no
point checking rsk_refcnt again right after a transition to zero.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 15:18:04 +02:00
Paolo Abeni
fe300258a5 Merge branch 'mptcp-misc-cleanups'
Matthieu Baerts says:

====================
mptcp: misc. cleanups

Here is a small collection of miscellaneous cleanups:

- Patch 1 uses an MPTCP helper, instead of a TCP one, to do the same
  thing.

- Patch 2 adds a similar MPTCP helper, instead of using a TCP one
  directly.

- Patch 3 uses more appropriated terms in comments.

Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
====================

Link: https://lore.kernel.org/r/20240605-upstream-net-next-20240604-misc-cleanup-v1-0-ae2e35c3ecc5@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 15:13:49 +02:00
Davide Caratti
92f74c1e05 mptcp: refer to 'MPTCP' socket in comments
We used to call it 'master' socket at the early stages of MPTCP
development, but the correct wording is 'MPTCP' socket opposed to 'TCP
subflows': convert the last 3 comments to use a more appropriate term.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 15:13:47 +02:00
Geliang Tang
5cdedad62e mptcp: add mptcp_space_from_win helper
As a wrapper of __tcp_space_from_win(), this patch adds a MPTCP dedicated
space_from_win helper mptcp_space_from_win() in protocol.h to paired with
mptcp_win_from_space().

Use it instead of __tcp_space_from_win() in both mptcp_rcv_space_adjust()
and mptcp_set_rcvlowat().

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 15:13:47 +02:00
Geliang Tang
5f0d0649c8 mptcp: use mptcp_win_from_space helper
The MPTCP dedicated win_from_space helper mptcp_win_from_space() is defined
in protocol.h, use it in mptcp_rcv_space_adjust() instead of using the TCP
one. Here scaling_ratio is the same as msk->scaling_ratio.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 15:13:47 +02:00
Su Hui
0dcc53abf5 net: ethtool: fix the error condition in ethtool_get_phy_stats_ethtool()
Clang static checker (scan-build) warning:
net/ethtool/ioctl.c:line 2233, column 2
Called function pointer is null (null dereference).

Return '-EOPNOTSUPP' when 'ops->get_ethtool_phy_stats' is NULL to fix
this typo error.

Fixes: 201ed315f9 ("net/ethtool/ioctl: split ethtool_get_phy_stats into multiple helpers")
Signed-off-by: Su Hui <suhui@nfschina.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Hariprasad Kelam <hkelam@marvell.com>
Link: https://lore.kernel.org/r/20240605034742.921751-1-suhui@nfschina.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 13:34:33 +02:00
Jason Xing
9b6a30febd net: allow rps/rfs related configs to be switched
After John Sperbeck reported a compile error if the CONFIG_RFS_ACCEL
is off, I found that I cannot easily enable/disable the config
because of lack of the prompt when using 'make menuconfig'. Therefore,
I decided to change rps/rfc related configs altogether.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Link: https://lore.kernel.org/r/20240605022932.33703-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 13:18:48 +02:00
Eric Dumazet
b01e1c0307 ipv6: fix possible race in __fib6_drop_pcpu_from()
syzbot found a race in __fib6_drop_pcpu_from() [1]

If compiler reads more than once (*ppcpu_rt),
second read could read NULL, if another cpu clears
the value in rt6_get_pcpu_route().

Add a READ_ONCE() to prevent this race.

Also add rcu_read_lock()/rcu_read_unlock() because
we rely on RCU protection while dereferencing pcpu_rt.

[1]

Oops: general protection fault, probably for non-canonical address 0xdffffc0000000012: 0000 [#1] PREEMPT SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000090-0x0000000000000097]
CPU: 0 PID: 7543 Comm: kworker/u8:17 Not tainted 6.10.0-rc1-syzkaller-00013-g2bfcfd584ff5 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024
Workqueue: netns cleanup_net
 RIP: 0010:__fib6_drop_pcpu_from.part.0+0x10a/0x370 net/ipv6/ip6_fib.c:984
Code: f8 48 c1 e8 03 80 3c 28 00 0f 85 16 02 00 00 4d 8b 3f 4d 85 ff 74 31 e8 74 a7 fa f7 49 8d bf 90 00 00 00 48 89 f8 48 c1 e8 03 <80> 3c 28 00 0f 85 1e 02 00 00 49 8b 87 90 00 00 00 48 8b 0c 24 48
RSP: 0018:ffffc900040df070 EFLAGS: 00010206
RAX: 0000000000000012 RBX: 0000000000000001 RCX: ffffffff89932e16
RDX: ffff888049dd1e00 RSI: ffffffff89932d7c RDI: 0000000000000091
RBP: dffffc0000000000 R08: 0000000000000005 R09: 0000000000000007
R10: 0000000000000001 R11: 0000000000000006 R12: ffff88807fa080b8
R13: fffffbfff1a9a07d R14: ffffed100ff41022 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff8880b9200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b32c26000 CR3: 000000005d56e000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
  __fib6_drop_pcpu_from net/ipv6/ip6_fib.c:966 [inline]
  fib6_drop_pcpu_from net/ipv6/ip6_fib.c:1027 [inline]
  fib6_purge_rt+0x7f2/0x9f0 net/ipv6/ip6_fib.c:1038
  fib6_del_route net/ipv6/ip6_fib.c:1998 [inline]
  fib6_del+0xa70/0x17b0 net/ipv6/ip6_fib.c:2043
  fib6_clean_node+0x426/0x5b0 net/ipv6/ip6_fib.c:2205
  fib6_walk_continue+0x44f/0x8d0 net/ipv6/ip6_fib.c:2127
  fib6_walk+0x182/0x370 net/ipv6/ip6_fib.c:2175
  fib6_clean_tree+0xd7/0x120 net/ipv6/ip6_fib.c:2255
  __fib6_clean_all+0x100/0x2d0 net/ipv6/ip6_fib.c:2271
  rt6_sync_down_dev net/ipv6/route.c:4906 [inline]
  rt6_disable_ip+0x7ed/0xa00 net/ipv6/route.c:4911
  addrconf_ifdown.isra.0+0x117/0x1b40 net/ipv6/addrconf.c:3855
  addrconf_notify+0x223/0x19e0 net/ipv6/addrconf.c:3778
  notifier_call_chain+0xb9/0x410 kernel/notifier.c:93
  call_netdevice_notifiers_info+0xbe/0x140 net/core/dev.c:1992
  call_netdevice_notifiers_extack net/core/dev.c:2030 [inline]
  call_netdevice_notifiers net/core/dev.c:2044 [inline]
  dev_close_many+0x333/0x6a0 net/core/dev.c:1585
  unregister_netdevice_many_notify+0x46d/0x19f0 net/core/dev.c:11193
  unregister_netdevice_many net/core/dev.c:11276 [inline]
  default_device_exit_batch+0x85b/0xae0 net/core/dev.c:11759
  ops_exit_list+0x128/0x180 net/core/net_namespace.c:178
  cleanup_net+0x5b7/0xbf0 net/core/net_namespace.c:640
  process_one_work+0x9fb/0x1b60 kernel/workqueue.c:3231
  process_scheduled_works kernel/workqueue.c:3312 [inline]
  worker_thread+0x6c8/0xf70 kernel/workqueue.c:3393
  kthread+0x2c1/0x3a0 kernel/kthread.c:389
  ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

Fixes: d52d3997f8 ("ipv6: Create percpu rt6_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20240604193549.981839-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 13:05:54 +02:00
Paolo Abeni
411c0ea696 Merge branch 'af_unix-fix-lockless-access-of-sk-sk_state-and-others-fields'
Kuniyuki Iwashima says:

====================
af_unix: Fix lockless access of sk->sk_state and others fields.

The patch 1 fixes a bug where SOCK_DGRAM's sk->sk_state is changed
to TCP_CLOSE even if the socket is connect()ed to another socket.

The rest of this series annotates lockless accesses to the following
fields.

  * sk->sk_state
  * sk->sk_sndbuf
  * net->unx.sysctl_max_dgram_qlen
  * sk->sk_receive_queue.qlen
  * sk->sk_shutdown

Note that with this series there is skb_queue_empty() left in
unix_dgram_disconnected() that needs to be changed to lockless
version, and unix_peer(other) access there should be protected
by unix_state_lock().

This will require some refactoring, so another series will follow.

Changes:
  v2:
    * Patch 1: Fix wrong double lock

  v1: https://lore.kernel.org/netdev/20240603143231.62085-1-kuniyu@amazon.com/
====================

Link: https://lore.kernel.org/r/20240604165241.44758-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 12:57:18 +02:00
Kuniyuki Iwashima
efaf24e30e af_unix: Annotate data-race of sk->sk_shutdown in sk_diag_fill().
While dumping sockets via UNIX_DIAG, we do not hold unix_state_lock().

Let's use READ_ONCE() to read sk->sk_shutdown.

Fixes: e4e541a848 ("sock-diag: Report shutdown for inet and unix sockets (v2)")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 12:57:15 +02:00
Kuniyuki Iwashima
5d915e584d af_unix: Use skb_queue_len_lockless() in sk_diag_show_rqlen().
We can dump the socket queue length via UNIX_DIAG by specifying
UDIAG_SHOW_RQLEN.

If sk->sk_state is TCP_LISTEN, we return the recv queue length,
but here we do not hold recvq lock.

Let's use skb_queue_len_lockless() in sk_diag_show_rqlen().

Fixes: c9da99e647 ("unix_diag: Fixup RQLEN extension report")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 12:57:15 +02:00
Kuniyuki Iwashima
83690b82d2 af_unix: Use skb_queue_empty_lockless() in unix_release_sock().
If the socket type is SOCK_STREAM or SOCK_SEQPACKET, unix_release_sock()
checks the length of the peer socket's recvq under unix_state_lock().

However, unix_stream_read_generic() calls skb_unlink() after releasing
the lock.  Also, for SOCK_SEQPACKET, __skb_try_recv_datagram() unlinks
skb without unix_state_lock().

Thues, unix_state_lock() does not protect qlen.

Let's use skb_queue_empty_lockless() in unix_release_sock().

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 12:57:15 +02:00
Kuniyuki Iwashima
45d872f0e6 af_unix: Use unix_recvq_full_lockless() in unix_stream_connect().
Once sk->sk_state is changed to TCP_LISTEN, it never changes.

unix_accept() takes advantage of this characteristics; it does not
hold the listener's unix_state_lock() and only acquires recvq lock
to pop one skb.

It means unix_state_lock() does not prevent the queue length from
changing in unix_stream_connect().

Thus, we need to use unix_recvq_full_lockless() to avoid data-race.

Now we remove unix_recvq_full() as no one uses it.

Note that we can remove READ_ONCE() for sk->sk_max_ack_backlog in
unix_recvq_full_lockless() because of the following reasons:

  (1) For SOCK_DGRAM, it is a written-once field in unix_create1()

  (2) For SOCK_STREAM and SOCK_SEQPACKET, it is changed under the
      listener's unix_state_lock() in unix_listen(), and we hold
      the lock in unix_stream_connect()

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 12:57:15 +02:00
Kuniyuki Iwashima
bd9f2d0573 af_unix: Annotate data-race of net->unx.sysctl_max_dgram_qlen.
net->unx.sysctl_max_dgram_qlen is exposed as a sysctl knob and can be
changed concurrently.

Let's use READ_ONCE() in unix_create1().

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 12:57:15 +02:00
Kuniyuki Iwashima
b0632e53e0 af_unix: Annotate data-races around sk->sk_sndbuf.
sk_setsockopt() changes sk->sk_sndbuf under lock_sock(), but it's
not used in af_unix.c.

Let's use READ_ONCE() to read sk->sk_sndbuf in unix_writable(),
unix_dgram_sendmsg(), and unix_stream_sendmsg().

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 12:57:15 +02:00
Kuniyuki Iwashima
0aa3be7b3e af_unix: Annotate data-races around sk->sk_state in UNIX_DIAG.
While dumping AF_UNIX sockets via UNIX_DIAG, sk->sk_state is read
locklessly.

Let's use READ_ONCE() there.

Note that the result could be inconsistent if the socket is dumped
during the state change.  This is common for other SOCK_DIAG and
similar interfaces.

Fixes: c9da99e647 ("unix_diag: Fixup RQLEN extension report")
Fixes: 2aac7a2cb0 ("unix_diag: Pending connections IDs NLA")
Fixes: 45a96b9be6 ("unix_diag: Dumping all sockets core")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 12:57:15 +02:00
Kuniyuki Iwashima
af4c733b6b af_unix: Annotate data-race of sk->sk_state in unix_stream_read_skb().
unix_stream_read_skb() is called from sk->sk_data_ready() context
where unix_state_lock() is not held.

Let's use READ_ONCE() there.

Fixes: 77462de14a ("af_unix: Add read_sock for stream socket types")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 12:57:14 +02:00
Kuniyuki Iwashima
8a34d4e8d9 af_unix: Annotate data-races around sk->sk_state in sendmsg() and recvmsg().
The following functions read sk->sk_state locklessly and proceed only if
the state is TCP_ESTABLISHED.

  * unix_stream_sendmsg
  * unix_stream_read_generic
  * unix_seqpacket_sendmsg
  * unix_seqpacket_recvmsg

Let's use READ_ONCE() there.

Fixes: a05d2ad1c1 ("af_unix: Only allow recv on connected seqpacket sockets.")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 12:57:14 +02:00
Kuniyuki Iwashima
1b536948e8 af_unix: Annotate data-race of sk->sk_state in unix_accept().
Once sk->sk_state is changed to TCP_LISTEN, it never changes.

unix_accept() takes the advantage and reads sk->sk_state without
holding unix_state_lock().

Let's use READ_ONCE() there.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 12:57:14 +02:00
Kuniyuki Iwashima
a9bf9c7dc6 af_unix: Annotate data-race of sk->sk_state in unix_stream_connect().
As small optimisation, unix_stream_connect() prefetches the client's
sk->sk_state without unix_state_lock() and checks if it's TCP_CLOSE.

Later, sk->sk_state is checked again under unix_state_lock().

Let's use READ_ONCE() for the first check and TCP_CLOSE directly for
the second check.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 12:57:14 +02:00
Kuniyuki Iwashima
eb0718fb3e af_unix: Annotate data-races around sk->sk_state in unix_write_space() and poll().
unix_poll() and unix_dgram_poll() read sk->sk_state locklessly and
calls unix_writable() which also reads sk->sk_state without holding
unix_state_lock().

Let's use READ_ONCE() in unix_poll() and unix_dgram_poll() and pass
it to unix_writable().

While at it, we remove TCP_SYN_SENT check in unix_dgram_poll() as
that state does not exist for AF_UNIX socket since the code was added.

Fixes: 1586a5877d ("af_unix: do not report POLLOUT on listeners")
Fixes: 3c73419c09 ("af_unix: fix 'poll for write'/ connected DGRAM sockets")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 12:57:14 +02:00
Kuniyuki Iwashima
3a0f38eb28 af_unix: Annotate data-race of sk->sk_state in unix_inq_len().
ioctl(SIOCINQ) calls unix_inq_len() that checks sk->sk_state first
and returns -EINVAL if it's TCP_LISTEN.

Then, for SOCK_STREAM sockets, unix_inq_len() returns the number of
bytes in recvq.

However, unix_inq_len() does not hold unix_state_lock(), and the
concurrent listen() might change the state after checking sk->sk_state.

If the race occurs, 0 is returned for the listener, instead of -EINVAL,
because the length of skb with embryo is 0.

We could hold unix_state_lock() in unix_inq_len(), but it's overkill
given the result is true for pre-listen() TCP_CLOSE state.

So, let's use READ_ONCE() for sk->sk_state in unix_inq_len().

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 12:57:14 +02:00
Kuniyuki Iwashima
942238f973 af_unix: Annodate data-races around sk->sk_state for writers.
sk->sk_state is changed under unix_state_lock(), but it's read locklessly
in many places.

This patch adds WRITE_ONCE() on the writer side.

We will add READ_ONCE() to the lockless readers in the following patches.

Fixes: 83301b5367 ("af_unix: Set TCP_ESTABLISHED for datagram sockets too")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 12:57:14 +02:00
Kuniyuki Iwashima
26bfb8b570 af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
When a SOCK_DGRAM socket connect()s to another socket, the both sockets'
sk->sk_state are changed to TCP_ESTABLISHED so that we can register them
to BPF SOCKMAP.

When the socket disconnects from the peer by connect(AF_UNSPEC), the state
is set back to TCP_CLOSE.

Then, the peer's state is also set to TCP_CLOSE, but the update is done
locklessly and unconditionally.

Let's say socket A connect()ed to B, B connect()ed to C, and A disconnects
from B.

After the first two connect()s, all three sockets' sk->sk_state are
TCP_ESTABLISHED:

  $ ss -xa
  Netid State  Recv-Q Send-Q  Local Address:Port  Peer Address:PortProcess
  u_dgr ESTAB  0      0       @A 641              * 642
  u_dgr ESTAB  0      0       @B 642              * 643
  u_dgr ESTAB  0      0       @C 643              * 0

And after the disconnect, B's state is TCP_CLOSE even though it's still
connected to C and C's state is TCP_ESTABLISHED.

  $ ss -xa
  Netid State  Recv-Q Send-Q  Local Address:Port  Peer Address:PortProcess
  u_dgr UNCONN 0      0       @A 641              * 0
  u_dgr UNCONN 0      0       @B 642              * 643
  u_dgr ESTAB  0      0       @C 643              * 0

In this case, we cannot register B to SOCKMAP.

So, when a socket disconnects from the peer, we should not set TCP_CLOSE to
the peer if the peer is connected to yet another socket, and this must be
done under unix_state_lock().

Note that we use WRITE_ONCE() for sk->sk_state as there are many lockless
readers.  These data-races will be fixed in the following patches.

Fixes: 83301b5367 ("af_unix: Set TCP_ESTABLISHED for datagram sockets too")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 12:57:14 +02:00
Eric Dumazet
98aa546af5 inet: remove (struct uncached_list)->quarantine
This list is used to tranfert dst that are handled by
rt_flush_dev() and rt6_uncached_list_flush_dev() out
of the per-cpu lists.

But quarantine list is not used later.

If we simply use list_del_init(&rt->dst.rt_uncached),
this also removes the dst from per-cpu list.

This patch also makes the future calls to rt_del_uncached_list()
and rt6_uncached_list_del() faster, because no spinlock
acquisition is needed anymore.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240604165150.726382-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 12:33:25 +02:00
Eric Dumazet
b4cb4a1391 net: use unrcu_pointer() helper
Toke mentioned unrcu_pointer() existence, allowing
to remove some of the ugly casts we have when using
xchg() for rcu protected pointers.

Also make inet_rcv_compat const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20240604111603.45871-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 11:52:52 +02:00
Aleksandr Mishin
b0c9a26435 net: wwan: iosm: Fix tainted pointer delete is case of region creation fail
In case of region creation fail in ipc_devlink_create_region(), previously
created regions delete process starts from tainted pointer which actually
holds error code value.
Fix this bug by decreasing region index before delete.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: 4dcd183fbd ("net: wwan: iosm: devlink registration")
Signed-off-by: Aleksandr Mishin <amishin@t-argos.ru>
Acked-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240604082500.20769-1-amishin@t-argos.ru
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 10:15:14 +02:00
Paolo Abeni
59d0f48160 Merge branch 'improve-gbeth-performance-on-renesas-rz-g2l-and-related-socs'
Paul Barker says:

====================
Improve GbEth performance on Renesas RZ/G2L and related SoCs

This series aims to improve performance of the GbEth IP in the Renesas
RZ/G2L SoC family and the RZ/G3S SoC, which use the ravb driver. Along
the way, we do some refactoring and ensure that napi_complete_done() is
used in accordance with the NAPI documentation for both GbEth and R-Car
code paths.

Much of the performance improvement comes from enabling SW IRQ
Coalescing for all SoCs using the GbEth IP, and NAPI Threaded mode for
single core SoCs using the GbEth IP. These can be enabled/disabled at
runtime via sysfs, but our goal is to set sensible defaults which get
good performance on the affected SoCs.

The rest of the performance improvement comes from using a page pool to
allocate RX buffers, and reducing the allocation size from >8kB to 2kB.

The overall performance impact of this patch series seen in testing with
iperf3 is as follows (see patches 5-7 for more detailed results):
  * RZ/G2L:
    * TCP TX: +1.8% bandwidth
    * TCP RX: +1% bandwidth at 47% less CPU load
    * UDP RX: +1% bandwidth at 26% less CPU load

  * RZ/G2UL:
    * TCP TX: +37% bandwidth
    * TCP RX: +43% bandwidth
    * UDP TX: -8% bandwidth
    * UDP RX: +32500% bandwidth (!)

  * RZ/G3S:
    * TCP TX: +25% bandwidth
    * TCP RX: +76% bandwidth
    * UDP TX: -9% bandwidth
    * UDP RX: +37900% bandwidth (!)

  * RZ/Five:
    * TCP TX: +18% bandwidth
    * TCP RX: +212% bandwidth
    * UDP TX: +2% bandwidth
    * UDP RX: +inf bandwidth (test no longer crashes)

There is no significant impact on bandwidth or CPU load in testing on
RZ/G2H or R-Car M3N.

Fixing the crash in UDP RX testing for RZ/Five is a cumulative effect of
patches 1, 2, 5 & 6 so this is very difficult to break out as a bugfix
for backporting.

Changes v4->v5:
  * Added Sergey's Reviewed-by tags.
  * Improved the commit message for patch 2/7.
  * Re-wrapped to 80 cols, except where this would significantly impact
    readability.
  * Use lower case `skb` consistently in comments.
  * Included <net/page_pool/types.h> in ravb.h.
  * Moved rx_buffer_size so it is in the same place in ravb_hw_info as
    rx_max_desc_use was previously.
  * Used reverse xmas tree ordering in variable declarations.
  * Split lines after binary operators, instead of before.
  * Factor subtraction of sizeof(__sum16) out of the if condition in
    ravb_rx_csum_gbeth().
  * Add blank lines after variable declarations where needed.
  * Used goto instead of break to handle napi_build_skb() failure in
    ravb_rx_gbeth(). Break was incorrectly scoped to the surrounding
    switch statement, when it's the outer loop we really want to break
    out of.
  * Used continue instead of break to handle NULL priv->rx_1st_skb in
    ravb_rx_gbeth() as we may still be able to process further
    descriptors.
  * Unconditionally set priv->rx_1st_skb = NULL after processing a
    packet in ravb_rx_gbeth(). We don't need to check die_dt as this
    will be a no-op for single descriptor packets.
  * Moved napi_build_skb() call after dma_sync_single_for_cpu() in
    ravb_rx_rcar() to align the order of operations with ravb_rx_gbeth()
    and ensure the data is sync'd before it is accessed.
  * Moved zeroing of rx_buff->page to the end of packet processing in
    ravb_rx_rcar() to align the order of operations with
    ravb_rx_gbeth().

Changes v3->v4:
  * Dependency patches have merged so this is no longer an RFC.
  * Fixed update of stats->rx_packets.
  * Simplified refactoring following feedback from Niklas and Sergey.
  * Renamed needs_irq_coalesce -> coalesce_irqs.
  * Used a separate page pool for each RX queue.
  * Passed struct ravb_rx_desc to ravb_alloc_rx_buffer() so that we can
    simplify the calling function.
  * Explained the calculation of rx_desc->ds_cc.
  * Added handling of nonlinear SKBs in ravb_rx_csum_gbeth().
  * Used Niklas' suggested commit message for patch 2/7.
  * Added Sergey's Reviewed-by tags to patches 5/7 and 6/7.

Changes v2->v3:
  * Incorporated feedback on RFC v2 from Sergey.
  * Split out bugfixes and rebased. This changed the order of what was
    the first 5 patches of v2 and things look a little different so I've
    not picked up Reviewed-by tags from v2.
  * Further refactoring and tidy up of RX ring refill and
    ravb_rx_gbeth().
  * Switched to using a page pool to allocate RX buffers.
  * Re-tested and provided updated performance figures.

Changes v1->v2:
  * Marked as RFC as the series depends on unmerged patches.
  * Refactored R-Car code paths as well as GbEth code paths.
  * Updated references to the patches this series depends on.
====================

Link: https://lore.kernel.org/r/20240604072825.7490-1-paul.barker.ct@bp.renesas.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 10:00:02 +02:00
Paul Barker
966726324b net: ravb: Allocate RX buffers via page pool
This patch makes multiple changes that can't be separated:

  1) Allocate plain RX buffers via a page pool instead of allocating
     SKBs, then use build_skb() when a packet is received.
  2) For GbEth IP, reduce the RX buffer size to 2kB.
  3) For GbEth IP, merge packets which span more than one RX descriptor
     as SKB fragments instead of copying data.

Implementing (1) without (2) would require the use of an order-1 page
pool (instead of an order-0 page pool split into page fragments) for
GbEth.

Implementing (2) without (3) would leave us no space to re-assemble
packets which span more than one RX descriptor.

Implementing (3) without (1) would not be possible as the network stack
expects to use put_page() or page_pool_put_page() to free SKB fragments
after an SKB is consumed.

RX checksum offload support is adjusted to handle both linear and
nonlinear (fragmented) packets.

This patch gives the following improvements during testing with iperf3.

  * RZ/G2L:
    * TCP RX: same bandwidth at -43% CPU load (70% -> 40%)
    * UDP RX: same bandwidth at -17% CPU load (88% -> 74%)

  * RZ/G2UL:
    * TCP RX: +30% bandwidth (726Mbps -> 941Mbps)
    * UDP RX: +417% bandwidth (108Mbps -> 558Mbps)

  * RZ/G3S:
    * TCP RX: +64% bandwidth (562Mbps -> 920Mbps)
    * UDP RX: +420% bandwidth (90Mbps -> 468Mbps)

  * RZ/Five:
    * TCP RX: +217% bandwidth (145Mbps -> 459Mbps)
    * UDP RX: +470% bandwidth (20Mbps -> 114Mbps)

There is no significant impact on bandwidth or CPU load in testing on
RZ/G2H or R-Car M3N.

Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 09:59:58 +02:00
Paul Barker
65c482bc22 net: ravb: Use NAPI threaded mode on 1-core CPUs with GbEth IP
NAPI Threaded mode (along with the previously enabled SW IRQ Coalescing)
is required to improve network stack performance for single core SoCs
using the GbEth IP (currently the RZ/G2L SoC family and the RZ/G3S SoC).

This patch gives the following improvements during testing with iperf3.

  * RZ/G2UL:
    * TCP TX: +32% bandwidth (638Mbps -> 841Mbps)
    * TXP RX: +8.8% bandwidth (667Mbps -> 726Mbps)
    * UDP RX: +104% bandwidth (53Mbps -> 108Mbps)

  * RZ/G3S:
    * TCP TX: 29% bandwidth (529Mbps -> 681Mbps)
    * UDP RX: +1290% bandwidth (6.46Mbps -> 90Mbps)

  * RZ/Five:
    * UDP RX: Test no longer crashes (0 -> 20 Mbps)

This patch gives the following reductions in performance in the same
testing:

  * RZ/G2UL:
    * UDP TX: -7.5% bandwidth (594Mbps -> 549Mbps)

  * RZ/G3S:
    * UDP TX: -5% bandwidth (625Mbps -> 594Mbps)

These losses are considered acceptable given the benefits shown above.
If UDP TX bandwidth must be maximised for a particular use case, NAPI
threaded mode can be disabled at runtime via sysfs writes.

The improvement of UDP RX bandwidth for the single core SoCs (RZ/G2UL &
RZ/G3S) is particularly critical.

Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 09:59:58 +02:00
Paul Barker
7b39c1814c net: ravb: Enable SW IRQ Coalescing for GbEth
Software IRQ Coalescing is required to improve network stack performance
in the RZ/G2L SoC family and the RZ/G3S SoC, i.e. the SoCs which use the
GbEth IP.

This patch gives the following improvements during testing with iperf3:

  * RZ/G2L:
    * TCP RX: same bandwidth with -6% CPU load (76% -> 71%)
    * UDP RX: same bandwidth with -10% CPU load (99% -> 89%)

  * RZ/G2UL:
    * UDP RX: +4200% bandwidth (1.23Mbps -> 53Mbps)

  * RZ/G3S:
    * UDP RX: +425% bandwidth (1.23Mbps -> 6.46Mbps)

The improvement of UDP RX bandwidth for the single core SoCs (RZ/G2UL &
RZ/G3S) is particularly critical.

Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 09:59:58 +02:00
Paul Barker
3ee43f09cb net: ravb: Refactor GbEth RX code path
We can reduce code duplication in ravb_rx_gbeth().

Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 09:59:57 +02:00
Paul Barker
37a01c12e9 net: ravb: Refactor RX ring refill
To reduce code duplication, we add a new RX ring refill function which
can handle both the initial RX ring population (which was split between
ravb_ring_init() and ravb_ring_format()) and the RX ring refill after
polling (in ravb_rx()).

Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 09:59:57 +02:00
Paul Barker
b0e0e20dc6 net: ravb: Align poll function with NAPI docs
Align ravb_poll() with the documentation in
`Documentation/networking/kapi.rst` and
`Documentation/networking/napi.rst`.

The documentation says that we should prefer napi_complete_done() over
napi_complete(), and using the former allows us to properly support busy
polling. We should ensure that napi_complete_done() is only called if
the work budget has not been exhausted, and we should only re-arm
interrupts if it returns true.

Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 09:59:57 +02:00
Paul Barker
118e640af3 net: ravb: Simplify poll & receive functions
We don't need to pass the work budget to ravb_rx() by reference, it's
cleaner to pass this by value and return the amount of work done. This
allows us to simplify the ravb_poll() function and use the common
`work_done` variable name seen in other network drivers for consistency
and ease of understanding.

This is a pure refactor and should not affect behaviour.

Signed-off-by: Paul Barker <paul.barker.ct@bp.renesas.com>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-06-06 09:59:57 +02:00
Jakub Kicinski
7da375e2c7 Merge branch 'net-mlx5e-shampo-enable-hw-gro-once-more'
Tariq Toukan says:

====================
net/mlx5e: SHAMPO, Enable HW GRO once more

This series enables hardware GRO for ConnectX-7 and newer NICs.
SHAMPO stands for Split Header And Merge Payload Offload.

The first part of the series contains important fixes and improvements.

The second part reworks the HW GRO counters.

Lastly, HW GRO is perf optimized and enabled.

Here are the bandwidth numbers for a simple iperf3 test over a single rq
where the application and irq are pinned to the same CPU:

+---------+--------+--------+-----------+-------------+
| streams | SW GRO | HW GRO | Unit      | Improvement |
+---------+--------+--------+-----------+-------------+
| 1       | 36     | 57     | Gbits/sec |    1.6 x    |
| 4       | 34     | 50     | Gbits/sec |    1.5 x    |
| 8       | 31     | 43     | Gbits/sec |    1.4 x    |
+---------+--------+--------+-----------+-------------+

Benchmark details:
VM based setup
CPU: Intel(R) Xeon(R) Platinum 8380 CPU, 24 cores
NIC: ConnectX-7 100GbE
iperf3 and irq running on same CPU over a single receive queue
====================

Link: https://lore.kernel.org/r/20240603212219.1037656-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05 20:20:47 -07:00
Dragos Tatulea
14ae2fd12b net/mlx5e: SHAMPO, Coalesce skb fragments to page size
When doing hardware GRO (SHAMPO), the driver puts each data payload of a
packet from the wire into one skb fragment. TCP Zero-Copy expects page
sized skb fragments to be able to do it's page-flipping magic. With the
current way of arranging fragments by the driver, only specific MTUs
(page sized multiple + header size) will yield such page sized fragments
in a high percentage.

This change improves payload arrangement in the skb for hardware GRO by
coalescing payloads into a single skb fragment when possible.

To demonstrate the fix, running tcp_mmap with a MTU of 1500 yields:
- Before:  0 % bytes mmap'ed
- After : 81 % bytes mmap'ed

More importantly, coalescing considerably improves the HW GRO performance.
Here are the results for a iperf3 bandwidth benchmark:
+---------+--------+--------+------------------------+-----------+
| streams | SW GRO | HW GRO | HW GRO with coalescing | Unit      |
|---------+--------+--------+------------------------+-----------|
| 1       | 36     | 42     | 57                     | Gbits/sec |
| 4       | 34     | 39     | 50                     | Gbits/sec |
| 8       | 31     | 35     | 43                     | Gbits/sec |
+---------+--------+--------+------------------------+-----------+

Benchmark details:
VM based setup
CPU: Intel(R) Xeon(R) Platinum 8380 CPU, 24 cores
NIC: ConnectX-7 100GbE
iperf3 and irq running on same CPU over a single receive queue

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240603212219.1037656-15-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-05 20:20:46 -07:00