Commit 0227f058aa ("net/smc: Unbind r/w buffer size from clcsock
and make them tunable") introduced the net.smc.rmem and net.smc.wmem
sysctls to specify the size of buffers to be used for SMC type
connections. This created a regression for users that specified the
buffer size via setsockopt() as the effective buffer size was now
doubled.
Re-introduce the division by 2 in the SMC buffer create code and level
this out by duplicating the net.smc.[rw]mem values used for initializing
sk_rcvbuf/sk_sndbuf at socket creation time. This gives users of both
methods (setsockopt or sysctl) the effective buffer size that they
expect.
Initialize net.smc.[rw]mem from its own constant of 64kB, respectively.
Internal performance tests show that this value is a good compromise
between throughput/latency and memory consumption. Also, this decouples
it from any tuning that was done to net.ipv4.tcp_[rw]mem[1] before the
module for SMC protocol was loaded. Check that no more than INT_MAX / 2
is assigned to net.smc.[rw]mem, in order to avoid any overflow condition
when that is doubled for use in sk_sndbuf or sk_rcvbuf.
While at it, drop the confusing sk_buf_size variable from
__smc_buf_create and name "compressed" buffer size variables more
consistently.
Background:
Before the commit mentioned above, SMC's buffer allocator in
__smc_buf_create() always used half of the sockets' sk_rcvbuf/sk_sndbuf
value as initial value to search for appropriate buffers. If the search
resorted to using a bigger buffer when all buffers of the specified
size were busy, the duplicate of the used effective buffer size is
stored back to sk_rcvbuf/sk_sndbuf.
When available, buffers of exactly the size that a user had specified as
input to setsockopt() were used, despite setsockopt()'s documentation in
"man 7 socket" talking of a mandatory duplication:
[...]
SO_SNDBUF
Sets or gets the maximum socket send buffer in bytes.
The kernel doubles this value (to allow space for book‐
keeping overhead) when it is set using setsockopt(2),
and this doubled value is returned by getsockopt(2).
The default value is set by the
/proc/sys/net/core/wmem_default file and the maximum
allowed value is set by the /proc/sys/net/core/wmem_max
file. The minimum (doubled) value for this option is
2048.
[...]
Fixes: 0227f058aa ("net/smc: Unbind r/w buffer size from clcsock and make them tunable")
Co-developed-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit f9aab6f2ce ("net/smc: immediate freeing in smc_lgr_cleanup_early()")
left behind smc_lgr_schedule_free_work_fast() declaration.
And since commit 349d43127d ("net/smc: fix kernel panic caused by race of smc_sock")
smc_ib_modify_qp_reset() is not used anymore.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Link: https://lore.kernel.org/r/20230729121929.17180-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sk->sk_mark is often read while another thread could change the value.
Fixes: 4a19ec5800 ("[NET]: Introducing socket mark socket option.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Drop the smc_sendpage() code as smc_sendmsg() just passes the call down to
the underlying TCP socket and smc_tx_sendpage() is just a wrapper around
its sendmsg implementation.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Karsten Graul <kgraul@linux.ibm.com>
cc: Wenjia Zhang <wenjia@linux.ibm.com>
cc: Jan Karcher <jaka@linux.ibm.com>
cc: "D. Wythe" <alibuda@linux.alibaba.com>
cc: Tony Lu <tonylu@linux.alibaba.com>
cc: Wen Gu <guwen@linux.alibaba.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
Link: https://lore.kernel.org/r/20230623225513.2732256-10-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As MSG_SENDPAGE_NOTLAST is being phased out along with sendpage(), don't
use it further in than the sendpage methods, but rather translate it to
MSG_MORE and use that instead.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: Bernard Metzler <bmt@zurich.ibm.com>
cc: Jason Gunthorpe <jgg@ziepe.ca>
cc: Leon Romanovsky <leon@kernel.org>
cc: John Fastabend <john.fastabend@gmail.com>
cc: Jakub Sitnicki <jakub@cloudflare.com>
cc: David Ahern <dsahern@kernel.org>
cc: Karsten Graul <kgraul@linux.ibm.com>
cc: Wenjia Zhang <wenjia@linux.ibm.com>
cc: Jan Karcher <jaka@linux.ibm.com>
cc: "D. Wythe" <alibuda@linux.alibaba.com>
cc: Tony Lu <tonylu@linux.alibaba.com>
cc: Wen Gu <guwen@linux.alibaba.com>
cc: Boris Pismenny <borisp@nvidia.com>
cc: Steffen Klassert <steffen.klassert@secunet.com>
cc: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lore.kernel.org/r/20230623225513.2732256-2-dhowells@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
SMCRv1 has a similar issue to SMCRv2 (see link below) that may access
invalid MRs of RMBs when construct LLC ADD LINK CONT messages.
BUG: kernel NULL pointer dereference, address: 0000000000000014
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 5 PID: 48 Comm: kworker/5:0 Kdump: loaded Tainted: G W E 6.4.0-rc3+ #49
Workqueue: events smc_llc_add_link_work [smc]
RIP: 0010:smc_llc_add_link_cont+0x160/0x270 [smc]
RSP: 0018:ffffa737801d3d50 EFLAGS: 00010286
RAX: ffff964f82144000 RBX: ffffa737801d3dd8 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff964f81370c30
RBP: ffffa737801d3dd4 R08: ffff964f81370000 R09: ffffa737801d3db0
R10: 0000000000000001 R11: 0000000000000060 R12: ffff964f82e70000
R13: ffff964f81370c38 R14: ffffa737801d3dd3 R15: 0000000000000001
FS: 0000000000000000(0000) GS:ffff9652bfd40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000014 CR3: 000000008fa20004 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
smc_llc_srv_rkey_exchange+0xa7/0x190 [smc]
smc_llc_srv_add_link+0x3ae/0x5a0 [smc]
smc_llc_add_link_work+0xb8/0x140 [smc]
process_one_work+0x1e5/0x3f0
worker_thread+0x4d/0x2f0
? __pfx_worker_thread+0x10/0x10
kthread+0xe5/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2c/0x50
</TASK>
When an alernate RNIC is available in system, SMC will try to add a new
link based on the RNIC for resilience. All the RMBs in use will be mapped
to the new link. Then the RMBs' MRs corresponding to the new link will
be filled into LLC messages. For SMCRv1, they are ADD LINK CONT messages.
However smc_llc_add_link_cont() may mistakenly access to unused RMBs which
haven't been mapped to the new link and have no valid MRs, thus causing a
crash. So this patch fixes it.
Fixes: 87f88cda21 ("net/smc: rkey processing for a new link as SMC client")
Link: https://lore.kernel.org/r/1685101741-74826-3-git-send-email-guwen@linux.alibaba.com
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We encountered a crash when using SMCRv2. It is caused by a logical
error in smc_llc_fill_ext_v2().
BUG: kernel NULL pointer dereference, address: 0000000000000014
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 7 PID: 453 Comm: kworker/7:4 Kdump: loaded Tainted: G W E 6.4.0-rc3+ #44
Workqueue: events smc_llc_add_link_work [smc]
RIP: 0010:smc_llc_fill_ext_v2+0x117/0x280 [smc]
RSP: 0018:ffffacb5c064bd88 EFLAGS: 00010282
RAX: ffff9a6bc1c3c02c RBX: ffff9a6be3558000 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 0000000000000002 RDI: 000000000000000a
RBP: ffffacb5c064bdb8 R08: 0000000000000040 R09: 000000000000000c
R10: ffff9a6bc0910300 R11: 0000000000000002 R12: 0000000000000000
R13: 0000000000000002 R14: ffff9a6bc1c3c02c R15: ffff9a6be3558250
FS: 0000000000000000(0000) GS:ffff9a6eefdc0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000014 CR3: 000000010b078003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
smc_llc_send_add_link+0x1ae/0x2f0 [smc]
smc_llc_srv_add_link+0x2c9/0x5a0 [smc]
? cc_mkenc+0x40/0x60
smc_llc_add_link_work+0xb8/0x140 [smc]
process_one_work+0x1e5/0x3f0
worker_thread+0x4d/0x2f0
? __pfx_worker_thread+0x10/0x10
kthread+0xe5/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2c/0x50
</TASK>
When an alernate RNIC is available in system, SMC will try to add a new
link based on the RNIC for resilience. All the RMBs in use will be mapped
to the new link. Then the RMBs' MRs corresponding to the new link will be
filled into SMCRv2 LLC ADD LINK messages.
However, smc_llc_fill_ext_v2() mistakenly accesses to unused RMBs which
haven't been mapped to the new link and have no valid MRs, thus causing
a crash. So this patch fixes the logic.
Fixes: b4ba4652b3 ("net/smc: extend LLC layer for SMC-Rv2")
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When finding the first RMB of link group, it should start from the
current RMB list whose index is 0. So fix it.
Fixes: b4ba4652b3 ("net/smc: extend LLC layer for SMC-Rv2")
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We found a crash when using SMCRv2 with 2 Mellanox ConnectX-4. It
can be reproduced by:
- smc_run nginx
- smc_run wrk -t 32 -c 500 -d 30 http://<ip>:<port>
BUG: kernel NULL pointer dereference, address: 0000000000000014
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 8000000108713067 P4D 8000000108713067 PUD 151127067 PMD 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 4 PID: 2441 Comm: kworker/4:249 Kdump: loaded Tainted: G W E 6.4.0-rc1+ #42
Workqueue: smc_hs_wq smc_listen_work [smc]
RIP: 0010:smc_clc_send_confirm_accept+0x284/0x580 [smc]
RSP: 0018:ffffb8294b2d7c78 EFLAGS: 00010a06
RAX: ffff8f1873238880 RBX: ffffb8294b2d7dc8 RCX: 0000000000000000
RDX: 00000000000000b4 RSI: 0000000000000001 RDI: 0000000000b40c00
RBP: ffffb8294b2d7db8 R08: ffff8f1815c5860c R09: 0000000000000000
R10: 0000000000000400 R11: 0000000000000000 R12: ffff8f1846f56180
R13: ffff8f1815c5860c R14: 0000000000000001 R15: 0000000000000001
FS: 0000000000000000(0000) GS:ffff8f1aefd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000014 CR3: 00000001027a0001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
? mlx5_ib_map_mr_sg+0xa1/0xd0 [mlx5_ib]
? smcr_buf_map_link+0x24b/0x290 [smc]
? __smc_buf_create+0x4ee/0x9b0 [smc]
smc_clc_send_accept+0x4c/0xb0 [smc]
smc_listen_work+0x346/0x650 [smc]
? __schedule+0x279/0x820
process_one_work+0x1e5/0x3f0
worker_thread+0x4d/0x2f0
? __pfx_worker_thread+0x10/0x10
kthread+0xe5/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2c/0x50
</TASK>
During the CLC handshake, server sequentially tries available SMCRv2
and SMCRv1 devices in smc_listen_work().
If an SMCRv2 device is found. SMCv2 based link group and link will be
assigned to the connection. Then assumed that some buffer assignment
errors happen later in the CLC handshake, such as RMB registration
failure, server will give up SMCRv2 and try SMCRv1 device instead. But
the resources assigned to the connection won't be reset.
When server tries SMCRv1 device, the connection creation process will
be executed again. Since conn->lnk has been assigned when trying SMCRv2,
it will not be set to the correct SMCRv1 link in
smcr_lgr_conn_assign_link(). So in such situation, conn->lgr points to
correct SMCRv1 link group but conn->lnk points to the SMCRv2 link
mistakenly.
Then in smc_clc_send_confirm_accept(), conn->rmb_desc->mr[link->link_idx]
will be accessed. Since the link->link_idx is not correct, the related
MR may not have been initialized, so crash happens.
| Try SMCRv2 device first
| |-> conn->lgr: assign existed SMCRv2 link group;
| |-> conn->link: assign existed SMCRv2 link (link_idx may be 1 in SMC_LGR_SYMMETRIC);
| |-> sndbuf & RMB creation fails, quit;
|
| Try SMCRv1 device then
| |-> conn->lgr: create SMCRv1 link group and assign;
| |-> conn->link: keep SMCRv2 link mistakenly;
| |-> sndbuf & RMB creation succeed, only RMB->mr[link_idx = 0]
| initialized.
|
| Then smc_clc_send_confirm_accept() accesses
| conn->rmb_desc->mr[conn->link->link_idx, which is 1], then crash.
v
This patch tries to fix this by cleaning conn->lnk before assigning
link. In addition, it is better to reset the connection and clean the
resources assigned if trying SMCRv2 failed in buffer creation or
registration.
Fixes: e49300a6bf ("net/smc: add listen processing for SMC-Rv2")
Link: https://lore.kernel.org/r/20220523055056.2078994-1-liuyacan@corp.netease.com/
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
__condition is evaluated twice in sk_wait_event() macro.
First invocation is lockless, and reads can race with writes,
as spotted by syzbot.
BUG: KCSAN: data-race in sk_stream_wait_connect / tcp_disconnect
write to 0xffff88812d83d6a0 of 4 bytes by task 9065 on cpu 1:
tcp_disconnect+0x2cd/0xdb0
inet_shutdown+0x19e/0x1f0 net/ipv4/af_inet.c:911
__sys_shutdown_sock net/socket.c:2343 [inline]
__sys_shutdown net/socket.c:2355 [inline]
__do_sys_shutdown net/socket.c:2363 [inline]
__se_sys_shutdown+0xf8/0x140 net/socket.c:2361
__x64_sys_shutdown+0x31/0x40 net/socket.c:2361
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
read to 0xffff88812d83d6a0 of 4 bytes by task 9040 on cpu 0:
sk_stream_wait_connect+0x1de/0x3a0 net/core/stream.c:75
tcp_sendmsg_locked+0x2e4/0x2120 net/ipv4/tcp.c:1266
tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1484
inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:651
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
__sys_sendto+0x246/0x300 net/socket.c:2142
__do_sys_sendto net/socket.c:2154 [inline]
__se_sys_sendto net/socket.c:2150 [inline]
__x64_sys_sendto+0x78/0x90 net/socket.c:2150
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
value changed: 0x00000000 -> 0x00000068
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page().
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive rather
than exclusive, and has fixed a bunch of errors which were caused by its
unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZEr3zQAKCRDdBJ7gKXxA
jlLoAP0fpQBipwFxED0Us4SKQfupV6z4caXNJGPeay7Aj11/kQD/aMRC2uPfgr96
eMG3kwn2pqkB9ST2QpkaRbxA//eMbQY=
=J+Dj
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Nick Piggin's "shoot lazy tlbs" series, to improve the peformance of
switching from a user process to a kernel thread.
- More folio conversions from Kefeng Wang, Zhang Peng and Pankaj
Raghav.
- zsmalloc performance improvements from Sergey Senozhatsky.
- Yue Zhao has found and fixed some data race issues around the
alteration of memcg userspace tunables.
- VFS rationalizations from Christoph Hellwig:
- removal of most of the callers of write_one_page()
- make __filemap_get_folio()'s return value more useful
- Luis Chamberlain has changed tmpfs so it no longer requires swap
backing. Use `mount -o noswap'.
- Qi Zheng has made the slab shrinkers operate locklessly, providing
some scalability benefits.
- Keith Busch has improved dmapool's performance, making part of its
operations O(1) rather than O(n).
- Peter Xu adds the UFFD_FEATURE_WP_UNPOPULATED feature to userfaultd,
permitting userspace to wr-protect anon memory unpopulated ptes.
- Kirill Shutemov has changed MAX_ORDER's meaning to be inclusive
rather than exclusive, and has fixed a bunch of errors which were
caused by its unintuitive meaning.
- Axel Rasmussen give userfaultfd the UFFDIO_CONTINUE_MODE_WP feature,
which causes minor faults to install a write-protected pte.
- Vlastimil Babka has done some maintenance work on vma_merge():
cleanups to the kernel code and improvements to our userspace test
harness.
- Cleanups to do_fault_around() by Lorenzo Stoakes.
- Mike Rapoport has moved a lot of initialization code out of various
mm/ files and into mm/mm_init.c.
- Lorenzo Stoakes removd vmf_insert_mixed_prot(), which was added for
DRM, but DRM doesn't use it any more.
- Lorenzo has also coverted read_kcore() and vread() to use iterators
and has thereby removed the use of bounce buffers in some cases.
- Lorenzo has also contributed further cleanups of vma_merge().
- Chaitanya Prakash provides some fixes to the mmap selftesting code.
- Matthew Wilcox changes xfs and afs so they no longer take sleeping
locks in ->map_page(), a step towards RCUification of pagefaults.
- Suren Baghdasaryan has improved mmap_lock scalability by switching to
per-VMA locking.
- Frederic Weisbecker has reworked the percpu cache draining so that it
no longer causes latency glitches on cpu isolated workloads.
- Mike Rapoport cleans up and corrects the ARCH_FORCE_MAX_ORDER Kconfig
logic.
- Liu Shixin has changed zswap's initialization so we no longer waste a
chunk of memory if zswap is not being used.
- Yosry Ahmed has improved the performance of memcg statistics
flushing.
- David Stevens has fixed several issues involving khugepaged,
userfaultfd and shmem.
- Christoph Hellwig has provided some cleanup work to zram's IO-related
code paths.
- David Hildenbrand has fixed up some issues in the selftest code's
testing of our pte state changing.
- Pankaj Raghav has made page_endio() unneeded and has removed it.
- Peter Xu contributed some rationalizations of the userfaultfd
selftests.
- Yosry Ahmed has fixed an issue around memcg's page recalim
accounting.
- Chaitanya Prakash has fixed some arm-related issues in the
selftests/mm code.
- Longlong Xia has improved the way in which KSM handles hwpoisoned
pages.
- Peter Xu fixes a few issues with uffd-wp at fork() time.
- Stefan Roesch has changed KSM so that it may now be used on a
per-process and per-cgroup basis.
* tag 'mm-stable-2023-04-27-15-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm,unmap: avoid flushing TLB in batch if PTE is inaccessible
shmem: restrict noswap option to initial user namespace
mm/khugepaged: fix conflicting mods to collapse_file()
sparse: remove unnecessary 0 values from rc
mm: move 'mmap_min_addr' logic from callers into vm_unmapped_area()
hugetlb: pte_alloc_huge() to replace huge pte_alloc_map()
maple_tree: fix allocation in mas_sparse_area()
mm: do not increment pgfault stats when page fault handler retries
zsmalloc: allow only one active pool compaction context
selftests/mm: add new selftests for KSM
mm: add new KSM process and sysfs knobs
mm: add new api to enable ksm per process
mm: shrinkers: fix debugfs file permissions
mm: don't check VMA write permissions if the PTE/PMD indicates write permissions
migrate_pages_batch: fix statistics for longterm pin retry
userfaultfd: use helper function range_in_vma()
lib/show_mem.c: use for_each_populated_zone() simplify code
mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()
fs/buffer: convert create_page_buffers to folio_create_buffers
fs/buffer: add folio_create_empty_buffers helper
...
With Eric's ref tracker, syzbot finally found a repro for
use-after-free in tcp_write_timer_handler() by kernel TCP
sockets. [0]
If SMC creates a kernel socket in __smc_create(), the kernel
socket is supposed to be freed in smc_clcsock_release() by
calling sock_release() when we close() the parent SMC socket.
However, at the end of smc_clcsock_release(), the kernel
socket's sk_state might not be TCP_CLOSE. This means that
we have not called inet_csk_destroy_sock() in __tcp_close()
and have not stopped the TCP timers.
The kernel socket's TCP timers can be fired later, so we
need to hold a refcnt for net as we do for MPTCP subflows
in mptcp_subflow_create_socket().
[0]:
leaked reference.
sk_alloc (./include/net/net_namespace.h:335 net/core/sock.c:2108)
inet_create (net/ipv4/af_inet.c:319 net/ipv4/af_inet.c:244)
__sock_create (net/socket.c:1546)
smc_create (net/smc/af_smc.c:3269 net/smc/af_smc.c:3284)
__sock_create (net/socket.c:1546)
__sys_socket (net/socket.c:1634 net/socket.c:1618 net/socket.c:1661)
__x64_sys_socket (net/socket.c:1672)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
==================================================================
BUG: KASAN: slab-use-after-free in tcp_write_timer_handler (net/ipv4/tcp_timer.c:378 net/ipv4/tcp_timer.c:624 net/ipv4/tcp_timer.c:594)
Read of size 1 at addr ffff888052b65e0d by task syzrepro/18091
CPU: 0 PID: 18091 Comm: syzrepro Tainted: G W 6.3.0-rc4-01174-gb5d54eb5899a #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.amzn2022.0.1 04/01/2014
Call Trace:
<IRQ>
dump_stack_lvl (lib/dump_stack.c:107)
print_report (mm/kasan/report.c:320 mm/kasan/report.c:430)
kasan_report (mm/kasan/report.c:538)
tcp_write_timer_handler (net/ipv4/tcp_timer.c:378 net/ipv4/tcp_timer.c:624 net/ipv4/tcp_timer.c:594)
tcp_write_timer (./include/linux/spinlock.h:390 net/ipv4/tcp_timer.c:643)
call_timer_fn (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/timer.h:127 kernel/time/timer.c:1701)
__run_timers.part.0 (kernel/time/timer.c:1752 kernel/time/timer.c:2022)
run_timer_softirq (kernel/time/timer.c:2037)
__do_softirq (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/irq.h:142 kernel/softirq.c:572)
__irq_exit_rcu (kernel/softirq.c:445 kernel/softirq.c:650)
irq_exit_rcu (kernel/softirq.c:664)
sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1107 (discriminator 14))
</IRQ>
Fixes: ac7138746e ("smc: establish new socket family")
Reported-by: syzbot+7e1e1bdb852961150198@syzkaller.appspotmail.com
Link: https://lore.kernel.org/netdev/000000000000a3f51805f8bcc43a@google.com/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can change smc_sk() to propagate its argument const qualifier,
thanks to container_of_const().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The refcount wr_tx_refcnt may cause cache thrashing problems among
cores and we can use percpu ref to mitigate this issue here. We
gain some performance improvement with percpu ref here on our
customized smc-r verion. Applying cache alignment may also mitigate
this problem but it seem more reasonable to use percpu ref here.
We can also replace wr_reg_refcnt with one percpu reference like
wr_tx_refcnt.
redis-benchmark on smc-r with atomic wr_tx_refcnt:
SET: 525707.06 requests per second, p50=0.087 msec
GET: 554877.38 requests per second, p50=0.087 msec
redis-benchmark on the percpu_ref version:
SET: 540482.06 requests per second, p50=0.087 msec
GET: 570711.12 requests per second, p50=0.079 msec
Cases are like "redis-benchmark -h x.x.x.x -q -t set,get -P 1 -n
5000000 -c 50 -d 10 --threads 4".
Signed-off-by: Kai Shen <KaiShen@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously, v2 support was derived from a very specific format of the SEID
as part of the SMC-D codebase. Make this part of the SMC-D device API, so
implementers do not need to adhere to a specific SEID format.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Reviewed-and-tested-by: Jan Karcher <jaka@linux.ibm.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
CLC message initialization was not properly reversed in error handling path.
Reported-and-suggested-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When performing a stress test on SMC-R by rmmod mlx5_ib driver
during the wrk/nginx test, we found that there is a probability
of triggering a panic while terminating all link groups.
This issue dues to the race between smc_smcr_terminate_all()
and smc_buf_create().
smc_smcr_terminate_all
smc_buf_create
/* init */
conn->sndbuf_desc = NULL;
...
__smc_lgr_terminate
smc_conn_kill
smc_close_abort
smc_cdc_get_slot_and_msg_send
__softirqentry_text_start
smc_wr_tx_process_cqe
smc_cdc_tx_handler
READ(conn->sndbuf_desc->len);
/* panic dues to NULL sndbuf_desc */
conn->sndbuf_desc = xxx;
This patch tries to fix the issue by always to check the sndbuf_desc
before send any cdc msg, to make sure that no null pointer is
seen during cqe processing.
Fixes: 0b29ec6436 ("net/smc: immediate termination for SMCR link groups")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Link: https://lore.kernel.org/r/1678263432-17329-1-git-send-email-alibuda@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Before determining whether the msg has unsupported options, it has been
prematurely terminated by the wrong status check.
For the application, the general usages of MSG_FASTOPEN likes
fd = socket(...)
/* rather than connect */
sendto(fd, data, len, MSG_FASTOPEN)
Hence, We need to check the flag before state check, because the sock
state here is always SMC_INIT when applications tries MSG_FASTOPEN.
Once we found unsupported options, fallback it to TCP.
Fixes: ee9dfbef02 ("net/smc: handle sockopts forcing fallback")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
v2 -> v1: Optimize code style
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a certain probability that following
exceptions will occur in the wrk benchmark test:
Running 10s test @ http://11.213.45.6:80
8 threads and 64 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.72ms 13.94ms 245.33ms 94.17%
Req/Sec 1.96k 713.67 5.41k 75.16%
155262 requests in 10.10s, 23.10MB read
Non-2xx or 3xx responses: 3
We will find that the error is HTTP 400 error, which is a serious
exception in our test, which means the application data was
corrupted.
Consider the following scenarios:
CPU0 CPU1
buf_desc->used = 0;
cmpxchg(buf_desc->used, 0, 1)
deal_with(buf_desc)
memset(buf_desc->cpu_addr,0);
This will cause the data received by a victim connection to be cleared,
thus triggering an HTTP 400 error in the server.
This patch exchange the order between clear used and memset, add
barrier to ensure memory consistency.
Fixes: 1c5526968e ("net/smc: Clear memory when release and reuse buffer")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is a certain chance to trigger the following panic:
PID: 5900 TASK: ffff88c1c8af4100 CPU: 1 COMMAND: "kworker/1:48"
#0 [ffff9456c1cc79a0] machine_kexec at ffffffff870665b7
#1 [ffff9456c1cc79f0] __crash_kexec at ffffffff871b4c7a
#2 [ffff9456c1cc7ab0] crash_kexec at ffffffff871b5b60
#3 [ffff9456c1cc7ac0] oops_end at ffffffff87026ce7
#4 [ffff9456c1cc7ae0] page_fault_oops at ffffffff87075715
#5 [ffff9456c1cc7b58] exc_page_fault at ffffffff87ad0654
#6 [ffff9456c1cc7b80] asm_exc_page_fault at ffffffff87c00b62
[exception RIP: ib_alloc_mr+19]
RIP: ffffffffc0c9cce3 RSP: ffff9456c1cc7c38 RFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000004
RDX: 0000000000000010 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88c1ea281d00 R8: 000000020a34ffff R9: ffff88c1350bbb20
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000010 R14: ffff88c1ab040a50 R15: ffff88c1ea281d00
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff9456c1cc7c60] smc_ib_get_memory_region at ffffffffc0aff6df [smc]
#8 [ffff9456c1cc7c88] smcr_buf_map_link at ffffffffc0b0278c [smc]
#9 [ffff9456c1cc7ce0] __smc_buf_create at ffffffffc0b03586 [smc]
The reason here is that when the server tries to create a second link,
smc_llc_srv_add_link() has no protection and may add a new link to
link group. This breaks the security environment protected by
llc_conf_mutex.
Fixes: 2d2209f201 ("net/smc: first part of add link processing as SMC server")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e48c414ee6 ("[INET]: Generalise the TCP sock ID lookup routines")
commented out the definition of SOCK_REFCNT_DEBUG in 2005 and later another
commit 463c84b97f ("[NET]: Introduce inet_connection_sock") removed it.
Since we could track all of them through bpf and kprobe related tools
and the feature could print loads of information which might not be
that helpful even under a little bit pressure, the whole feature which
has been inactive for many years is no longer supported.
Link: https://lore.kernel.org/lkml/20230211065153.54116-1-kerneljasonxing@gmail.com/
Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's clear that rmbs_lock and sndbufs_lock are aims to protect the
rmbs list or the sndbufs list.
During connection establieshment, smc_buf_get_slot() will always
be invoked, and it only performs read semantics in rmbs list and
sndbufs list.
Based on the above considerations, we replace mutex with rw_semaphore.
Only smc_buf_get_slot() use down_read() to allow smc_buf_get_slot()
run concurrently, other part use down_write() to keep exclusive
semantics.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Unlike smc_buf_create() and smcr_buf_unuse(), smcr_lgr_reg_rmbs() is
exclusive when assigned rmb_desc was not registered, although it can be
executed in parallel when assigned rmb_desc was registered already
and only performs read semtamics on it. Hence, we can not simply replace
it with read semaphore.
The idea here is that if the assigned rmb_desc was registered already,
use read semaphore to protect the critical section, once the assigned
rmb_desc was not registered, keep using keep write semaphore still
to keep its exclusivity.
Thanks to the reusable features of rmb_desc, which allows us to execute
in parallel in most cases.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Following is part of Off-CPU graph during frequent SMC-R short-lived
processing:
process_one_work (51.19%)
smc_close_passive_work (28.36%)
smcr_buf_unuse (28.34%)
rwsem_down_write_slowpath (28.22%)
smc_listen_work (22.83%)
smc_clc_wait_msg (1.84%)
smc_buf_create (20.45%)
smcr_buf_map_usable_links
rwsem_down_write_slowpath (20.43%)
smcr_lgr_reg_rmbs (0.53%)
rwsem_down_write_slowpath (0.43%)
smc_llc_do_confirm_rkey (0.08%)
We can clearly see that during the connection establishment time,
waiting time of connections is not on IO, but on llc_conf_mutex.
What is more important, the core critical area (smcr_buf_unuse() &
smc_buf_create()) only perfroms read semantics on links, we can
easily replace it with read semaphore.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
llc_conf_mutex was used to protect links and link related configurations
in the same link group, for example, add or delete links. However,
in most cases, the protected critical area has only read semantics and
with no write semantics at all, such as obtaining a usable link or an
available rmb_desc.
This patch do simply code refactoring, replace mutex with rw_semaphore,
replace mutex_lock with down_write and replace mutex_unlock with
up_write.
Theoretically, this replacement is equivalent, but after this patch,
we can distinguish lock granularity according to different semantics
of critical areas.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Number of files depend on linux/splice.h getting included
by linux/skbuff.h which soon will no longer be the case.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The struct device for ISM devices was part of struct smcd_dev. Move to
struct ism_dev, provide a new API call in struct smcd_ops, and convert
existing SMCD code accordingly.
Furthermore, remove struct smcd_dev from struct ism_dev.
This is the final part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ism module had SMC-D-specific code sprinkled across the entire module.
We are now consolidating the SMC-D-specific parts into the latter parts
of the module, so it becomes more clear what code is intended for use with
ISM, and which parts are glue code for usage in the context of SMC-D.
This is the fourth part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We separate the code implementing the struct smcd_ops API in the ISM
device driver from the functions that may be used by other exploiters of
ISM devices.
Note: We start out small, and don't offer the whole breadth of the ISM
device for public use, as many functions are specific to or likely only
ever used in the context of SMC-D.
This is the third part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Register the smc module with the new ism device driver API.
This is the second part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a new API that allows other drivers to concurrently access ISM devices.
To do so, we introduce a new API that allows other modules to register for
ISM device usage. Furthermore, we move the GID to struct ism, where it
belongs conceptually, and rename and relocate struct smcd_event to struct
ism_event.
This is the first part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Removing an ISM device prior to terminating its associated connections
doesn't end well.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As suggested by Cong, introduce a tracepoint for all ->sk_data_ready()
callback implementations. For example:
<...>
iperf-609 [002] ..... 70.660425: sk_data_ready: family=2 protocol=6 func=sock_def_readable
iperf-609 [002] ..... 70.660436: sk_data_ready: family=2 protocol=6 func=sock_def_readable
<...>
Suggested-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.
Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In smc_init(), register_pernet_subsys(&smc_net_stat_ops) is called
without any error handling.
If it fails, registering of &smc_net_ops won't be reverted.
And if smc_nl_init() fails, &smc_net_stat_ops itself won't be reverted.
This leaves wild ops in subsystem linkedlist and when another module
tries to call register_pernet_operations() it triggers page fault:
BUG: unable to handle page fault for address: fffffbfff81b964c
RIP: 0010:register_pernet_operations+0x1b9/0x5f0
Call Trace:
<TASK>
register_pernet_subsys+0x29/0x40
ebtables_init+0x58/0x1000 [ebtables]
...
Fixes: 194730a9be ("net/smc: Make SMC statistics network namespace aware")
Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Link: https://lore.kernel.org/r/20221101093722.127223-1-chenzhongjin@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
If smc_wr_alloc_lgr_mem() fails then return an error code. Don't return
success.
Fixes: 8799e310fb ("net/smc: add v2 support to the work request layer")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, SMC uses smc->sk.sk_{rcv|snd}buf to create buffers for
send buffer and RMB. And the values of buffer size are from tcp_{w|r}mem
in clcsock.
The buffer size from TCP socket doesn't fit SMC well. Generally, buffers
are usually larger than TCP for SMC-R/-D to get higher performance, for
they are different underlay devices and paths.
So this patch unbinds buffer size from TCP, and introduces two sysctl
knobs to tune them independently. Also, these knobs are per net
namespace and work for containers.
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
SMC-R tests the viability of link by sending out TEST_LINK LLC
messages over RoCE fabric when connections on link have been
idle for a time longer than keepalive interval (testlink time).
But using tcp_keepalive_time as testlink time maybe not quite
suitable because it is default no less than two hours[1], which
is too long for single link to find peer dead. The active host
will still use peer-dead link (QP) sending messages, and can't
find out until get IB_WC_RETRY_EXC_ERR error CQEs, which takes
more time than TEST_LINK timeout (SMC_LLC_WAIT_TIME) normally.
So this patch introduces a independent sysctl for SMC-R to set
link keepalive time, in order to detect link down in time. The
default value is 30 seconds.
[1] https://www.rfc-editor.org/rfc/rfc1122#page-101
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
There might be a potential race between SMC-R buffer map and
link group termination.
smc_smcr_terminate_all() | smc_connect_rdma()
--------------------------------------------------------------
| smc_conn_create()
for links in smcibdev |
schedule links down |
| smc_buf_create()
| \- smcr_buf_map_usable_links()
| \- no usable links found,
| (rmb->mr = NULL)
|
| smc_clc_send_confirm()
| \- access conn->rmb_desc->mr[]->rkey
| (panic)
During reboot and IB device module remove, all links will be set
down and no usable links remain in link groups. In such situation
smcr_buf_map_usable_links() should return an error and stop the
CLC flow accessing to uninitialized mr.
Fixes: b9247544c1 ("net/smc: convert static link ID instances to support multiple links")
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Link: https://lore.kernel.org/r/1663656189-32090-1-git-send-email-guwen@linux.alibaba.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
drivers/net/ethernet/freescale/fec.h
7d650df99d ("net: fec: add pm_qos support on imx6q platform")
40c79ce13b ("net: fec: add stop mode support for imx8 platform")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
For passive connections, the refcount increment has been done in
smc_clcsock_accept()-->smc_sock_alloc().
Fixes: 3b2dec2603 ("net/smc: restructure client and server code in af_smc")
Signed-off-by: Yacan Liu <liuyacan@corp.netease.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220830152314.838736-1-liuyacan@corp.netease.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We had historically not checked that genlmsghdr.reserved
is 0 on input which prevents us from using those precious
bytes in the future.
One use case would be to extend the cmd field, which is
currently just 8 bits wide and 256 is not a lot of commands
for some core families.
To make sure that new families do the right thing by default
put the onus of opting out of validation on existing families.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Paul Moore <paul@paul-moore.com> (NetLabel)
Signed-off-by: David S. Miller <davem@davemloft.net>
Previously, the smc and smc_diag modules were automatically loaded as
dependencies of the ism module whenever an ISM device was present.
With the pending rework of the ISM API, the smc module will no longer
automatically be loaded in presence of an ISM device. Usage of an AF_SMC
socket will still trigger loading of the smc modules, but usage of a
netlink socket will not.
This is addressed by setting the correct module aliases.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Wenjia Zhang < wenjia@linux.ibm.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the DMBE bits, which are passed on individually in ism_move() as
parameter idx, available to the receiver.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Wenjia Zhang < wenjia@linux.ibm.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Reworked signature of the function to retrieve the system EID: No plausible
reason to use a double pointer. And neither to pass in the device as an
argument, as this identifier is by definition per system, not per device.
Plus some minor consistency edits.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Wenjia Zhang < wenjia@linux.ibm.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This struct is used in a single place only, and its usage generates
inefficient code. Time to clean up!
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-and-tested-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Wenjia Zhang < wenjia@linux.ibm.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_tcp_keepalive_(time|probes|intvl), they can be changed
concurrently. Thus, we need to add READ_ONCE() to their readers.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Extend SMC-R link group netlink attribute SMC_GEN_LGR_SMCR.
Introduce SMC_NLA_LGR_R_BUF_TYPE to show the buffer type of
SMC-R link group.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On long-running enterprise production servers, high-order contiguous
memory pages are usually very rare and in most cases we can only get
fragmented pages.
When replacing TCP with SMC-R in such production scenarios, attempting
to allocate high-order physically contiguous sndbufs and RMBs may result
in frequent memory compaction, which will cause unexpected hung issue
and further stability risks.
So this patch is aimed to allow SMC-R link group to use virtually
contiguous sndbufs and RMBs to avoid potential issues mentioned above.
Whether to use physically or virtually contiguous buffers can be set
by sysctl smcr_buf_type.
Note that using virtually contiguous buffers will bring an acceptable
performance regression, which can be mainly divided into two parts:
1) regression in data path, which is brought by additional address
translation of sndbuf by RNIC in Tx. But in general, translating
address through MTT is fast.
Taking 256KB sndbuf and RMB as an example, the comparisons in qperf
latency and bandwidth test with physically and virtually contiguous
buffers are as follows:
- client:
smc_run taskset -c <cpu> qperf <server> -oo msg_size:1:64K:*2\
-t 5 -vu tcp_{bw|lat}
- server:
smc_run taskset -c <cpu> qperf
[latency]
msgsize tcp smcr smcr-use-virt-buf
1 11.17 us 7.56 us 7.51 us (-0.67%)
2 10.65 us 7.74 us 7.56 us (-2.31%)
4 11.11 us 7.52 us 7.59 us ( 0.84%)
8 10.83 us 7.55 us 7.51 us (-0.48%)
16 11.21 us 7.46 us 7.51 us ( 0.71%)
32 10.65 us 7.53 us 7.58 us ( 0.61%)
64 10.95 us 7.74 us 7.80 us ( 0.76%)
128 11.14 us 7.83 us 7.87 us ( 0.47%)
256 10.97 us 7.94 us 7.92 us (-0.28%)
512 11.23 us 7.94 us 8.20 us ( 3.25%)
1024 11.60 us 8.12 us 8.20 us ( 0.96%)
2048 14.04 us 8.30 us 8.51 us ( 2.49%)
4096 16.88 us 9.13 us 9.07 us (-0.64%)
8192 22.50 us 10.56 us 11.22 us ( 6.26%)
16384 28.99 us 12.88 us 13.83 us ( 7.37%)
32768 40.13 us 16.76 us 16.95 us ( 1.16%)
65536 68.70 us 24.68 us 24.85 us ( 0.68%)
[bandwidth]
msgsize tcp smcr smcr-use-virt-buf
1 1.65 MB/s 1.59 MB/s 1.53 MB/s (-3.88%)
2 3.32 MB/s 3.17 MB/s 3.08 MB/s (-2.67%)
4 6.66 MB/s 6.33 MB/s 6.09 MB/s (-3.85%)
8 13.67 MB/s 13.45 MB/s 11.97 MB/s (-10.99%)
16 25.36 MB/s 27.15 MB/s 24.16 MB/s (-11.01%)
32 48.22 MB/s 54.24 MB/s 49.41 MB/s (-8.89%)
64 106.79 MB/s 107.32 MB/s 99.05 MB/s (-7.71%)
128 210.21 MB/s 202.46 MB/s 201.02 MB/s (-0.71%)
256 400.81 MB/s 416.81 MB/s 393.52 MB/s (-5.59%)
512 746.49 MB/s 834.12 MB/s 809.99 MB/s (-2.89%)
1024 1292.33 MB/s 1641.96 MB/s 1571.82 MB/s (-4.27%)
2048 2007.64 MB/s 2760.44 MB/s 2717.68 MB/s (-1.55%)
4096 2665.17 MB/s 4157.44 MB/s 4070.76 MB/s (-2.09%)
8192 3159.72 MB/s 4361.57 MB/s 4270.65 MB/s (-2.08%)
16384 4186.70 MB/s 4574.13 MB/s 4501.17 MB/s (-1.60%)
32768 4093.21 MB/s 4487.42 MB/s 4322.43 MB/s (-3.68%)
65536 4057.14 MB/s 4735.61 MB/s 4555.17 MB/s (-3.81%)
2) regression in buffer initialization and destruction path, which is
brought by additional MR operations of sndbufs. But thanks to link
group buffer reuse mechanism, the impact of this kind of regression
decreases as times of buffer reuse increases.
Taking 256KB sndbuf and RMB as an example, latency of some key SMC-R
buffer-related function obtained by bpftrace are as follows:
Function Phys-bufs Virt-bufs
smcr_new_buf_create() 67154 ns 79164 ns
smc_ib_buf_map_sg() 525 ns 928 ns
smc_ib_get_memory_region() 162294 ns 161191 ns
smc_wr_reg_send() 9957 ns 9635 ns
smc_ib_put_memory_region() 203548 ns 198374 ns
smc_ib_buf_unmap_sg() 508 ns 1158 ns
------------
Test environment notes:
1. Above tests run on 2 VMs within the same Host.
2. The NIC is ConnectX-4Lx, using SRIOV and passing through 2 VFs to
the each VM respectively.
3. VMs' vCPUs are binded to different physical CPUs, and the binded
physical CPUs are isolated by `isolcpus=xxx` cmdline.
4. NICs' queue number are set to 1.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces a new SMC-R specific element buf_type
in struct smc_link_group, for recording the value of sysctl
smcr_buf_type when link group is created.
New created link group will create and reuse buffers of the
type specified by buf_type.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces the sysctl smcr_buf_type for setting
the type of SMC-R sndbufs and RMBs.
Valid values includes:
- SMCR_PHYS_CONT_BUFS, which means use physically contiguous
buffers for better performance and is the default value.
- SMCR_VIRT_CONT_BUFS, which means use virtually contiguous
buffers in case of physically contiguous memory is scarce.
- SMCR_MIXED_BUFS, which means first try to use physically
contiguous buffers. If not available, then use virtually
contiguous buffers.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some CPU, such as Xeon, can guarantee DMA cache coherency.
So it is no need to use dma sync APIs to flush cache on such CPUs.
In order to avoid calling dma sync APIs on the IO path, use the
dma_need_sync to check whether smc_buf_desc needs dma sync when
creating smc_buf_desc.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
smc_ib_sync_sg_for_cpu/device are the ops used for dma memory cache
consistency. Smc sndbufs are dma buffers, where CPU writes data to
it and PCIE device reads data from it. So for sndbufs,
smc_ib_sync_sg_for_device is needed and smc_ib_sync_sg_for_cpu is
redundant as PCIE device will not write the buffers. Smc rmbs
are dma buffers, where PCIE device write data to it and CPU read
data from it. So for rmbs, smc_ib_sync_sg_for_cpu is needed and
smc_ib_sync_sg_for_device is redundant as CPU will not write the buffers.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Netdev reference helpers have a dev_ prefix for historic
reasons. Renaming the old helpers would be too much churn
but we can rename the tracking ones which are relatively
recent and should be the default for new code.
Rename:
dev_hold_track() -> netdev_hold()
dev_put_track() -> netdev_put()
dev_replace_track() -> netdev_ref_replace()
Link: https://lore.kernel.org/r/20220608043955.919359-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
"struct smc_cdc_tx_pend **" can not directly convert
to "struct smc_wr_tx_pend_priv *".
Fixes: 2bced6aefa ("net/smc: put slot when connection is killed")
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the process of checking whether RDMAv2 is available, the current
implementation first sets ini->smcrv2.ib_dev_v2, and then allocates
smc buf desc and register rmb, but the latter may fail. In this case,
the pointer should be reset.
Fixes: e49300a6bf ("net/smc: add listen processing for SMC-Rv2")
Signed-off-by: liuyacan <liuyacan@corp.netease.com>
Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/20220525085408.812273-1-liuyacan@corp.netease.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In the process of checking whether RDMAv2 is available, the current
implementation first sets ini->smcrv2.ib_dev_v2, and then allocates
smc buf desc, but the latter may fail. Unfortunately, the caller
will only check the former. In this case, a NULL pointer reference
will occur in smc_clc_send_confirm_accept() when accessing
conn->rmb_desc.
This patch does two things:
1. Use the return code to determine whether V2 is available.
2. If the return code is NODEV, continue to check whether V1 is
available.
Fixes: e49300a6bf ("net/smc: add listen processing for SMC-Rv2")
Signed-off-by: liuyacan <liuyacan@corp.netease.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Same trigger condition as commit 86434744. When setsockopt runs
in parallel to a connect(), and switch the socket into fallback
mode. Then the sk_refcnt is incremented in smc_connect(), but
its state stay in SMC_INIT (NOT SMC_ACTIVE). This cause the
corresponding sk_refcnt decrement in __smc_release() will not be
performed.
Fixes: 86434744fe ("net/smc: add fallback check to connect()")
Signed-off-by: liuyacan <liuyacan@corp.netease.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rdma write with inline flag when sending small packages,
whose length is shorter than the qp's max_inline_data, can
help reducing latency.
In my test environment, which are 2 VMs running on the same
physical host and whose NICs(ConnectX-4Lx) are working on
SR-IOV mode, qperf shows 0.5us-0.7us improvement in latency.
Test command:
server: smc_run taskset -c 1 qperf
client: smc_run taskset -c 1 qperf <server ip> -oo \
msg_size:1:2K:*2 -t 30 -vu tcp_lat
The results shown below:
msgsize before after
1B 11.2 us 10.6 us (-0.6 us)
2B 11.2 us 10.7 us (-0.5 us)
4B 11.3 us 10.7 us (-0.6 us)
8B 11.2 us 10.6 us (-0.6 us)
16B 11.3 us 10.7 us (-0.6 us)
32B 11.3 us 10.6 us (-0.7 us)
64B 11.2 us 11.2 us (0 us)
128B 11.2 us 11.2 us (0 us)
256B 11.2 us 11.2 us (0 us)
512B 11.4 us 11.3 us (-0.1 us)
1KB 11.4 us 11.5 us (0.1 us)
2KB 11.5 us 11.5 us (0 us)
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Tested-by: kernel test robot <lkp@intel.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As cdc msg's length is 44B, cdc msgs can be sent inline in
most rdma devices, which can help reducing sending latency.
In my test environment, which are 2 VMs running on the same
physical host and whose NICs(ConnectX-4Lx) are working on
SR-IOV mode, qperf shows 0.4us-0.7us improvement in latency.
Test command:
server: smc_run taskset -c 1 qperf
client: smc_run taskset -c 1 qperf <server ip> -oo \
msg_size:1:2K:*2 -t 30 -vu tcp_lat
The results shown below:
msgsize before after
1B 11.9 us 11.2 us (-0.7 us)
2B 11.7 us 11.2 us (-0.5 us)
4B 11.7 us 11.3 us (-0.4 us)
8B 11.6 us 11.2 us (-0.4 us)
16B 11.7 us 11.3 us (-0.4 us)
32B 11.7 us 11.3 us (-0.4 us)
64B 11.7 us 11.2 us (-0.5 us)
128B 11.6 us 11.2 us (-0.4 us)
256B 11.8 us 11.2 us (-0.6 us)
512B 11.8 us 11.4 us (-0.4 us)
1KB 11.9 us 11.4 us (-0.5 us)
2KB 12.1 us 11.5 us (-0.6 us)
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Tested-by: kernel test robot <lkp@intel.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Connect with O_NONBLOCK will not be completed immediately
and returns -EINPROGRESS. It is possible to use selector/poll
for completion by selecting the socket for writing. After select
indicates writability, a second connect function call will return
0 to indicate connected successfully as TCP does, but smc returns
-EISCONN. Use socket state for smc to indicate connect state, which
can help smc aligning the connect behaviour with TCP.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Non blocking sendmsg will return -EAGAIN when any signal pending
and no send space left, while non blocking recvmsg return -EINTR
when signal pending and no data received. This may makes confused.
As TCP returns -EAGAIN in the conditions described above. Align the
behavior of smc with TCP.
Fixes: 846e344eb7 ("net/smc: add receive timeout check")
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/20220512030820.73848-1-guangguan.wang@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
syzbot reported a slab-out-of-bounds/use-after-free issue,
which was caused by accessing an already freed smc sock in
fallback-specific callback functions of clcsock.
This patch fixes the issue by restoring fallback-specific
callback functions to original ones and resetting clcsock
sk_user_data to NULL before freeing smc sock.
Meanwhile, this patch introduces sk_callback_lock to make
the access and assignment to sk_user_data mutually exclusive.
Reported-by: syzbot+b425899ed22c6943e00b@syzkaller.appspotmail.com
Fixes: 341adeec9a ("net/smc: Forward wakeup to smc socket waitqueue after fallback")
Link: https://lore.kernel.org/r/00000000000013ca8105d7ae3ada@google.com/
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Both listen and fallback process will save the current clcsock
callback functions and establish new ones. But if both of them
happen, the saved callback functions will be overwritten.
So this patch introduces some helpers to ensure that only save
the original callback functions of clcsock.
Fixes: 341adeec9a ("net/smc: Forward wakeup to smc socket waitqueue after fallback")
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In the current implementation, when TCP initiates a connection
to an unavailable [ip,port], ECONNREFUSED will be stored in the
TCP socket, but SMC will not. However, some apps (like curl) use
getsockopt(,,SO_ERROR,,) to get the error information, which makes
them miss the error message and behave strangely.
Fixes: 50717a37db ("net/smc: nonblocking connect rework")
Signed-off-by: liuyacan <liuyacan@corp.netease.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit e5d5aadcf3 ("net/smc: fix sk_refcnt underflow on linkdown
and fallback"), for a fallback connection, __smc_release() does not call
sock_put() if its state is already SMC_CLOSED.
When calling smc_shutdown() after falling back, its state is set to
SMC_CLOSED but does not call sock_put(), so this patch calls it.
Reported-and-tested-by: syzbot+6e29a053eb165bd50de5@syzkaller.appspotmail.com
Fixes: e5d5aadcf3 ("net/smc: fix sk_refcnt underflow on linkdown and fallback")
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Child sockets may inherit the af_ops from the parent listen socket.
When the listen socket is released then the af_ops of the child socket
points to released memory.
Solve that by restoring the original af_ops for child sockets which
inherited the parent af_ops. And clear any inherited user_data of the
parent socket.
Fixes: 8270d9c210 ("net/smc: Limit backlog connections")
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dev_name() was called with dev.parent as argument but without to
NULL-check it before.
Solve this by checking the pointer before the call to dev_name().
Fixes: af5f60c7e3 ("net/smc: allow PCI IDs as ib device names in the pnet table")
Reported-by: syzbot+03e3e228510223dabd34@syzkaller.appspotmail.com
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Using snprintf() to convert not null-terminated strings to null
terminated strings may cause out of bounds read in the source string.
Therefore use memcpy() and terminate the target string with a null
afterwards.
Fixes: fa08666255 ("net/smc: add support for user defined EIDs")
Fixes: 3c572145c2 ("net/smc: add generic netlink support for system EID")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The current autocork algorithms will delay the data transmission
in BH context to smc_release_cb() when sock_lock is hold by user.
So there is a possibility that when connection is being actively
closed (sock_lock is hold by user now), some corked data still
remains in sndbuf, waiting to be sent by smc_release_cb(). This
will cause:
- smc_close_stream_wait(), which is called under the sock_lock,
has a high probability of timeout because data transmission is
delayed until sock_lock is released.
- Unexpected data sends may happen after connction closed and use
the rtoken which has been deleted by remote peer through
LLC_DELETE_RKEY messages.
So this patch will try to send out the remaining corked data in
sndbuf before active close process, to ensure data integrity and
avoid unexpected data transmission after close.
Reported-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Fixes: 6b88af839d ("net/smc: don't send in the BH context if sock_owned_by_user")
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/1648447836-111521-1-git-send-email-guwen@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Recently added smc_sysctl_net_exit() forgot to free
the memory allocated from smc_sysctl_net_init()
for non initial network namespace.
Fixes: 462791bbfa ("net/smc: add sysctl interface for SMC")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Tony Lu <tonylu@linux.alibaba.com>
Cc: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
when CONFIG_SYSCTL not set, smc_sysctl_net_init/exit
need to be static inline to avoid missing-prototypes
if compile with W=1.
Since __net_exit has noinline annotation when CONFIG_NET_NS
not set, it should not be used with static inline.
So remove the __net_init/exit when CONFIG_SYSCTL not set.
Fixes: 7de8eb0d90 ("net/smc: fix compile warning for smc_sysctl")
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220309033051.41893-1-dust.li@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
kernel test robot reports multiple warning for smc_sysctl:
In file included from net/smc/smc_sysctl.c:17:
>> net/smc/smc_sysctl.h:23:5: warning: no previous prototype \
for function 'smc_sysctl_init' [-Wmissing-prototypes]
int smc_sysctl_init(void)
^
and
>> WARNING: modpost: vmlinux.o(.text+0x12ced2d): Section mismatch \
in reference from the function smc_sysctl_exit() to the variable
.init.data:smc_sysctl_ops
The function smc_sysctl_exit() references
the variable __initdata smc_sysctl_ops.
This is often because smc_sysctl_exit lacks a __initdata
annotation or the annotation of smc_sysctl_ops is wrong.
and
net/smc/smc_sysctl.c: In function 'smc_sysctl_init_net':
net/smc/smc_sysctl.c:47:17: error: 'struct netns_smc' has no member named 'smc_hdr'
47 | net->smc.smc_hdr = register_net_sysctl(net, "net/smc", table);
Since we don't need global sysctl initialization. To make things
clean and simple, remove the global pernet_operations and
smc_sysctl_{init|exit}. Call smc_sysctl_net_{init|exit} directly
from smc_net_{init|exit}.
Also initialized sysctl_autocorking_size if CONFIG_SYSCTL it not
set, this make sure SMC autocorking is enabled by default if
CONFIG_SYSCTL is not set.
Fixes: 462791bbfa ("net/smc: add sysctl interface for SMC")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit a505cce6f7.
Leon says:
We already discussed that. SMC should be changed to use
RDMA CQ pool API
drivers/infiniband/core/cq.c.
ib_poll_handler() has much better implementation (tracing,
IRQ rescheduling, proper error handling) than this SMC variant.
Since we will switch to ib_poll_handler() in the future,
revert this patch.
Link: https://lore.kernel.org/netdev/20220301105332.GA9417@linux.alibaba.com/
Suggested-by: Leon Romanovsky <leon@kernel.org>
Suggested-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The problem of SMC_CLC_DECL_ERR_REGRMB on the server is very clear.
Based on the fact that whether a new SMC connection can be accepted or
not depends on not only the limit of conn nums, but also the available
entries of rtoken. Since the rtoken release is trigger by peer, while
the conn nums is decrease by local, tons of thing can happen in this
time difference.
This only thing that needs to be mentioned is that now all connection
creations are completely protected by smc_server_lgr_pending lock, it's
enough to check only the available entries in rtokens_used_mask.
Fixes: cd6851f303 ("smc: remote memory buffers (RMBs)")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The main reason for this unexpected SMC_CLC_DECL_ERR_REGRMB in client
dues to following execution sequence:
Server Conn A: Server Conn B: Client Conn B:
smc_lgr_unregister_conn
smc_lgr_register_conn
smc_clc_send_accept ->
smc_rtoken_add
smcr_buf_unuse
-> Client Conn A:
smc_rtoken_delete
smc_lgr_unregister_conn() makes current link available to assigned to new
incoming connection, while smcr_buf_unuse() has not executed yet, which
means that smc_rtoken_add may fail because of insufficient rtoken_entry,
reversing their execution order will avoid this problem.
Fixes: 3e034725c0 ("net/smc: common functions for RMBs and send buffers")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Send data all the way down to the RDMA device is a time
consuming operation(get a new slot, maybe do RDMA Write
and send a CDC, etc). Moving those operations from BH
to user context is good for performance.
If the sock_lock is hold by user, we don't try to send
data out in the BH context, but just mark we should
send. Since the user will release the sock_lock soon, we
can do the sending there.
Add smc_release_cb() which will be called in release_sock()
and try send in the callback if needed.
This patch moves the sending part out from BH if sock lock
is hold by user. In my testing environment, this saves about
20% softirq in the qperf 4K tcp_bw test in the sender side
with no noticeable throughput drop.
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we are handling softirq workload, enable hardirq may
again interrupt the current routine of softirq, and then
try to raise softirq again. This only wastes CPU cycles
and won't have any real gain.
Since IB_CQ_REPORT_MISSED_EVENTS already make sure if
ib_req_notify_cq() returns 0, it is safe to wait for the
next event, with no need to poll the CQ again in this case.
This patch disables hardirq during the processing of softirq,
and re-arm the CQ after softirq is done. Somehow like NAPI.
Co-developed-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rmbe_update_limit is used to limit announcing receive
window updating too frequently. RFC7609 request a minimal
increase in the window size of 10% of the receive buffer
space. But current implementation used:
min_t(int, rmbe_size / 10, SOCK_MIN_SNDBUF / 2)
and SOCK_MIN_SNDBUF / 2 == 2304 Bytes, which is almost
always less then 10% of the receive buffer space.
This causes the receiver always sending CDC message to
update its consumer cursor when it consumes more then 2K
of data. And as a result, we may encounter something like
"TCP silly window syndrome" when sending 2.5~8K message.
This patch fixes this using max(rmbe_size / 10, SOCK_MIN_SNDBUF / 2).
With this patch and SMC autocorking enabled, qperf 2K/4K/8K
tcp_bw test shows 45%/75%/40% increase in throughput respectively.
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit ea785a1a573b("net/smc: Send directly when
TCP_CORK is cleared"), we don't use delayed work
to implement cork.
This patch use the same algorithm, removes the
delayed work when setting TCP_NODELAY and send
directly in setsockopt(). This also makes the
TCP_NODELAY the same as TCP.
Cc: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This add a new sysctl: net.smc.autocorking_size
We can dynamically change the behaviour of autocorking
by change the value of autocorking_size.
Setting to 0 disables autocorking in SMC
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds autocorking support for SMC which could improve
throughput for small message by x3+.
The main idea is borrowed from TCP autocorking with some RDMA
specific modification:
1. The first message should never cork to make sure we won't
bring extra latency
2. If we have posted any Tx WRs to the NIC that have not
completed, cork the new messages until:
a) Receive CQE for the last Tx WR
b) We have corked enough message on the connection
3. Try to push the corked data out when we receive CQE of
the last Tx WR to prevent the corked messages hang in
the send queue.
Both SMC autocorking and TCP autocorking check the TX completion
to decide whether we should cork or not. The difference is
when we got a SMC Tx WR completion, the data have been confirmed
by the RNIC while TCP TX completion just tells us the data
have been sent out by the local NIC.
Add an atomic variable tx_pushing in smc_connection to make
sure only one can send to let it cork more and save CDC slot.
SMC autocorking should not bring extra latency since the first
message will always been sent out immediately.
The qperf tcp_bw test shows more than x4 increase under small
message size with Mellanox connectX4-Lx, same result with other
throughput benchmarks like sockperf/netperf.
The qperf tcp_lat test shows SMC autocorking has not increase any
ping-pong latency.
Test command:
client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
-t 30 -vu tcp_{bw|lat}
server: smc_run taskset -c 1 qperf
=== Bandwidth ====
MsgSize(Bytes) SMC-NoCork TCP SMC-AutoCorking
1 0.578 MB/s 2.392 MB/s(313.57%) 2.647 MB/s(357.72%)
2 1.159 MB/s 4.780 MB/s(312.53%) 5.153 MB/s(344.71%)
4 2.283 MB/s 10.266 MB/s(349.77%) 10.363 MB/s(354.02%)
8 4.668 MB/s 19.040 MB/s(307.86%) 21.215 MB/s(354.45%)
16 9.147 MB/s 38.904 MB/s(325.31%) 41.740 MB/s(356.32%)
32 18.369 MB/s 79.587 MB/s(333.25%) 82.392 MB/s(348.52%)
64 36.562 MB/s 148.668 MB/s(306.61%) 161.564 MB/s(341.89%)
128 72.961 MB/s 274.913 MB/s(276.80%) 325.363 MB/s(345.94%)
256 144.705 MB/s 512.059 MB/s(253.86%) 633.743 MB/s(337.96%)
512 288.873 MB/s 884.977 MB/s(206.35%) 1250.681 MB/s(332.95%)
1024 574.180 MB/s 1337.736 MB/s(132.98%) 2246.121 MB/s(291.19%)
2048 1095.192 MB/s 1865.952 MB/s( 70.38%) 2057.767 MB/s( 87.89%)
4096 2066.157 MB/s 2380.337 MB/s( 15.21%) 2173.983 MB/s( 5.22%)
8192 3717.198 MB/s 2733.073 MB/s(-26.47%) 3491.223 MB/s( -6.08%)
16384 4742.221 MB/s 2958.693 MB/s(-37.61%) 4637.692 MB/s( -2.20%)
32768 5349.550 MB/s 3061.285 MB/s(-42.77%) 5385.796 MB/s( 0.68%)
65536 5162.919 MB/s 3731.408 MB/s(-27.73%) 5223.890 MB/s( 1.18%)
==== Latency ====
MsgSize(Bytes) SMC-NoCork TCP SMC-AutoCorking
1 10.540 us 11.938 us( 13.26%) 10.573 us( 0.31%)
2 10.996 us 11.992 us( 9.06%) 10.269 us( -6.61%)
4 10.229 us 11.687 us( 14.25%) 10.240 us( 0.11%)
8 10.203 us 11.653 us( 14.21%) 10.402 us( 1.95%)
16 10.530 us 11.313 us( 7.44%) 10.599 us( 0.66%)
32 10.241 us 11.586 us( 13.13%) 10.223 us( -0.18%)
64 10.693 us 11.652 us( 8.97%) 10.251 us( -4.13%)
128 10.597 us 11.579 us( 9.27%) 10.494 us( -0.97%)
256 10.409 us 11.957 us( 14.87%) 10.710 us( 2.89%)
512 11.088 us 12.505 us( 12.78%) 10.547 us( -4.88%)
1024 11.240 us 12.255 us( 9.03%) 10.787 us( -4.03%)
2048 11.485 us 16.970 us( 47.76%) 11.256 us( -1.99%)
4096 12.077 us 13.948 us( 15.49%) 12.230 us( 1.27%)
8192 13.683 us 16.693 us( 22.00%) 13.786 us( 0.75%)
16384 16.470 us 23.615 us( 43.38%) 16.459 us( -0.07%)
32768 22.540 us 40.966 us( 81.75%) 23.284 us( 3.30%)
65536 34.192 us 73.003 us(113.51%) 34.233 us( 0.12%)
With SMC autocorking support, we can archive better throughput
than TCP in most message sizes without any latency trade-off.
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add sysctl interface to support container environment
for SMC as we talk in the mail list.
Link: https://lore.kernel.org/netdev/20220224020253.GF5443@linux.alibaba.com
Co-developed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This also calls trace_smc_tx_sendmsg() even if data is corked. For ease
of understanding, if statements are not expanded here.
Link: https://lore.kernel.org/all/f4166712-9a1e-51a0-409d-b7df25a66c52@linux.ibm.com/
Fixes: 139653bc66 ("net/smc: Remove corked dealyed work")
Suggested-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch calls smc_ib_unregister_client() when tcp_register_ulp()
fails, and make sure to clean it up.
Fixes: d7cd421da9 ("net/smc: Introduce TCP ULP support")
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There's a potential leak issue under following execution sequence :
smc_release smc_connect_work
if (sk->sk_state == SMC_INIT)
send_clc_confirim
tcp_abort();
...
sk.sk_state = SMC_ACTIVE
smc_close_active
switch(sk->sk_state) {
...
case SMC_ACTIVE:
smc_close_final()
// then wait peer closed
Unfortunately, tcp_abort() may discard CLC CONFIRM messages that are
still in the tcp send buffer, in which case our connection token cannot
be delivered to the server side, which means that we cannot get a
passive close message at all. Therefore, it is impossible for the to be
disconnected at all.
This patch tries a very simple way to avoid this issue, once the state
has changed to SMC_ACTIVE after tcp_abort(), we can actively abort the
smc connection, considering that the state is SMC_INIT before
tcp_abort(), abandoning the complete disconnection process should not
cause too much problem.
In fact, this problem may exist as long as the CLC CONFIRM message is
not received by the server. Whether a timer should be added after
smc_close_final() needs to be discussed in the future. But even so, this
patch provides a faster release for connection in above case, it should
also be valuable.
Fixes: 39f41f367b ("net/smc: common release code for non-accepted sockets")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
smc_pnetid_by_table_ib() uses read_lock() and then it calls smc_pnet_apply_ib()
which, in turn, calls mutex_lock(&smc_ib_devices.mutex).
read_lock() disables preemption. Therefore, the code acquires a mutex while in
atomic context and it leads to a SAC bug.
Fix this bug by replacing the rwlock with a mutex.
Reported-and-tested-by: syzbot+4f322a6d84e991c38775@syzkaller.appspotmail.com
Fixes: 64e28b52c7 ("net/smc: add pnet table namespace support")
Confirmed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/20220223100252.22562-1-fmdefrancesco@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These two error paths need to release_sock(sk) before returning.
Fixes: a6a6fe27ba ("net/smc: Dynamic control handshake limitation by socket options")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When smc_connect_clc() times out, it will return -EAGAIN(tcp_recvmsg
retuns -EAGAIN while timeout), then this value will passed to the
application, which is quite confusing to the applications, makes
inconsistency with TCP.
From the manual of connect, ETIMEDOUT is more suitable, and this patch
try convert EAGAIN to ETIMEDOUT in that case.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/1644913490-21594-1-git-send-email-alibuda@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The previous patch introduces a lock-free version of smc_tx_work() to
solve unnecessary lock contention, which is expected to be held lock.
So this adds comment to remind people to keep an eye out for locks.
Suggested-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Although we can control SMC handshake limitation through socket options,
which means that applications who need it must modify their code. It's
quite troublesome for many existing applications. This patch modifies
the global default value of SMC handshake limitation through netlink,
providing a way to put constraint on handshake without modifies any code
for applications.
Suggested-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch aims to add dynamic control for SMC handshake limitation for
every smc sockets, in production environment, it is possible for the
same applications to handle different service types, and may have
different opinion on SMC handshake limitation.
This patch try socket options to complete it, since we don't have socket
option level for SMC yet, which requires us to implement it at the same
time.
This patch does the following:
- add new socket option level: SOL_SMC.
- add new SMC socket option: SMC_LIMIT_HS.
- provide getter/setter for SMC socket options.
Link: https://lore.kernel.org/all/20f504f961e1a803f85d64229ad84260434203bd.1644323503.git.alibuda@linux.alibaba.com/
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch intends to provide a mechanism to put constraint on SMC
connections visit according to the pressure of SMC handshake process.
At present, frequent visits will cause the incoming connections to be
backlogged in SMC handshake queue, raise the connections established
time. Which is quite unacceptable for those applications who base on
short lived connections.
There are two ways to implement this mechanism:
1. Put limitation after TCP established.
2. Put limitation before TCP established.
In the first way, we need to wait and receive CLC messages that the
client will potentially send, and then actively reply with a decline
message, in a sense, which is also a sort of SMC handshake, affect the
connections established time on its way.
In the second way, the only problem is that we need to inject SMC logic
into TCP when it is about to reply the incoming SYN, since we already do
that, it's seems not a problem anymore. And advantage is obvious, few
additional processes are required to complete the constraint.
This patch use the second way. After this patch, connections who beyond
constraint will not informed any SMC indication, and SMC will not be
involved in any of its subsequent processes.
Link: https://lore.kernel.org/all/1641301961-59331-1-git-send-email-alibuda@linux.alibaba.com/
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current implementation does not handling backlog semantics, one
potential risk is that server will be flooded by infinite amount
connections, even if client was SMC-incapable.
This patch works to put a limit on backlog connections, referring to the
TCP implementation, we divides SMC connections into two categories:
1. Half SMC connection, which includes all TCP established while SMC not
connections.
2. Full SMC connection, which includes all SMC established connections.
For half SMC connection, since all half SMC connections starts with TCP
established, we can achieve our goal by put a limit before TCP
established. Refer to the implementation of TCP, this limits will based
on not only the half SMC connections but also the full connections,
which is also a constraint on full SMC connections.
For full SMC connections, although we know exactly where it starts, it's
quite hard to put a limit before it. The easiest way is to block wait
before receive SMC confirm CLC message, while it's under protection by
smc_server_lgr_pending, a global lock, which leads this limit to the
entire host instead of a single listen socket. Another way is to drop
the full connections, but considering the cast of SMC connections, we
prefer to keep full SMC connections.
Even so, the limits of full SMC connections still exists, see commits
about half SMC connection below.
After this patch, the limits of backend connection shows like:
For SMC:
1. Client with SMC-capability can makes 2 * backlog full SMC connections
or 1 * backlog half SMC connections and 1 * backlog full SMC
connections at most.
2. Client without SMC-capability can only makes 1 * backlog half TCP
connections and 1 * backlog full TCP connections.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In multithread and 10K connections benchmark, the backend TCP connection
established very slowly, and lots of TCP connections stay in SYN_SENT
state.
Client: smc_run wrk -c 10000 -t 4 http://server
the netstate of server host shows like:
145042 times the listen queue of a socket overflowed
145042 SYNs to LISTEN sockets dropped
One reason of this issue is that, since the smc_tcp_listen_work() shared
the same workqueue (smc_hs_wq) with smc_listen_work(), while the
smc_listen_work() do blocking wait for smc connection established. Once
the workqueue became congested, it's will block the accept() from TCP
listen.
This patch creates a independent workqueue(smc_tcp_ls_wq) for
smc_tcp_listen_work(), separate it from smc_listen_work(), which is
quite acceptable considering that smc_tcp_listen_work() runs very fast.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The callback functions of clcsock will be saved and replaced during
the fallback. But if the fallback happens more than once, then the
copies of these callback functions will be overwritten incorrectly,
resulting in a loop call issue:
clcsk->sk_error_report
|- smc_fback_error_report() <------------------------------|
|- smc_fback_forward_wakeup() | (loop)
|- clcsock_callback() (incorrectly overwritten) |
|- smc->clcsk_error_report() ------------------|
So this patch fixes the issue by saving these function pointers only
once in the fallback and avoiding overwriting.
Reported-by: syzbot+4de3c0e8a263e1e499bc@syzkaller.appspotmail.com
Fixes: 341adeec9a ("net/smc: Forward wakeup to smc socket waitqueue after fallback")
Link: https://lore.kernel.org/r/0000000000006d045e05d78776f6@google.com
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
My last patch moved the netdev_tracker_alloc() call to a section
protected by a write_lock().
I should have replaced GFP_KERNEL with GFP_ATOMIC to avoid the infamous:
BUG: sleeping function called from invalid context at include/linux/sched/mm.h:256
Fixes: 28f9222138 ("net/smc: fix ref_tracker issue in smc_pnet_add()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The change of sizeof(struct smc_diag_linkinfo) by commit 79d39fc503
("net/smc: Add netlink net namespace support") introduced an ABI
regression: since struct smc_diag_lgrinfo contains an object of
type "struct smc_diag_linkinfo", offset of all subsequent members
of struct smc_diag_lgrinfo was changed by that change.
As result, applications compiled with the old version
of struct smc_diag_linkinfo will receive garbage in
struct smc_diag_lgrinfo.role if the kernel implements
this new version of struct smc_diag_linkinfo.
Fix this regression by reverting the part of commit 79d39fc503 that
changes struct smc_diag_linkinfo. After all, there is SMC_GEN_NETLINK
interface which is good enough, so there is probably no need to touch
the smc_diag ABI in the first place.
Fixes: 79d39fc503 ("net/smc: Add netlink net namespace support")
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/20220202030904.GA9742@altlinux.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This introduces a new corked flag, MSG_SENDPAGE_NOTLAST, which is
involved in syscall sendfile() [1], it indicates this is not the last
page. So we can cork the data until the page is not specify this flag.
It has the same effect as MSG_MORE, but existed in sendfile() only.
This patch adds a option MSG_SENDPAGE_NOTLAST for corking data, try to
cork more data before sending when using sendfile(), which acts like
TCP's behaviour. Also, this reimplements the default sendpage to inform
that it is supported to some extent.
[1] https://man7.org/linux/man-pages/man2/sendfile.2.html
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based on the manual of TCP_CORK [1] and MSG_MORE [2], these two options
have the same effect. Applications can set these options and informs the
kernel to pend the data, and send them out only when the socket or
syscall does not specify this flag. In other words, there's no need to
send data out by a delayed work, which will queue a lot of work.
This removes corked delayed work with SMC_TX_CORK_DELAY (250ms), and the
applications control how/when to send them out. It improves the
performance for sendfile and throughput, and remove unnecessary race of
lock_sock(). This also unlocks the limitation of sndbuf, and try to fill
it up before sending.
[1] https://linux.die.net/man/7/tcp
[2] https://man7.org/linux/man-pages/man2/send.2.html
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to the man page of TCP_CORK [1], if set, don't send out
partial frames. All queued partial frames are sent when option is
cleared again.
When applications call setsockopt to disable TCP_CORK, this call is
protected by lock_sock(), and tries to mod_delayed_work() to 0, in order
to send pending data right now. However, the delayed work smc_tx_work is
also protected by lock_sock(). There introduces lock contention for
sending data.
To fix it, send pending data directly which acts like TCP, without
lock_sock() protected in the context of setsockopt (already lock_sock()ed),
and cancel unnecessary dealyed work, which is protected by lock.
[1] https://linux.die.net/man/7/tcp
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we replace TCP with SMC and a fallback occurs, there may be
some socket waitqueue entries remaining in smc socket->wq, such
as eppoll_entries inserted by userspace applications.
After the fallback, data flows over TCP/IP and only clcsocket->wq
will be woken up. Applications can't be notified by the entries
which were inserted in smc socket->wq before fallback. So we need
a mechanism to wake up smc socket->wq at the same time if some
entries remaining in it.
The current workaround is to transfer the entries from smc socket->wq
to clcsock->wq during the fallback. But this may cause a crash
like this:
general protection fault, probably for non-canonical address 0xdead000000000100: 0000 [#1] PREEMPT SMP PTI
CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Tainted: G E 5.16.0+ #107
RIP: 0010:__wake_up_common+0x65/0x170
Call Trace:
<IRQ>
__wake_up_common_lock+0x7a/0xc0
sock_def_readable+0x3c/0x70
tcp_data_queue+0x4a7/0xc40
tcp_rcv_established+0x32f/0x660
? sk_filter_trim_cap+0xcb/0x2e0
tcp_v4_do_rcv+0x10b/0x260
tcp_v4_rcv+0xd2a/0xde0
ip_protocol_deliver_rcu+0x3b/0x1d0
ip_local_deliver_finish+0x54/0x60
ip_local_deliver+0x6a/0x110
? tcp_v4_early_demux+0xa2/0x140
? tcp_v4_early_demux+0x10d/0x140
ip_sublist_rcv_finish+0x49/0x60
ip_sublist_rcv+0x19d/0x230
ip_list_rcv+0x13e/0x170
__netif_receive_skb_list_core+0x1c2/0x240
netif_receive_skb_list_internal+0x1e6/0x320
napi_complete_done+0x11d/0x190
mlx5e_napi_poll+0x163/0x6b0 [mlx5_core]
__napi_poll+0x3c/0x1b0
net_rx_action+0x27c/0x300
__do_softirq+0x114/0x2d2
irq_exit_rcu+0xb4/0xe0
common_interrupt+0xba/0xe0
</IRQ>
<TASK>
The crash is caused by privately transferring waitqueue entries from
smc socket->wq to clcsock->wq. The owners of these entries, such as
epoll, have no idea that the entries have been transferred to a
different socket wait queue and still use original waitqueue spinlock
(smc socket->wq.wait.lock) to make the entries operation exclusive,
but it doesn't work. The operations to the entries, such as removing
from the waitqueue (now is clcsock->wq after fallback), may cause a
crash when clcsock waitqueue is being iterated over at the moment.
This patch tries to fix this by no longer transferring wait queue
entries privately, but introducing own implementations of clcsock's
callback functions in fallback situation. The callback functions will
forward the wakeup to smc socket->wq if clcsock->wq is actually woken
up and smc socket->wq has remaining entries.
Fixes: 2153bd1e3d ("net/smc: Transfer remaining wait queue entries during fallback")
Suggested-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We encountered a crash in smc_setsockopt() and it is caused by
accessing smc->clcsock after clcsock was released.
BUG: kernel NULL pointer dereference, address: 0000000000000020
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 1 PID: 50309 Comm: nginx Kdump: loaded Tainted: G E 5.16.0-rc4+ #53
RIP: 0010:smc_setsockopt+0x59/0x280 [smc]
Call Trace:
<TASK>
__sys_setsockopt+0xfc/0x190
__x64_sys_setsockopt+0x20/0x30
do_syscall_64+0x34/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f16ba83918e
</TASK>
This patch tries to fix it by holding clcsock_release_lock and
checking whether clcsock has already been released before access.
In case that a crash of the same reason happens in smc_getsockopt()
or smc_switch_to_fallback(), this patch also checkes smc->clcsock
in them too. And the caller of smc_switch_to_fallback() will identify
whether fallback succeeds according to the return value.
Fixes: fd57770dd1 ("net/smc: wait for pending work before clcsock release_sock")
Link: https://lore.kernel.org/lkml/5dd7ffd1-28e2-24cc-9442-1defec27375e@linux.ibm.com/T/
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A hung_task is observed when removing SMC-R devices. Suppose that
a link group has two active links(lnk_A, lnk_B) associated with two
different SMC-R devices(dev_A, dev_B). When dev_A is removed, the
link group will be removed from smc_lgr_list and added into
lgr_linkdown_list. lnk_A will be cleared and smcibdev(A)->lnk_cnt
will reach to zero. However, when dev_B is removed then, the link
group can't be found in smc_lgr_list and lnk_B won't be cleared,
making smcibdev->lnk_cnt never reaches zero, which causes a hung_task.
This patch fixes this issue by restoring the implementation of
smc_smcr_terminate_all() to what it was before commit 349d43127d
("net/smc: fix kernel panic caused by race of smc_sock"). The original
implementation also satisfies the intention that make sure QP destroy
earlier than CQ destroy because we will always wait for smcibdev->lnk_cnt
reaches zero, which guarantees QP has been destroyed.
Fixes: 349d43127d ("net/smc: fix kernel panic caused by race of smc_sock")
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The declaration of smc_wr_tx_dismiss_slots() is unused.
So remove it.
Fixes: 349d43127d ("net/smc: fix kernel panic caused by race of smc_sock")
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We encountered some crashes caused by the race between SMC-R
link access and link clear that triggered by abnormal link
group termination, such as port error.
Here is an example of this kind of crashes:
BUG: kernel NULL pointer dereference, address: 0000000000000000
Workqueue: smc_hs_wq smc_listen_work [smc]
RIP: 0010:smc_llc_flow_initiate+0x44/0x190 [smc]
Call Trace:
<TASK>
? __smc_buf_create+0x75a/0x950 [smc]
smcr_lgr_reg_rmbs+0x2a/0xbf [smc]
smc_listen_work+0xf72/0x1230 [smc]
? process_one_work+0x25c/0x600
process_one_work+0x25c/0x600
worker_thread+0x4f/0x3a0
? process_one_work+0x600/0x600
kthread+0x15d/0x1a0
? set_kthread_struct+0x40/0x40
ret_from_fork+0x1f/0x30
</TASK>
smc_listen_work() __smc_lgr_terminate()
---------------------------------------------------------------
| smc_lgr_free()
| |- smcr_link_clear()
| |- memset(lnk, 0)
smc_listen_rdma_reg() |
|- smcr_lgr_reg_rmbs() |
|- smc_llc_flow_initiate() |
|- access lnk->lgr (panic) |
These crashes are similarly caused by clearing SMC-R link
resources when some functions is still accessing to them.
This patch tries to fix the issue by introducing reference
count of SMC-R links and ensuring that the sensitive resources
of links won't be cleared until reference count reaches zero.
The operation to the SMC-R link reference count can be concluded
as follows:
object [hold or initialized as 1] [put]
--------------------------------------------------------------------
links smcr_link_init() smcr_link_clear()
connections smc_conn_create() smc_conn_free()
Through this way, the clear of SMC-R links is later than the
free of all the smc connections above it, thus avoiding the
unsafe reference to SMC-R links.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is no longer suitable to identify whether a smc connection
is registered in a link group through checking if conn->lgr
is NULL, because conn->lgr won't be reset even the connection
is unregistered from a link group.
So this patch introduces a new helper smc_conn_lgr_valid() and
replaces all the check of conn->lgr in original implementation
with the new helper to judge if conn->lgr is valid to use.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We encountered some crashes caused by the race between the access
and the termination of link groups.
Here are some of panic stacks we met:
1) Race between smc_clc_wait_msg() and __smc_lgr_terminate()
BUG: kernel NULL pointer dereference, address: 00000000000002f0
Workqueue: smc_hs_wq smc_listen_work [smc]
RIP: 0010:smc_clc_wait_msg+0x3eb/0x5c0 [smc]
Call Trace:
<TASK>
? smc_clc_send_accept+0x45/0xa0 [smc]
? smc_clc_send_accept+0x45/0xa0 [smc]
smc_listen_work+0x783/0x1220 [smc]
? finish_task_switch+0xc4/0x2e0
? process_one_work+0x1ad/0x3c0
process_one_work+0x1ad/0x3c0
worker_thread+0x4c/0x390
? rescuer_thread+0x320/0x320
kthread+0x149/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x1f/0x30
</TASK>
smc_listen_work() abnormal case like port error
---------------------------------------------------------------
| __smc_lgr_terminate()
| |- smc_conn_kill()
| |- smc_lgr_unregister_conn()
| |- set conn->lgr = NULL
smc_clc_wait_msg() |
|- access conn->lgr (panic) |
2) Race between smc_setsockopt() and __smc_lgr_terminate()
BUG: kernel NULL pointer dereference, address: 00000000000002e8
RIP: 0010:smc_setsockopt+0x17a/0x280 [smc]
Call Trace:
<TASK>
__sys_setsockopt+0xfc/0x190
__x64_sys_setsockopt+0x20/0x30
do_syscall_64+0x34/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
smc_setsockopt() abnormal case like port error
--------------------------------------------------------------
| __smc_lgr_terminate()
| |- smc_conn_kill()
| |- smc_lgr_unregister_conn()
| |- set conn->lgr = NULL
mod_delayed_work() |
|- access conn->lgr (panic) |
There are some other panic places and they are caused by the
similar reason as described above, which is accessing link
group after termination, thus getting a NULL pointer or invalid
resource.
Currently, there seems to be no synchronization between the
link group access and a sudden termination of it. This patch
tries to fix this by introducing reference count of link group
and not freeing link group until reference count is zero.
Link group might be referred to by links or smc connections. So
the operation to the link group reference count can be concluded
as follows:
object [hold or initialized as 1] [put]
-------------------------------------------------------------------
link group smc_lgr_create() smc_lgr_free()
connections smc_conn_create() smc_conn_free()
links smcr_link_init() smcr_link_clear()
Througth this way, we extend the life cycle of link group and
ensure it is longer than the life cycle of connections and links
above it, so that avoid invalid access to link group after its
termination.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SMC connections might fail to be registered in a link group due to
unable to find a usable link during its creation. As a result,
smc_conn_create() will return a failure and most resources related
to the connection won't be applied or initialized, such as
conn->abort_work or conn->lnk.
If smc_conn_free() is invoked later, it will try to access the
uninitialized resources related to the connection, thus causing
a warning or crash.
This patch tries to fix this by resetting conn->lgr to NULL if an
abnormal exit occurs in smc_lgr_register_conn(), thus avoiding the
access to uninitialized resources in smc_conn_free().
Meanwhile, the new created link group should be terminated if smc
connections can't be registered in it. So smc_lgr_cleanup_early() is
modified to take care of link group only and invoked to terminate
unusable link group by smc_conn_create(). The call to smc_conn_free()
is moved out from smc_lgr_cleanup_early() to smc_conn_abort().
Fixes: 56bc3b2094 ("net/smc: assign link to a new connection")
Suggested-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add comments for both smc_link_sendable() and smc_link_usable()
to help better distinguish and use them.
No function changes.
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The pointer link is being re-assigned the same value that it was
initialized with in the previous declaration statement. The
re-assignment is redundant and can be removed.
Fixes: 387707fdf4 ("net/smc: convert static link ID to dynamic references")
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This implements TCP ULP for SMC, helps applications to replace TCP with
SMC protocol in place. And we use it to implement transparent
replacement.
This replaces original TCP sockets with SMC, reuse TCP as clcsock when
calling setsockopt with TCP_ULP option, and without any overhead.
To replace TCP sockets with SMC, there are two approaches:
- use setsockopt() syscall with TCP_ULP option, if error, it would
fallback to TCP.
- use BPF prog with types BPF_CGROUP_INET_SOCK_CREATE or others to
replace transparently. BPF hooks some points in create socket, bind
and others, users can inject their BPF logics without modifying their
applications, and choose which connections should be replaced with SMC
by calling setsockopt() in BPF prog, based on rules, such as TCP tuples,
PID, cgroup, etc...
BPF doesn't support calling setsockopt with TCP_ULP now, I will send the
patches after this accepted.
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This prints net namespace ID, helps us to distinguish different net
namespaces when using tracepoints.
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds net namespace ID to the kernel log, net_cookie is unique in
the whole system. It is useful in container environment.
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds net namespace ID to diag of linkgroup, helps us to distinguish
different namespaces, and net_cookie is unique in the whole system.
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, rdma device supports exclusive net namespace isolation,
however linkgroup doesn't know and support ibdev net namespace.
Applications in the containers don't want to share the nics if we
enabled rdma exclusive mode. Every net namespaces should have their own
linkgroups.
This patch introduce a new field net for linkgroup, which is standing
for the ibdev net namespace in the linkgroup. The net in linkgroup is
initialized with the net namespace of link's ibdev. It compares the net
of linkgroup and sock or ibdev before choose it, if no matched, create
new one in current net namespace. If rdma net namespace exclusive mode
is not enabled, it behaves as before.
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-12-30
The following pull-request contains BPF updates for your *net-next* tree.
We've added 72 non-merge commits during the last 20 day(s) which contain
a total of 223 files changed, 3510 insertions(+), 1591 deletions(-).
The main changes are:
1) Automatic setrlimit in libbpf when bpf is memcg's in the kernel, from Andrii.
2) Beautify and de-verbose verifier logs, from Christy.
3) Composable verifier types, from Hao.
4) bpf_strncmp helper, from Hou.
5) bpf.h header dependency cleanup, from Jakub.
6) get_func_[arg|ret|arg_cnt] helpers, from Jiri.
7) Sleepable local storage, from KP.
8) Extend kfunc with PTR_TO_CTX, PTR_TO_MEM argument support, from Kumar.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
commit 077cdda764 ("net/mlx5e: TC, Fix memory leak with rules with internal port")
commit 31108d142f ("net/mlx5: Fix some error handling paths in 'mlx5e_tc_add_fdb_flow()'")
commit 4390c6edc0 ("net/mlx5: Fix some error handling paths in 'mlx5e_tc_add_fdb_flow()'")
https://lore.kernel.org/all/20211229065352.30178-1-saeed@kernel.org/
net/smc/smc_wr.c
commit 49dc9013e3 ("net/smc: Use the bitmap API when applicable")
commit 349d43127d ("net/smc: fix kernel panic caused by race of smc_sock")
bitmap_zero()/memset() is removed by the fix
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Using the bitmap API is less verbose than hand writing them.
It also improves the semantic.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock.h is pretty heavily used (5k objects rebuilt on x86 after
it's touched). We can drop the include of filter.h from it and
add a forward declaration of struct sk_filter instead.
This decreases the number of rebuilt objects when bpf.h
is touched from ~5k to ~1k.
There's a lot of missing includes this was masking. Primarily
in networking tho, this time.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://lore.kernel.org/bpf/20211229004913.513372-1-kuba@kernel.org
A crash occurs when smc_cdc_tx_handler() tries to access smc_sock
but smc_release() has already freed it.
[ 4570.695099] BUG: unable to handle page fault for address: 000000002eae9e88
[ 4570.696048] #PF: supervisor write access in kernel mode
[ 4570.696728] #PF: error_code(0x0002) - not-present page
[ 4570.697401] PGD 0 P4D 0
[ 4570.697716] Oops: 0002 [#1] PREEMPT SMP NOPTI
[ 4570.698228] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.16.0-rc4+ #111
[ 4570.699013] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/0
[ 4570.699933] RIP: 0010:_raw_spin_lock+0x1a/0x30
<...>
[ 4570.711446] Call Trace:
[ 4570.711746] <IRQ>
[ 4570.711992] smc_cdc_tx_handler+0x41/0xc0
[ 4570.712470] smc_wr_tx_tasklet_fn+0x213/0x560
[ 4570.712981] ? smc_cdc_tx_dismisser+0x10/0x10
[ 4570.713489] tasklet_action_common.isra.17+0x66/0x140
[ 4570.714083] __do_softirq+0x123/0x2f4
[ 4570.714521] irq_exit_rcu+0xc4/0xf0
[ 4570.714934] common_interrupt+0xba/0xe0
Though smc_cdc_tx_handler() checked the existence of smc connection,
smc_release() may have already dismissed and released the smc socket
before smc_cdc_tx_handler() further visits it.
smc_cdc_tx_handler() |smc_release()
if (!conn) |
|
|smc_cdc_tx_dismiss_slots()
| smc_cdc_tx_dismisser()
|
|sock_put(&smc->sk) <- last sock_put,
| smc_sock freed
bh_lock_sock(&smc->sk) (panic) |
To make sure we won't receive any CDC messages after we free the
smc_sock, add a refcount on the smc_connection for inflight CDC
message(posted to the QP but haven't received related CQE), and
don't release the smc_connection until all the inflight CDC messages
haven been done, for both success or failed ones.
Using refcount on CDC messages brings another problem: when the link
is going to be destroyed, smcr_link_clear() will reset the QP, which
then remove all the pending CQEs related to the QP in the CQ. To make
sure all the CQEs will always come back so the refcount on the
smc_connection can always reach 0, smc_ib_modify_qp_reset() was replaced
by smc_ib_modify_qp_error().
And remove the timeout in smc_wr_tx_wait_no_pending_sends() since we
need to wait for all pending WQEs done, or we may encounter use-after-
free when handling CQEs.
For IB device removal routine, we need to wait for all the QPs on that
device been destroyed before we can destroy CQs on the device, or
the refcount on smc_connection won't reach 0 and smc_sock cannot be
released.
Fixes: 5f08318f61 ("smc: connection data control (CDC)")
Reported-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We found smc_llc_send_link_delete_all() sometimes wait
for 2s timeout when testing with RDMA link up/down.
It is possible when a smc_link is in ACTIVATING state,
the underlaying QP is still in RESET or RTR state, which
cannot send any messages out.
smc_llc_send_link_delete_all() use smc_link_usable() to
checks whether the link is usable, if the QP is still in
RESET or RTR state, but the smc_link is in ACTIVATING, this
LLC message will always fail without any CQE entering the
CQ, and we will always wait 2s before timeout.
Since we cannot send any messages through the QP before
the QP enter RTS. I add a wrapper smc_link_sendable()
which checks the state of QP along with the link state.
And replace smc_link_usable() with smc_link_sendable()
in all LLC & CDC message sending routine.
Fixes: 5f08318f61 ("smc: connection data control (CDC)")
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In smc_wr_tx_send_wait() the completion on index specified by
pend->idx is initialized and after smc_wr_tx_send() was called the wait
for completion starts. pend->idx is used to get the correct index for
the wait, but the pend structure could already be cleared in
smc_wr_tx_process_cqe().
Introduce pnd_idx to hold and use a local copy of the correct index.
Fixes: 09c61d24f9 ("net/smc: wait for departure of an IB message")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In nginx/wrk benchmark, there's a hung problem with high probability
on case likes that: (client will last several minutes to exit)
server: smc_run nginx
client: smc_run wrk -c 10000 -t 1 http://server
Client hangs with the following backtrace:
0 [ffffa7ce8Of3bbf8] __schedule at ffffffff9f9eOd5f
1 [ffffa7ce8Of3bc88] schedule at ffffffff9f9eløe6
2 [ffffa7ce8Of3bcaO] schedule_timeout at ffffffff9f9e3f3c
3 [ffffa7ce8Of3bd2O] wait_for_common at ffffffff9f9el9de
4 [ffffa7ce8Of3bd8O] __flush_work at ffffffff9fOfeOl3
5 [ffffa7ce8øf3bdfO] smc_release at ffffffffcO697d24 [smc]
6 [ffffa7ce8Of3be2O] __sock_release at ffffffff9f8O2e2d
7 [ffffa7ce8Of3be4ø] sock_close at ffffffff9f8ø2ebl
8 [ffffa7ce8øf3be48] __fput at ffffffff9f334f93
9 [ffffa7ce8Of3be78] task_work_run at ffffffff9flOlff5
10 [ffffa7ce8Of3beaO] do_exit at ffffffff9fOe5Ol2
11 [ffffa7ce8Of3bflO] do_group_exit at ffffffff9fOe592a
12 [ffffa7ce8Of3bf38] __x64_sys_exit_group at ffffffff9fOe5994
13 [ffffa7ce8Of3bf4O] do_syscall_64 at ffffffff9f9d4373
14 [ffffa7ce8Of3bfsO] entry_SYSCALL_64_after_hwframe at ffffffff9fa0007c
This issue dues to flush_work(), which is used to wait for
smc_connect_work() to finish in smc_release(). Once lots of
smc_connect_work() was pending or all executing work dangling,
smc_release() has to block until one worker comes to free, which
is equivalent to wait another smc_connnect_work() to finish.
In order to fix this, There are two changes:
1. For those idle smc_connect_work(), cancel it from the workqueue; for
executing smc_connect_work(), waiting for it to finish. For that
purpose, replace flush_work() with cancel_work_sync().
2. Since smc_connect() hold a reference for passive closing, if
smc_connect_work() has been cancelled, release the reference.
Fixes: 24ac3a08e6 ("net/smc: rebuild nonblocking connect")
Reported-by: Tony Lu <tonylu@linux.alibaba.com>
Tested-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/1639571361-101128-1-git-send-email-alibuda@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Currently, buffers are cleared when smc connections are created and
buffers are reused. This slows down the speed of establishing new
connections. In most cases, the applications want to establish
connections as quickly as possible.
This patch moves memset() from connection creation path to release and
buffer unuse path, this trades off between speed of establishing and
release.
Test environments:
- CPU Intel Xeon Platinum 8 core, mem 32 GiB, nic Mellanox CX4
- socket sndbuf / rcvbuf: 16384 / 131072 bytes
- w/o first round, 5 rounds, avg, 100 conns batch per round
- smc_buf_create() use bpftrace kprobe, introduces extra latency
Latency benchmarks for smc_buf_create():
w/o patch : 19040.0 ns
w/ patch : 1932.6 ns
ratio : 10.2% (-89.8%)
Latency benchmarks for socket create and connect:
w/o patch : 143.3 us
w/ patch : 102.2 us
ratio : 71.3% (-28.7%)
The latency of establishing connections is reduced by 28.7%.
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/20211203113331.2818873-1-kgraul@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>