On most filesystems, there is no reason to delay reaping an nfsd_file
just because its underlying inode is still under writeback. nfsd just
relies on client activity or the local flusher threads to do writeback.
The main exception is NFS, which flushes all of its dirty data on last
close. Add a new EXPORT_OP_FLUSH_ON_CLOSE flag to allow filesystems to
signal that they do this, and only skip closing files under writeback on
such filesystems.
Also, remove a redundant NULL file pointer check in
nfsd_file_check_writeback, and clean up nfs's export op flag
definitions.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The last thing that filp_close does is an fput, so don't bother taking
and putting the extra reference.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
David Howells mentioned that he found this bit of code confusing, so
sprinkle in some comments to clarify.
Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
An error from break_lease is non-fatal, so we needn't destroy the
nfsd_file in that case. Just put the reference like we normally would
and return the error.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
test_bit returns bool, so we can just compare the result of that to the
key->gc value without the "!!".
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Since v4 files are expected to be long-lived, there's little value in
closing them out of the cache when there is conflicting access.
Change the comparator to also match the gc value in the key. Change both
of the current users of that key to set the gc value in the key to
"true".
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
commit 4bb7aac70b ("net: phy: fix circular LEDS_CLASS dependencies")
solved a build failure, but introduces a new config knob with a default
'y' value: PHYLIB_LEDS.
The latter is against the current new config policy. The exception
was raised to allow the user to catch bad configurations without led
support.
Anyway the current definition of PHYLIB_LEDS does not fit the above
goal: if LEDS_CLASS is disabled, the new config will be available
only with PHYLIB disabled, too.
Hide the mentioned config, to preserve the randconfig testing done so
far, while respecting the mentioned policy.
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/d82489be8ed911c383c3447e9abf469995ccf39a.1682496488.git.pabeni@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Pointer variables of void * type do not require type cast.
Signed-off-by: wuych <yunchuan@nfschina.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
syzkaller reported [0] memory leaks of an UDP socket and ZEROCOPY
skbs. We can reproduce the problem with these sequences:
sk = socket(AF_INET, SOCK_DGRAM, 0)
sk.setsockopt(SOL_SOCKET, SO_TIMESTAMPING, SOF_TIMESTAMPING_TX_SOFTWARE)
sk.setsockopt(SOL_SOCKET, SO_ZEROCOPY, 1)
sk.sendto(b'', MSG_ZEROCOPY, ('127.0.0.1', 53))
sk.close()
sendmsg() calls msg_zerocopy_alloc(), which allocates a skb, sets
skb->cb->ubuf.refcnt to 1, and calls sock_hold(). Here, struct
ubuf_info_msgzc indirectly holds a refcnt of the socket. When the
skb is sent, __skb_tstamp_tx() clones it and puts the clone into
the socket's error queue with the TX timestamp.
When the original skb is received locally, skb_copy_ubufs() calls
skb_unclone(), and pskb_expand_head() increments skb->cb->ubuf.refcnt.
This additional count is decremented while freeing the skb, but struct
ubuf_info_msgzc still has a refcnt, so __msg_zerocopy_callback() is
not called.
The last refcnt is not released unless we retrieve the TX timestamped
skb by recvmsg(). Since we clear the error queue in inet_sock_destruct()
after the socket's refcnt reaches 0, there is a circular dependency.
If we close() the socket holding such skbs, we never call sock_put()
and leak the count, sk, and skb.
TCP has the same problem, and commit e0c8bccd40 ("net: stream:
purge sk_error_queue in sk_stream_kill_queues()") tried to fix it
by calling skb_queue_purge() during close(). However, there is a
small chance that skb queued in a qdisc or device could be put
into the error queue after the skb_queue_purge() call.
In __skb_tstamp_tx(), the cloned skb should not have a reference
to the ubuf to remove the circular dependency, but skb_clone() does
not call skb_copy_ubufs() for zerocopy skb. So, we need to call
skb_orphan_frags_rx() for the cloned skb to call skb_copy_ubufs().
[0]:
BUG: memory leak
unreferenced object 0xffff88800c6d2d00 (size 1152):
comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 cd af e8 81 00 00 00 00 ................
02 00 07 40 00 00 00 00 00 00 00 00 00 00 00 00 ...@............
backtrace:
[<0000000055636812>] sk_prot_alloc+0x64/0x2a0 net/core/sock.c:2024
[<0000000054d77b7a>] sk_alloc+0x3b/0x800 net/core/sock.c:2083
[<0000000066f3c7e0>] inet_create net/ipv4/af_inet.c:319 [inline]
[<0000000066f3c7e0>] inet_create+0x31e/0xe40 net/ipv4/af_inet.c:245
[<000000009b83af97>] __sock_create+0x2ab/0x550 net/socket.c:1515
[<00000000b9b11231>] sock_create net/socket.c:1566 [inline]
[<00000000b9b11231>] __sys_socket_create net/socket.c:1603 [inline]
[<00000000b9b11231>] __sys_socket_create net/socket.c:1588 [inline]
[<00000000b9b11231>] __sys_socket+0x138/0x250 net/socket.c:1636
[<000000004fb45142>] __do_sys_socket net/socket.c:1649 [inline]
[<000000004fb45142>] __se_sys_socket net/socket.c:1647 [inline]
[<000000004fb45142>] __x64_sys_socket+0x73/0xb0 net/socket.c:1647
[<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
[<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
BUG: memory leak
unreferenced object 0xffff888017633a00 (size 240):
comm "syz-executor392", pid 264, jiffies 4294785440 (age 13.044s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 2d 6d 0c 80 88 ff ff .........-m.....
backtrace:
[<000000002b1c4368>] __alloc_skb+0x229/0x320 net/core/skbuff.c:497
[<00000000143579a6>] alloc_skb include/linux/skbuff.h:1265 [inline]
[<00000000143579a6>] sock_omalloc+0xaa/0x190 net/core/sock.c:2596
[<00000000be626478>] msg_zerocopy_alloc net/core/skbuff.c:1294 [inline]
[<00000000be626478>] msg_zerocopy_realloc+0x1ce/0x7f0 net/core/skbuff.c:1370
[<00000000cbfc9870>] __ip_append_data+0x2adf/0x3b30 net/ipv4/ip_output.c:1037
[<0000000089869146>] ip_make_skb+0x26c/0x2e0 net/ipv4/ip_output.c:1652
[<00000000098015c2>] udp_sendmsg+0x1bac/0x2390 net/ipv4/udp.c:1253
[<0000000045e0e95e>] inet_sendmsg+0x10a/0x150 net/ipv4/af_inet.c:819
[<000000008d31bfde>] sock_sendmsg_nosec net/socket.c:714 [inline]
[<000000008d31bfde>] sock_sendmsg+0x141/0x190 net/socket.c:734
[<0000000021e21aa4>] __sys_sendto+0x243/0x360 net/socket.c:2117
[<00000000ac0af00c>] __do_sys_sendto net/socket.c:2129 [inline]
[<00000000ac0af00c>] __se_sys_sendto net/socket.c:2125 [inline]
[<00000000ac0af00c>] __x64_sys_sendto+0xe1/0x1c0 net/socket.c:2125
[<0000000066999e0e>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<0000000066999e0e>] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
[<0000000017f238c1>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fixes: f214f915e7 ("tcp: enable MSG_ZEROCOPY")
Fixes: b5947e5d1e ("udp: msg_zerocopy")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After failing to verify configuration, it returns directly without
releasing link, which may cause memory leak.
Paolo Abeni thinks that the whole code of this driver is quite
"suboptimal" and looks unmainatained since at least ~15y, so he
suggests that we could simply remove the whole driver, please
take it into consideration.
Simon Horman suggests that the fix label should be set to
"Linux-2.6.12-rc2" considering that the problem has existed
since the driver was introduced and the commit above doesn't
seem to exist in net/net-next.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Gan Gecen <gangecen@hust.edu.cn>
Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the action of an xdp program was XDP_TX, lan966x was creating
a xdp_frame and use this one to send the frame back. But it is also
possible to send back the frame without needing a xdp_frame, because
it is possible to send it back using the page.
And then once the frame is transmitted is possible to use directly
page_pool_recycle_direct as lan966x is using page pools.
This would save some CPU usage on this path, which results in higher
number of transmitted frames. Bellow are the statistics:
Frame size: Improvement:
64 ~8%
256 ~11%
512 ~8%
1000 ~0%
1500 ~0%
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://lore.kernel.org/r/20230422142344.3630602-1-horatiu.vultur@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmRHI44ACgkQ6rmadz2v
bTonHxAAjERX8x00+KXAzM0gEpxunjpeL9l/4Zc8zMutKuDW5X2cZ9q0enlvWjuf
+bEMAsPorVd4VSHKO5450ldp/a3HN9Y6PUkfuc6rQWITDXpt/MSgkmqUB0VtG17J
Tq8M2mv3FrHNhGjf+UOQ9dZABbaQYtGiM75jZ3+Aj3hblfa04CxLDfbWxpCS3akP
JYEwUbT+8v+HA+sb27wHmOeXABtjBfjiZh4z9fEoyd62c9H71feYOMIZ8TDsyb6N
kMZXxWWBHiQUGx2RBVUYjWkpkVaaK0UGHaNPLApdwUfO7wEF3JqBPl+Ejde+YJRq
ixEN6gVCwt7JW2joBrP408wj1autEw0OmJ2buCFaK4O46/TvSTbu5IRWAIx6k/zc
xVrrXZRApIflfpd/EIRlBGPIYjoF8xbVS5Jvp/dXUXqgOsp+OAdRHBJZN+gLQ050
QjuhepVdAj7IgtDLF84S7nvwgmorhg+81u7ZCgBaYuSjLm+pOCzs1KhjdSokkDIS
44z37YQKr4+5x2AHs681wSBLwyS3OreO8N77Qs3nKV61Vq2R5YxRfvUpppXY49um
mD/vimRg4oQStQSFKQ9eX4ZGl5D+lrhpPumBVj0rGDZx+vgSurC+rO/G4CAHlFrZ
mLpOXI6Q8X1aB2hTot5ViVKV+wADVvXhry0TP+3++8O1EKxoTj0=
=cMwX
-----END PGP SIGNATURE-----
Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:
====================
pull-request: bpf-next 2023-04-24
We've added 5 non-merge commits during the last 3 day(s) which contain
a total of 7 files changed, 87 insertions(+), 44 deletions(-).
The main changes are:
1) Workaround for bpf iter selftest due to lack of subprog support
in precision tracking, from Andrii.
2) Disable bpf_refcount_acquire kfunc until races are fixed, from Dave.
3) One more test_verifier test converted from asm macro to asm in C,
from Eduard.
4) Fix build with NETFILTER=y INET=n config, from Florian.
5) Add __rcu_read_{lock,unlock} into deny list, from Yafang.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
selftests/bpf: avoid mark_all_scalars_precise() trigger in one of iter tests
bpf: Add __rcu_read_{lock,unlock} into btf id deny list
bpf: Disable bpf_refcount_acquire kfunc calls until race conditions are fixed
selftests/bpf: verifier/prevent_map_lookup converted to inline assembly
bpf: fix link failure with NETFILTER=y INET=n
====================
Link: https://lore.kernel.org/r/20230425005648.86714-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Gerhard Engleder says:
====================
tsnep: XDP socket zero-copy support
Implement XDP socket zero-copy support for tsnep driver. I tried to
follow existing drivers like igc as far as possible. But one main
difference is that tsnep does not need any reconfiguration for XDP BPF
program setup. So I decided to keep this behavior no matter if a XSK
pool is used or not. As a result, tsnep starts using the XSK pool even
if no XDP BPF program is available.
Another difference is that I tried to prevent potentially failing
allocations during XSK pool setup. E.g. both memory models for page pool
and XSK pool are registered all the time. Thus, XSK pool setup cannot
end up with not working queues.
Some prework is done to reduce the last two XSK commits to actual XSK
changes.
====================
Link: https://lore.kernel.org/r/20230421194656.48063-1-gerhard@engleder-embedded.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Send and complete XSK pool frames within TX NAPI context. NAPI context
is triggered by ndo_xsk_wakeup.
Test results with A53 1.2GHz:
xdpsock txonly copy mode, 64 byte frames:
pps pkts 1.00
tx 284,409 11,398,144
Two CPUs with 100% and 10% utilization.
xdpsock txonly zero-copy mode, 64 byte frames:
pps pkts 1.00
tx 511,929 5,890,368
Two CPUs with 100% and 1% utilization.
xdpsock l2fwd copy mode, 64 byte frames:
pps pkts 1.00
rx 248,985 7,315,885
tx 248,921 7,315,885
Two CPUs with 100% and 10% utilization.
xdpsock l2fwd zero-copy mode, 64 byte frames:
pps pkts 1.00
rx 254,735 3,039,456
tx 254,735 3,039,456
Two CPUs with 100% and 4% utilization.
Packet rate increases and CPU utilization is reduced in both cases.
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add support for XSK zero-copy to RX path. The setup of the XSK pool can
be done at runtime. If the netdev is running, then the queue must be
disabled and enabled during reconfiguration. This can be done easily
with functions introduced in previous commits.
A more important property is that, if the netdev is running, then the
setup of the XSK pool shall not stop the netdev in case of errors. A
broken netdev after a failed XSK pool setup is bad behavior. Therefore,
the allocation and setup of resources during XSK pool setup is done only
before any queue is disabled. Additionally, freeing and later allocation
of resources is eliminated in some cases. Page pool entries are kept for
later use. Two memory models are registered in parallel. As a result,
the XSK pool setup cannot fail during queue reconfiguration.
In contrast to other drivers, XSK pool setup and XDP BPF program setup
are separate actions. XSK pool setup can be done without any XDP BPF
program. The XDP BPF program can be added, removed or changed without
any reconfiguration of the XSK pool.
Test results with A53 1.2GHz:
xdpsock rxdrop copy mode, 64 byte frames:
pps pkts 1.00
rx 856,054 10,625,775
Two CPUs with both 100% utilization.
xdpsock rxdrop zero-copy mode, 64 byte frames:
pps pkts 1.00
rx 889,388 4,615,284
Two CPUs with 100% and 20% utilization.
Packet rate increases and CPU utilization is reduced.
100% CPU load seems to the base load. This load is consumed by ksoftirqd
just for dropping the generated packets without xdpsock running.
Using batch API reduced CPU utilization slightly, but measurements are
not stable enough to provide meaningful numbers.
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The function tsnep_rx_poll() is already pretty long and the skb receive
action can be reused for XSK zero-copy support. Move page based skb
receive to separate function.
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Move queue enable and disable code to separate functions. This way the
activation and deactivation of the queues are defined actions, which can
be used in future execution paths.
This functions will be used for the queue reconfiguration at runtime,
which is necessary for XSK zero-copy support.
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Make initialization of TX and RX queues less dynamic by moving some
initialization from netdev open/close to device probing.
Additionally, move some initialization code to separate functions to
enable future use in other execution paths.
This is done as preparation for queue reconfigure at runtime, which is
necessary for XSK zero-copy support.
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
TX/RX ring size is static and power of 2 to enable compiler to optimize
modulo operation to mask operation. Make this optimization already in
the code and don't rely on the compiler.
CPU utilisation during high packet rate has not changed. So no
performance improvement has been measured. But it is best practice to
prevent modulo operations.
Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Up to 4 LEDs can be attached to the PHY, add support for setting
brightness manually.
Signed-off-by: Alexander Stein <alexander.stein@ew.tq-group.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20230424134625.303957-1-alexander.stein@ew.tq-group.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
'reg' is always encoded in 32 bits, thus it has to be read using the
function with the corresponding bit width.
Fixes: 01e5b728e9 ("net: phy: Add a binding for PHY LEDs")
Signed-off-by: Alexander Stein <alexander.stein@ew.tq-group.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20230424141648.317944-1-alexander.stein@ew.tq-group.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Smatch complains that:
nfcsim_debugfs_init_dev() warn: 'dev_dir' is an error pointer or valid
According to the documentation of the debugfs_create_dir() function,
there is no need to check the return value of this function.
Just delete the dead code.
Signed-off-by: Jianuo Kuang <u202110722@hust.edu.cn>
Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Link: https://lore.kernel.org/r/20230424024140.34607-1-u202110722@hust.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pointer variables of void * type do not require type cast.
Signed-off-by: wuych <yunchuan@nfschina.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20230424101550.664319-1-yunchuan@nfschina.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
SET_COALESCE may change operation mode and parameters in one call.
Changing operation mode may cause the driver to reset the parameter
values to what is a reasonable default for new operation mode.
Since driver does not know which parameters come from user and which
are echoed back from ->get, driver may ignore the parameters when
switching operation modes.
This used to be inevitable for ioctl() but in netlink we know which
parameters are actually specified by the user.
We could inform which parameters were set by the user but this would
lead to a lot of code duplication in the drivers. Instead try to call
the drivers twice if both mode and params are changed. The set method
already checks if any params need updating so in case the driver did
the right thing the first time around - there will be no second call
to it's ->set method (only an extra call to ->get()).
For mlx5 for example before this patch we'd see:
# ethtool -C eth0 adaptive-rx on adaptive-tx on
# ethtool -C eth0 adaptive-rx off adaptive-tx off \
tx-usecs 123 rx-usecs 123
Adaptive RX: off TX: off
rx-usecs: 3
rx-frames: 32
tx-usecs: 16
tx-frames: 32
[...]
After the change:
# ethtool -C eth0 adaptive-rx on adaptive-tx on
# ethtool -C eth0 adaptive-rx off adaptive-tx off \
tx-usecs 123 rx-usecs 123
Adaptive RX: off TX: off
rx-usecs: 123
rx-frames: 32
tx-usecs: 123
tx-frames: 32
[...]
This only works for netlink, so it's a small discrepancy between
netlink and ioctl(). Since we anticipate most users to move to
netlink I believe it's worth making their lives easier.
Link: https://lore.kernel.org/r/20230420233302.944382-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Haiyang Zhang says:
====================
Update coding style and check alloc_frag
Follow up patches for the jumbo frame support.
As suggested by Jakub Kicinski, update coding style, and check napi_alloc_frag
for possible fallback to single pages.
====================
Link: https://lore.kernel.org/r/1682096818-30056-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
netdev/napi_alloc_frag() may fall back to single page which is smaller
than the requested size.
Add error checking to avoid memory overwritten.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Rename mana_refill_rxoob for naming consistency.
And remove some empty lines between function call and error
checking.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lorenzo Bianconi says:
====================
add page_pool support for page recycling in veth driver
Introduce page_pool support in veth driver in order to recycle pages in
veth_convert_skb_to_xdp_buff routine and avoid reallocating the skb through
the page allocator when we run a xdp program on the device and we receive
skbs from the stack.
====================
Link: https://lore.kernel.org/r/cover.1682188837.git.lorenzo@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce page_pool stats support to report info about local page_pool
through ethtool
Tested-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Introduce page_pool support in veth driver in order to recycle pages
in veth_convert_skb_to_xdp_buff routine and avoid reallocating the skb
through the page allocator.
The patch has been tested sending tcp traffic to a veth pair where the
remote peer is running a simple xdp program just returning xdp_pass:
veth upstream codebase:
MTU 1500B: ~ 8Gbps
MTU 8000B: ~ 13.9Gbps
veth upstream codebase + pp support:
MTU 1500B: ~ 9.2Gbps
MTU 8000B: ~ 16.2Gbps
Tested-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Brad Spencer provided a detailed report [0] that when calling getsockopt()
for AF_NETLINK, some SOL_NETLINK options set only 1 byte even though such
options require at least sizeof(int) as length.
The options return a flag value that fits into 1 byte, but such behaviour
confuses users who do not initialise the variable before calling
getsockopt() and do not strictly check the returned value as char.
Currently, netlink_getsockopt() uses put_user() to copy data to optlen and
optval, but put_user() casts the data based on the pointer, char *optval.
As a result, only 1 byte is set to optval.
To avoid this behaviour, we need to use copy_to_user() or cast optval for
put_user().
Note that this changes the behaviour on big-endian systems, but we document
that the size of optval is int in the man page.
$ man 7 netlink
...
Socket options
To set or get a netlink socket option, call getsockopt(2) to read
or setsockopt(2) to write the option with the option level argument
set to SOL_NETLINK. Unless otherwise noted, optval is a pointer to
an int.
Fixes: 9a4595bc7e ("[NETLINK]: Add set/getsockopt options to support more than 32 groups")
Fixes: be0c22a46c ("netlink: add NETLINK_BROADCAST_ERROR socket option")
Fixes: 38938bfe34 ("netlink: add NETLINK_NO_ENOBUFS socket flag")
Fixes: 0a6a3a23ea ("netlink: add NETLINK_CAP_ACK socket option")
Fixes: 2d4bc93368 ("netlink: extended ACK reporting")
Fixes: 89d35528d1 ("netlink: Add new socket option to enable strict checking on dumps")
Reported-by: Brad Spencer <bspencer@blackberry.com>
Link: https://lore.kernel.org/netdev/ZD7VkNWFfp22kTDt@datsun.rim.net/
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Link: https://lore.kernel.org/r/20230421185255.94606-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
iter_pass_iter_ptr_to_subprog subtest is relying on actual array size
being passed as subprog parameter. This combined with recent fixes to
precision tracking in conditional jumps ([0]) is now causing verifier to
backtrack all the way to the point where sum() and fill() subprogs are
called, at which point precision backtrack bails out and forces all the
states to have precise SCALAR registers. This in turn causes each
possible value of i within fill() and sum() subprogs to cause
a different non-equivalent state, preventing iterator code to converge.
For now, change the test to assume fixed size of passed in array. Once
BPF verifier supports precision tracking across subprogram calls, these
changes will be reverted as unnecessary.
[0] 71b547f561 ("bpf: Fix incorrect verifier pruning due to missing register precision taints")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230424235128.1941726-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEN9lkrMBJgcdVAPub1V2XiooUIOQFAmRDIGwACgkQ1V2XiooU
IOTR2xAAgV2g59s0pPO0VnAs9WTKiMgm+9JZz2uLwOT1/nvVzdz8P7wXTJK47m57
qziX//P0S37D6e5UkdSHpGBSfeIAMwciDd5WVJgsZYhpZ/QMS3y4GEUXVybMhV+p
oqX86Kc1u91+kLbgFF1CdbMDvw+oNW43QrIYR1Ok2MqlnZ3dlDajkw5KLYWFvb6t
9BCmkqZ1xheuN1l8CPZ0La3a9fwqTuC0P7TgSlO1BZF5nhADajcGt8lF8Td98zEZ
Mrb4VyOrhT6UZgGd+UW1ZlNyduO2Cb6Tj7jx9Es5gX8fDjmnZyAJf1qW9Iinoo46
4Yr+61oSNNGdE63m1lLm2ZQQJGDlUsuGCjUMW4hyVe8YIrqQg+cOXaTkCa40Xvos
qvObjrsnAd9xHtBchk4IaQ84J56ifalXULAJs8IZGM5K2XUNwnF7kIOaVtV8Rq+K
BSSprrM9Z7Lz5W5ucRLlBmDYSDd+ESUnhgJgIuf1CZ04mRBupF4IkqCU5INmVhJR
472cSc9DKPHmsnFFdszidxBwM6mg00qQ9M4Qvl+dQIOs0a28Pr8nHPOSStgMaecd
NAPGGEMdRLNaH5KYN1VbseUbbFhXQpXcuNCF1q8fdzpQYyo/+I/lu618sFEhy08r
KN6+JIDdrcAdMyYgL3rQy57u58xYlwU2NLIE0SLHLqI4grtTxnw=
=T5eS
-----END PGP SIGNATURE-----
Merge tag 'nf-next-23-04-22' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
1) Reduce jumpstack footprint: Stash chain in last rule marker in blob for
tracing. Remove last rule and chain from jumpstack. From Florian Westphal.
2) nf_tables validates all tables before committing the new rules.
Unfortunately, this has two drawbacks:
- Since addition of the transaction mutex pernet state gets written to
outside of the locked section from the cleanup callback, this is
wrong so do this cleanup directly after table has passed all checks.
- Revalidate tables that saw no changes. This can be avoided by
keeping the validation state per table, not per netns.
From Florian Westphal.
3) Get rid of a few redundant pointers in the traceinfo structure.
The three removed pointers are used in the expression evaluation loop,
so gcc keeps them in registers. Passing them to the (inlined) helpers
thus doesn't increase nft_do_chain text size, while stack is reduced
by another 24 bytes on 64bit arches. From Florian Westphal.
4) IPVS cleanups in several ways without implementing any functional
changes, aside from removing some debugging output:
- Update width of source for ip_vs_sync_conn_options
The operation is safe, use an annotation to describe it properly.
- Consistently use array_size() in ip_vs_conn_init()
It seems better to use helpers consistently.
- Remove {Enter,Leave}Function. These seem to be well past their
use-by date.
- Correct spelling in comments.
From Simon Horman.
5) Extended netlink error report for netdevice in flowtables and
netdev/chains. Allow for incrementally add/delete devices to netdev
basechain. Allow to create netdev chain without device.
* tag 'nf-next-23-04-22' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_tables: allow to create netdev chain without device
netfilter: nf_tables: support for deleting devices in an existing netdev chain
netfilter: nf_tables: support for adding new devices to an existing netdev chain
netfilter: nf_tables: rename function to destroy hook list
netfilter: nf_tables: do not send complete notification of deletions
netfilter: nf_tables: extended netlink error reporting for netdevice
ipvs: Correct spelling in comments
ipvs: Remove {Enter,Leave}Function
ipvs: Consistently use array_size() in ip_vs_conn_init()
ipvs: Update width of source for ip_vs_sync_conn_options
netfilter: nf_tables: do not store rule in traceinfo structure
netfilter: nf_tables: do not store verdict in traceinfo structure
netfilter: nf_tables: do not store pktinfo in traceinfo structure
netfilter: nf_tables: remove unneeded conditional
netfilter: nf_tables: make validation state per table
netfilter: nf_tables: don't write table validation state without mutex
netfilter: nf_tables: don't store chain address on jump
netfilter: nf_tables: don't store address of last rule on jump
netfilter: nf_tables: merge nft_rules_old structure and end of ruleblob marker
====================
Link: https://lore.kernel.org/r/20230421235021.216950-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The tracing recursion prevention mechanism must be protected by rcu, that
leaves __rcu_read_{lock,unlock} unprotected by this mechanism. If we trace
them, the recursion will happen. Let's add them into the btf id deny list.
When CONFIG_PREEMPT_RCU is enabled, it can be reproduced with a simple bpf
program as such:
SEC("fentry/__rcu_read_lock")
int fentry_run()
{
return 0;
}
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230424161104.3737-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
As reported by Kumar in [0], the shared ownership implementation for BPF
programs has some race conditions which need to be addressed before it
can safely be used. This patch does so in a minimal way instead of
ripping out shared ownership entirely, as proper fixes for the issues
raised will follow ASAP, at which point this patch's commit can be
reverted to re-enable shared ownership.
The patch removes the ability to call bpf_refcount_acquire_impl from BPF
programs. Programs can only bump refcount and obtain a new owning
reference using this kfunc, so removing the ability to call it
effectively disables shared ownership.
Instead of changing success / failure expectations for
bpf_refcount-related selftests, this patch just disables them from
running for now.
[0]: https://lore.kernel.org/bpf/d7hyspcow5wtjcmw4fugdgyp3fwhljwuscp3xyut5qnwivyeru@ysdq543otzv2/
Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230424204321.2680232-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
- Introduce devcoredump support
- Add support for Realtek RTL8821CS, RTL8851B, RTL8852BS
- Add support for Mediatek MT7663, MT7922
- Add support for NXP w8997
- Add support for Actions Semi ATS2851
- Add support for QTI WCN6855
- Add support for Marvell 88W8997
-----BEGIN PGP SIGNATURE-----
iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmRGEjgZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKaMPEACVgwFPOwbSUiv+tZrLKjo8
VzsuRYDXba9xV88L1gMWMer25wrVKS0+LfGbR0qXdU3zMWhmGInPVWGeGgpuJtjp
bu6jJeQyhM//BXcVNZ90FrPBG8+IyCLdPRwh/+c3qiXFvLSnDq4fhcJLtFxY2X8f
EhEIUEY+WFjvSqjGirOucsixwm4v9nZg98hD4hFv80iFM6eWiCh12zOj8qPPcBGA
SHgMrrk/7NQC5gJv+VCvZXx+I49kq1YpVcTXBAK7zpcdtnbHzpkWJGuSAr6JHqx8
zblr/1kcGGN5tEuhMyiM81DuVTaYfeOy1M1GUgb3q/lvSQWuYiRNQhe5OqZ11ady
txefiQlfmCsLGoQ6DtSh4rr9ygvjzrZ5VKquWmy037TZ7+a5fF9iStsngMZhnX+W
tcvbW0T/D02aR1uLVJaXy6tD/K6wqsm/AIJ5jYY1jaC9iQgJLAHI1qOV2b96Q28v
eppr9fYwX56JVsVZHnbFlk1Gf5zZffZn5RCq6iagZKFo3zdehgLEiyD4FQFK94qn
b3+r8E9h6r9FPJ7rip5Vaxe7lyr9WAf3HcQMjfd41zqgB4RSE2z4jVRPh98RKzDa
hFH0KhVZZVqPSu3CPZ7JWXBGWYD7/AiFLD6UrUy+9bbe+6nmYKYu6NSw4LGe6962
1Uuab5QdUzSjD8db+945Xg==
=DOQV
-----END PGP SIGNATURE-----
Merge tag 'for-net-next-2023-04-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
bluetooth-next pull request for net-next:
- Introduce devcoredump support
- Add support for Realtek RTL8821CS, RTL8851B, RTL8852BS
- Add support for Mediatek MT7663, MT7922
- Add support for NXP w8997
- Add support for Actions Semi ATS2851
- Add support for QTI WCN6855
- Add support for Marvell 88W8997
I am not currently maintaining the kernel PPP code, so remove my
address from the MAINTAINERS entry for it.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes sure hci_cmd_sync_queue only queue new work if HCI_RUNNING
has been set otherwise there is a risk of commands being sent while
turning off.
Because hci_cmd_sync_queue can no longer queue work while HCI_RUNNING is
not set it cannot be used to power on adapters so instead
hci_cmd_sync_submit is introduced which bypass the HCI_RUNNING check, so
it behaves like the old implementation.
Link: https://lore.kernel.org/all/CAB4PzUpDMvdc8j2MdeSAy1KkAE-D3woprCwAdYWeOc-3v3c9Sw@mail.gmail.com/
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
WCN6855 will report memdump via ACL data or HCI event when
it get crashed, so we collect memdump to debug firmware.
Signed-off-by: Tim Jiang <quic_tjiang@quicinc.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This enables flow control before checking for bootloader signature and
deciding whether FW download is needed or not. In case of V1 bootloader
chips w8987 and w8997, it is observed that if WLAN FW is downloaded first
and power save is enabled in wlan core, bootloader signatures are not
emitted by the BT core when the chip is put to sleep. As a result, the
driver skips FW download and subsequent HCI commands get timeout errors
in dmesg as shown below:
[ 112.898867] Bluetooth: hci0: Opcode 0x c03 failed: -110
[ 114.914865] Bluetooth: hci0: Setting baudrate failed (-110)
[ 116.930856] Bluetooth: hci0: Setting wake-up method failed (-110)
By enabling the flow control, the host enables its RTS pin, and an
interrupt in chip's UART peripheral causes the bootloader to wake up,
enabling the bootloader signatures, which then helps in downloading
the bluetooth FW file.
This changes all instances of 0/1 for serdev_device_set_flow_control()
to false/true.
Signed-off-by: Neeraj Sanjay Kale <neeraj.sanjaykale@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Some of the sync commands might take a long time to complete, e.g.
LE Create Connection when the peer device isn't responding might take
20 seconds before it times out. If suspend command is issued during
this time, it will need to wait for completion since both commands are
using the same sync lock.
This patch cancel any running sync commands before attempting to
suspend or adapter power off.
Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Reviewed-by: Ying Hsu <yinghsu@chromium.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Realtek changed the format of the firmware file as v2. The driver
should implement the patch to extract the firmware data from the
firmware file. The future chips must apply this patch for firmware loading.
This patch is compatible with the both previous format and v2 as well.
Signed-off-by: Allen Chen <allen_chen@realsil.com.cn>
Signed-off-by: Alex Lu <alex_lu@realsil.com.cn>
Tested-by: Hilda Wu <hildawu@realtek.com>
Signed-off-by: Max Chou <max.chou@realtek.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
API hci_devcd_init() stores its u32 type parameter @dump_size into
skb, but it does not specify which byte order is used to store the
integer, let us take little endian to store and parse the integer.
Fixes: f5cc609d09d4 ("Bluetooth: Add support for hci devcoredump")
Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Previously, capability was checked using capable(), which verified that the
caller of the ioctl system call had the required capability. In addition,
the result of the check would be stored in the HCI_SOCK_TRUSTED flag,
making it persistent for the socket.
However, malicious programs can abuse this approach by deliberately sharing
an HCI socket with a privileged task. The HCI socket will be marked as
trusted when the privileged task occasionally makes an ioctl call.
This problem can be solved by using sk_capable() to check capability, which
ensures that not only the current task but also the socket opener has the
specified capability, thus reducing the risk of privilege escalation
through the previously identified vulnerability.
Cc: stable@vger.kernel.org
Fixes: f81f5b2db8 ("Bluetooth: Send control open and close messages for HCI raw sockets")
Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>