Use the correct print format. Printing an unsigned int value should use %u
instead of %d. For details, please read document:
Documentation/core-api/printk-formats.rst
Signed-off-by: Kai Ye <yekai13@huawei.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
all have been moved to generic_net infra. On x86_64, this reduces
struct net size from 70 to 63 cache lines (4480 to 4032 byte).
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
dwork struct is large (>128 byte) and not needed when conntrack module
is not loaded.
Place it in net_generic data instead. The struct net dwork member is now
obsolete and will be removed in a followup patch.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
No need to keep this in struct net, place it in the net_generic data.
The sysctl pointer is removed from struct net in a followup patch.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This moves all nf_tables pernet data from struct net to a net_generic
extension, with the exception of the gencursor.
The latter is used in the data path and also outside of the nf_tables
core. All others are only used from the configuration plane.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ebtables currently uses net->xt.tables[BRIDGE], but upcoming
patch will move net->xt.tables away from struct net.
To avoid exposing x_tables internals to ebtables, use a private list
instead.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This allows followup patch to remove the defrag_ipv4 member from struct
net. It also allows to auto-remove the hooks later on by adding a
_disable() function. This will be done later in a follow patch series.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This allows followup patch to remove these members from struct net.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
reduce size of struct net and make this self-contained.
The member in struct net is kept to minimize changes to struct net
layout, it will be removed in a separate patch.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
No need to place it in struct net, nfnetlink is a module and usage
doesn't occur in fastpath.
Also remove rcu usage:
Not a single reader of net->nfnl uses rcu accessors.
When exit_batch callbacks are executed the net namespace is already dead
so no calls to these functions are possible anymore (else we'd get NULL
deref crash too).
If the module is removed, then modules that call any of those functions
have been removed too so no calls to nfnl functions are possible either.
The nfnl and nfl_stash pointers in struct net are no longer used, they
will be removed in a followup patch to minimize changes to struct net
(causes rebuild for entire network stack).
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This removes the only reference of net->nfnl outside of the nfnetlink
module. This allows to move net->nfnl to net_generic infra.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
KMSAN found uninitialized value at batadv_tt_prepare_tvlv_local_data()
[1], for commit ced72933a5 ("batman-adv: use CRC32C instead of CRC16
in TT code") inserted 'reserved' field into "struct batadv_tvlv_tt_data"
and commit 7ea7b4a142 ("batman-adv: make the TT CRC logic VLAN
specific") moved that field to "struct batadv_tvlv_tt_vlan_data" but left
that field uninitialized.
[1] https://syzkaller.appspot.com/bug?id=07f3e6dba96f0eb3cabab986adcd8a58b9bdbe9d
Reported-by: syzbot <syzbot+50ee810676e6a089487b@syzkaller.appspotmail.com>
Tested-by: syzbot <syzbot+50ee810676e6a089487b@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: ced72933a5 ("batman-adv: use CRC32C instead of CRC16 in TT code")
Fixes: 7ea7b4a142 ("batman-adv: make the TT CRC logic VLAN specific")
Acked-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The driver can match via multiple methods. Its acpi_device_id table is
referenced via ACPI_PTR() so it will be unused for !CONFIG_ACPI builds:
drivers/net/ethernet/smsc/smsc911x.c:2652:36: warning:
‘smsc911x_acpi_match’ defined but not used [-Wunused-const-variable=]
Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Limiting the scope of the variable vector_ring_chain to the block where it
is used.
Fixes: 424eb834a9 ("net: hns3: Unified HNS3 {VF|PF} Ethernet Driver for hip08 SoC")
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Salil Mehta says:
====================
Misc. fixes for hns3 driver
Fixes for the miscellaneous problems found during the review of the code.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Code to defer the reset(which caps the frequency of the reset) schedules the
timer and returns. Hence, following 'else-if' looks un-necessary.
Fixes: 9de0b86f64 ("net: hns3: Prevent to request reset frequently")
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes the left over check and assignment which is no longer used
anywhere in the function and should have been removed as part of the
below mentioned patch.
Fixes: 012fcb52f6 ("net: hns3: activate reset timer when calling reset_event")
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, we can specify ifindex on link creation. This change allows
to specify ifindex when a device is moved to another network namespace.
Even now, a device ifindex can be changed if there is another device
with the same ifindex in the target namespace. So this change doesn't
introduce completely new behavior, it adds more control to the process.
CRIU users want to restore containers with pre-created network devices.
A user will provide network devices and instructions where they have to
be restored, then CRIU will restore network namespaces and move devices
into them. The problem is that devices have to be restored with the same
indexes that they have before C/R.
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These patches fix a series of spelling errors in net/nfc module.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Found by virtue of ipv6 raw sockets not honouring the per-socket
IP{,V6}_FREEBIND setting.
Based on hits found via:
git grep '[.]ip_nonlocal_bind'
We fix both raw ipv6 sockets to honour IP{,V6}_FREEBIND and IP{,V6}_TRANSPARENT,
and we fix sctp sockets to honour IP{,V6}_TRANSPARENT (they already honoured
FREEBIND), and not just the ipv6 'ip_nonlocal_bind' sysctl.
The helper is defined as:
static inline bool ipv6_can_nonlocal_bind(struct net *net, struct inet_sock *inet) {
return net->ipv6.sysctl.ip_nonlocal_bind || inet->freebind || inet->transparent;
}
so this change only widens the accepted opt-outs and is thus a clean bugfix.
I'm not entirely sure what 'fixes' tag to add, since this is AFAICT an ancient bug,
but IMHO this should be applied to stable kernels as far back as possible.
As such I'm adding a 'fixes' tag with the commit that originally added the helper,
which happened in 4.19. Backporting to older LTS kernels (at least 4.9 and 4.14)
would presumably require open-coding it or backporting the helper as well.
Other possibly relevant commits:
v4.18-rc6-1502-g83ba4645152d net: add helpers checking if socket can be bound to nonlocal address
v4.18-rc6-1431-gd0c1f01138c4 net/ipv6: allow any source address for sendmsg pktinfo with ip_nonlocal_bind
v4.14-rc5-271-gb71d21c274ef sctp: full support for ipv6 ip_nonlocal_bind & IP_FREEBIND
v4.7-rc7-1883-g9b9742022888 sctp: support ipv6 nonlocal bind
v4.1-12247-g35a256fee52c ipv6: Nonlocal bind
Cc: Lorenzo Colitti <lorenzo@google.com>
Fixes: 83ba464515 ("net: add helpers checking if socket can be bound to nonlocal address")
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Reviewed-By: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
'struct ovs_zone_limit' has more members than initialized in
ovs_ct_limit_get_default_limit(). The rest of the memory is a random
kernel stack content that ends up being sent to userspace.
Fix that by using designated initializer that will clear all
non-specified fields.
Fixes: 11efd5cb04 ("openvswitch: Support conntrack zone limit")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix kernel-doc warning:
Documentation/networking/tipc:66: /home/sfr/next/next/net/tipc/name_table.c
:558: WARNING: Unexpected indentation.
Documentation/networking/tipc:66: /home/sfr/next/next/net/tipc/name_table.c
:559: WARNING: Block quote ends without a blank line; unexpected unindent.
Due to blank line missing.
Fixes: 908148bc50 ("tipc: refactor tipc_sendmsg() and tipc_lookup_anycast()")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: https://lore.kernel.org/netdev/20210318172255.63185609@canb.auug.org.au/
Signed-off-by: Wu XiangCheng <bobwxc@email.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
struct ip6_sf_socklist and ipv6_mc_socklist are per-socket MLD data.
These data are protected by rtnl lock, socket lock, and RCU.
So, when these are used, it verifies whether rtnl lock is acquired or not.
ip6_mc_msfget() is called by do_ipv6_getsockopt().
But caller doesn't acquire rtnl lock.
So, when these data are used in the ip6_mc_msfget() lockdep warns about it.
But accessing these is actually safe because socket lock was acquired by
do_ipv6_getsockopt().
So, it changes lockdep annotation from rtnl lock to socket lock.
(rtnl_dereference -> sock_dereference)
Locking graph for mld data is like below:
When writing mld data:
do_ipv6_setsockopt()
rtnl_lock
lock_sock
(mld functions)
idev->mc_lock(if per-interface mld data is modified)
When reading mld data:
do_ipv6_getsockopt()
lock_sock
ip6_mc_msfget()
Splat looks like:
=============================
WARNING: suspicious RCU usage
5.12.0-rc4+ #503 Not tainted
-----------------------------
net/ipv6/mcast.c:610 suspicious rcu_dereference_protected() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by mcast-listener-/923:
#0: ffff888007958a70 (sk_lock-AF_INET6){+.+.}-{0:0}, at:
ipv6_get_msfilter+0xaf/0x190
stack backtrace:
CPU: 1 PID: 923 Comm: mcast-listener- Not tainted 5.12.0-rc4+ #503
Call Trace:
dump_stack+0xa4/0xe5
ip6_mc_msfget+0x553/0x6c0
? ipv6_sock_mc_join_ssm+0x10/0x10
? lockdep_hardirqs_on_prepare+0x3e0/0x3e0
? mark_held_locks+0xb7/0x120
? lockdep_hardirqs_on_prepare+0x27c/0x3e0
? __local_bh_enable_ip+0xa5/0xf0
? lock_sock_nested+0x82/0xf0
ipv6_get_msfilter+0xc3/0x190
? compat_ipv6_get_msfilter+0x300/0x300
? lock_downgrade+0x690/0x690
do_ipv6_getsockopt.isra.6.constprop.13+0x1809/0x29e0
? do_ipv6_mcast_group_source+0x150/0x150
? register_lock_class+0x1750/0x1750
? kvm_sched_clock_read+0x14/0x30
? sched_clock+0x5/0x10
? sched_clock_cpu+0x18/0x170
? find_held_lock+0x3a/0x1c0
? lock_downgrade+0x690/0x690
? ipv6_getsockopt+0xdb/0x1b0
ipv6_getsockopt+0xdb/0x1b0
[ ... ]
Fixes: 88e2ca3080 ("mld: convert ifmcaddr6 to RCU")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some lines of code can be merged into an equivalent 'skb_add_rx_frag()'
call which is less verbose.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Manish Chopra <manishc@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This ++ is confusing. It looks duplicated with the one already performed in
'skb_fill_page_desc()'.
In fact, it is harmless. 'nr_frags' is written twice with the same value.
Once, because of the nr_frags++, and once because of the 'nr_frags = i + 1'
in 'skb_fill_page_desc()'.
So axe this post-increment to avoid confusion.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Manish Chopra <manishc@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some lines of code can be merged into an equivalent 'skb_add_rx_frag()'
call which is less verbose.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
'page_address(skb_frag_page()) + skb_frag_off()' can be replaced by an
equivalent 'skb_frag_address()' which is less verbose.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to use 'list_for_each_entry_safe' here, as nothing is
removed from the list in the 'for' loop.
Use 'list_for_each_entry' instead, it is slightly less verbose.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
X.25 Layer 3 (the Packet Layer) expects layer 2 to provide a reliable
datalink service such that no packets are reordered or dropped. And
X.25 Layer 2 (the LAPB layer) is indeed designed to provide such service.
However, this reliability is not preserved when a driver calls "netif_rx"
to deliver the received packets to layer 3, because "netif_rx" will put
the packets into per-CPU queues before they are delivered to layer 3.
If there are multiple CPUs, the order of the packets may not be preserved.
The per-CPU queues may also drop packets if there are too many.
Therefore, we should not call "netif_rx" to let it queue the packets.
Instead, we should use our own queue that won't reorder or drop packets.
This patch changes all X.25 drivers to use their own queues instead of
calling "netif_rx". The patch also documents this requirement in the
"x25-iface" documentation.
Cc: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull workqueue fixes from Tejun Heo:
"Two workqueue fixes.
One is around debugobj and poses no risk. The other is to prevent the
stall watchdog from firing spuriously in certain conditions. Not as
trivial as debugobj change but is still fairly low risk"
* 'for-5.12-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue/watchdog: Make unbound workqueues aware of touch_softlockup_watchdog() 84;0;0c84;0;0c There are two workqueue-specific watchdog timestamps:
workqueue: Move the position of debug_work_activate() in __queue_work()
For each device, the nosy driver allocates a pcilynx structure.
A use-after-free might happen in the following scenario:
1. Open nosy device for the first time and call ioctl with command
NOSY_IOC_START, then a new client A will be malloced and added to
doubly linked list.
2. Open nosy device for the second time and call ioctl with command
NOSY_IOC_START, then a new client B will be malloced and added to
doubly linked list.
3. Call ioctl with command NOSY_IOC_START for client A, then client A
will be readded to the doubly linked list. Now the doubly linked
list is messed up.
4. Close the first nosy device and nosy_release will be called. In
nosy_release, client A will be unlinked and freed.
5. Close the second nosy device, and client A will be referenced,
resulting in UAF.
The root cause of this bug is that the element in the doubly linked list
is reentered into the list.
Fix this bug by adding a check before inserting a client. If a client
is already in the linked list, don't insert it.
The following KASAN report reveals it:
BUG: KASAN: use-after-free in nosy_release+0x1ea/0x210
Write of size 8 at addr ffff888102ad7360 by task poc
CPU: 3 PID: 337 Comm: poc Not tainted 5.12.0-rc5+ #6
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
Call Trace:
nosy_release+0x1ea/0x210
__fput+0x1e2/0x840
task_work_run+0xe8/0x180
exit_to_user_mode_prepare+0x114/0x120
syscall_exit_to_user_mode+0x1d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
Allocated by task 337:
nosy_open+0x154/0x4d0
misc_open+0x2ec/0x410
chrdev_open+0x20d/0x5a0
do_dentry_open+0x40f/0xe80
path_openat+0x1cf9/0x37b0
do_filp_open+0x16d/0x390
do_sys_openat2+0x11d/0x360
__x64_sys_open+0xfd/0x1a0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
Freed by task 337:
kfree+0x8f/0x210
nosy_release+0x158/0x210
__fput+0x1e2/0x840
task_work_run+0xe8/0x180
exit_to_user_mode_prepare+0x114/0x120
syscall_exit_to_user_mode+0x1d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff888102ad7300 which belongs to the cache kmalloc-128 of size 128
The buggy address is located 96 bytes inside of 128-byte region [ffff888102ad7300, ffff888102ad7380)
[ Modified to use 'list_empty()' inside proper lock - Linus ]
Link: https://lore.kernel.org/lkml/1617433116-5930-1-git-send-email-zheyuma97@gmail.com/
Reported-and-tested-by: 马哲宇 (Zheyu Ma) <zheyuma97@gmail.com>
Signed-off-by: Zheyu Ma <zheyuma97@gmail.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Stefan Richter <stefanr@s5r6.in-berlin.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
84;0;0c84;0;0c
There are two workqueue-specific watchdog timestamps:
+ @wq_watchdog_touched_cpu (per-CPU) updated by
touch_softlockup_watchdog()
+ @wq_watchdog_touched (global) updated by
touch_all_softlockup_watchdogs()
watchdog_timer_fn() checks only the global @wq_watchdog_touched for
unbound workqueues. As a result, unbound workqueues are not aware
of touch_softlockup_watchdog(). The watchdog might report a stall
even when the unbound workqueues are blocked by a known slow code.
Solution:
touch_softlockup_watchdog() must touch also the global @wq_watchdog_touched
timestamp.
The global timestamp can no longer be used for bound workqueues because
it is now updated from all CPUs. Instead, bound workqueues have to check
only @wq_watchdog_touched_cpu and these timestamps have to be updated for
all CPUs in touch_all_softlockup_watchdogs().
Beware:
The change might cause the opposite problem. An unbound workqueue
might get blocked on CPU A because of a real softlockup. The workqueue
watchdog would miss it when the timestamp got touched on CPU B.
It is acceptable because softlockups are detected by softlockup
watchdog. The workqueue watchdog is there to detect stalls where
a work never finishes, for example, because of dependencies of works
queued into the same workqueue.
V3:
- Modify the commit message clearly according to Petr's suggestion.
Signed-off-by: Wang Qing <wangqing@vivo.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The debug_work_activate() is called on the premise that
the work can be inserted, because if wq be in WQ_DRAINING
status, insert work may be failed.
Fixes: e41e704bc4 ("workqueue: improve destroy_workqueue() debuggability")
Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
'skb_push()'/'skb_postpush_rcsum()' can be replaced by an equivalent
'skb_push_rcsum()' which is less verbose.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
This series provides trivial updates and cleanup to mlx5 driver
1) Support for matching on ct_state inv and rel flag in connection tracking
2) Reject TC rules that redirect from a VF to itself
3) Parav provided some E-Switch cleanups that could be summarized to:
3.1) Packing and Reduce structure sizes
3.2) Dynamic allocation of rate limit tables and structures
4) Vu Makes the netdev arfs and vlan tables allocation dynamic.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmBnpQQACgkQSD+KveBX
+j48uAf/S6VrX6NPn0j1nG9LRPt679KYv5eync/PUPxAbHb9DLkNYYTjrbQYp77H
KSFjRr7r7CXW5Vl6ZeAEA9O2ZzzNtpuTIDrTOaLu3oM6FgGho5Y8PIzsq4zUOudd
y948kRosZedYIYCtgPFjHqhAOGkEZl7GX5oSSF2WTYtfPn8tgTU91FCYSTtEk+rM
UBm9VuuZnz5hBtesl5zjqym6l27U3xsJpxvpR465iRB0rrIz0gaoyFLI1NTouMVz
XYZZwsFtIrl4Ffnm18ayqO1zM7VgybpvPX2zrWddX8Eb7Cbr9r90PCK4qbtiUtnI
fELrif57Rh6TNIQxafEovM5UGaZ+eA==
=kCdd
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2021-04-02' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2021-04-02
This series provides trivial updates and cleanup to mlx5 driver
1) Support for matching on ct_state inv and rel flag in connection tracking
2) Reject TC rules that redirect from a VF to itself
3) Parav provided some E-Switch cleanups that could be summarized to:
3.1) Packing and Reduce structure sizes
3.2) Dynamic allocation of rate limit tables and structures
4) Vu Makes the netdev arfs and vlan tables allocation dynamic.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The number of MSI-X vectors is a PCI property visible through lspci. The
field is read-only and configured by the device. The mlx5 devices work in
a static or dynamic assignment mode.
Static assignment means that all newly created VFs have a preset number of
MSI-X vectors determined by device configuration parameters. This can
result in some VFs having too many or too few MSI-X vectors. Till now this
has been the only means of fine-tuning the MSI-X vector count and it was
acceptable for small numbers of VFs.
With dynamic assignment the inefficiency of having a fixed number of MSI-X
vectors can be avoided with each VF having exactly the required
vectors. Userspace will provide this information while provisioning the VF
for use, based on the intended use. For instance if being used with a VM,
the MSI-X vector count might be matched to the CPU count of the VM.
For compatibility mlx5 continues to start up with MSI-X vector assignment,
but the kernel can now access a larger dynamic vector pool and assign more
vectors to created VFs.
Link: https://lore.kernel.org/linux-pci/20210314124256.70253-4-leon@kernel.org
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
These new fields declare the number of MSI-X vectors that is possible to
allocate on the VF through PF configuration.
Value must be in range defined by min_dynamic_vf_msix_table_size and
max_dynamic_vf_msix_table_size.
The driver should continue to query its MSI-X table through PCI
configuration header.
Link: https://lore.kernel.org/linux-pci/20210314124256.70253-3-leon@kernel.org
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
A typical cloud provider SR-IOV use case is to create many VFs for use by
guest VMs. The VFs may not be assigned to a VM until a customer requests a
VM of a certain size, e.g., number of CPUs. A VF may need MSI-X vectors
proportional to the number of CPUs in the VM, but there is no standard way
to change the number of MSI-X vectors supported by a VF.
Some Mellanox ConnectX devices support dynamic assignment of MSI-X vectors
to SR-IOV VFs. This can be done by the PF driver after VFs are enabled,
and it can be done without affecting VFs that are already in use. The
hardware supports a limited pool of MSI-X vectors that can be assigned to
the PF or to individual VFs. This is device-specific behavior that
requires support in the PF driver.
Add a read-only "sriov_vf_total_msix" sysfs file for the PF and a writable
"sriov_vf_msix_count" file for each VF. Management software may use these
to learn how many MSI-X vectors are available and to dynamically assign
them to VFs before the VFs are passed through to a VM.
If the PF driver implements the ->sriov_get_vf_total_msix() callback,
"sriov_vf_total_msix" contains the total number of MSI-X vectors available
for distribution among VFs.
If no driver is bound to the VF, writing "N" to "sriov_vf_msix_count" uses
the PF driver ->sriov_set_msix_vec_count() callback to assign "N" MSI-X
vectors to the VF. When a VF driver subsequently reads the MSI-X Message
Control register, it will see the new Table Size "N".
Link: https://lore.kernel.org/linux-pci/20210314124256.70253-2-leon@kernel.org
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmBoyXQQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpiOXD/4wnXFHOmt5HMHmEXU4bB5b46snh3iCU4QM
w7pqgPMgS8uRZZ16FWpzDwccC2NZDoFRHhFxDrOqdKMO1DQlzk1yfSeKjNs9h2Oe
mJ6JuuaX6bEk2RRmjqqM5aMCazE2+tWtDydveL9OrJO6uiPcZYFPiRDwafDzQo4n
YPhUhPm+dqn/4Li6Fap2ieCeLqXNEUKpA0/Bd9QQV0Q/oZq5oLdJefIk8EMXH/tf
eKH2muZDjOV0FYdG8lPsNAF0c5qJ/aID+jlhyUz8Bkn31lOS96d5rzXoFq+AonsC
gVwLbaMcAibHrDjNxIQGcEU0VSjvvfy9GAfjJ3uSuyjpE4dNMe2fuU/B3rF6xb6E
upEfAik+frvzfFuZx11SZh/JwNBatJh35DVZZczI48YKk+s2MI0q9+lNINLtL6bD
3J287jnZbETU54WCMruiRwjQ3J1YWOj2pfxrPT9J0NZdQ0r7MXsDJecGNu+nL0X3
Ry7IpXUqabCf+4+XrGZ2NG6/kd5D/smatc5FqTkyeih3mw94iprNuWanzBlUZYQV
8ybrVtW37caAd1vyx0/p2I1LV5Y0eNGpSWz/WTDj0FINPCPjLq/VmPsIYgihmvz4
joWEzGnBKqds/CAvSneyz2d9MVlQ1083Az/Vi9g4w0IHG/p3ekQGBPDFBvQyszq9
pNkMpQ9Wxw==
=edGl
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.12-2021-04-03' of git://git.kernel.dk/linux-block
POull io_uring fix from Jens Axboe:
"Just fixing a silly braino in a previous patch, where we'd end up
failing to compile if CONFIG_BLOCK isn't enabled.
Not that a lot of people do that, but kernel bot spotted it and it's
probably prudent to just flush this out now before -rc6.
Sorry about that, none of my test compile configs have !CONFIG_BLOCK"
* tag 'io_uring-5.12-2021-04-03' of git://git.kernel.dk/linux-block:
io_uring: fix !CONFIG_BLOCK compilation failure
Ong Boon Leong says:
====================
stmmac: Add XDP support
This is the v4 patch series for adding XDP native support to stmmac.
Changes in v4:
5/6: Move TX clean timer setup to the end of NAPI RX process and
group it under stmmac_finalize_xdp_rx().
Also, fixed stmmac_xdp_xmit_back() returns STMMAC_XDP_CONSUMED
if XDP buffer conversion to XDP frame fails.
6/6: Move xdp_do_flush(0 into stmmac_finalize_xdp_rx() and combine
the XDP verdict of XDP TX and XDP REDIRECT together.
I retested the patch series on the 'xdp2' and 'xdp_redirect' related to
changes above and found the result to be satisfactory.
History of previous patch series:
v3: https://patchwork.kernel.org/project/netdevbpf/cover/20210331154135.8507-1-boon.leong.ong@intel.com/
v2: https://patchwork.kernel.org/project/netdevbpf/list/?series=457757
v1: https://patchwork.kernel.org/project/netdevbpf/list/?series=457139
It will be great if community can help to test or review the v4 series
and provide me any input if any.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the support of XDP_REDIRECT to another remote cpu for
further action. It also implements ndo_xdp_xmit ops, enabling the driver
to transmit packets forwarded to it by XDP program running on another
interface.
This patch has been tested using "xdp_redirect_cpu" for XDP_REDIRECT
+ drop testing. It also been tested with "xdp_redirect" sample app
which can be used to exercise ndo_xdp_xmit ops. The burst traffics are
generated using pktgen_sample03_burst_single_flow.sh in samples/pktgen
directory.
v4: Move xdp_do_flush() processing into stmmac_finalize_xdp_rx() and
combined the XDP verdict of XDP TX and REDIRECT together.
v3: Added 'nq->trans_start = jiffies' to avoid TX time-out as we are
sharing TX queue between slow path and XDP. Thanks to Jakub Kicinski
for point out.
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for XDP_TX action which enables XDP program to
transmit back received frames.
This patch has been tested with the "xdp2" app located in samples/bpf
dir. The DUT receives burst traffic packet generated using pktgen script
'pktgen_sample03_burst_single_flow.sh'.
v4: Moved stmmac_tx_timer_arm() to be done once at the end of NAPI RX.
Fixed stmmac_xdp_xmit_back() to return STMMAC_XDP_CONSUMED if
XDP buffer to frame conversion fails. Thanks to Jakub's input.
v3: Added 'nq->trans_start = jiffies' to avoid TX time-out as we are
sharing TX queue between slow path and XDP. Thanks to Jakub Kicinski
for pointing out.
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the initial XDP support to stmmac driver. It supports
XDP_PASS, XDP_DROP and XDP_ABORTED actions. Upcoming patches will add
support for XDP_TX and XDP_REDIRECT.
To support XDP headroom, this patch adds page_offset into RX buffer and
change the dma_sync_single_for_device|cpu(). The DMA address used for
RX operation are changed to take into page_offset too. As page_pool
can handle dma_sync_single_for_device() on behalf of driver with
PP_FLAG_DMA_SYNC_DEV flag, we skip doing that in stmmac driver.
Current stmmac driver supports split header support (SPH) in RX but
the flexibility of splitting header and payload at different position
makes it very complex to be supported for XDP processing. In addition,
jumbo frame is not supported in XDP to keep the initial codes simple.
This patch has been tested with the sample app "xdp1" located in
samples/bpf directory for both SKB and Native (XDP) mode. The burst
traffic generated using pktgen_sample03_burst_single_flow.sh in
samples/pktgen directory.
Changes in v3:
- factor in xdp header and tail adjustment done by XDP program.
Thanks to Jakub Kicinski for pointing out the gap in v2.
Changes in v2:
- fix for "warning: variable 'len' set but not used" reported by lkp.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>