Up to now several high speed NICs have custom mechanisms of recycling
the allocated memory they use for their payloads.
Our page_pool API already has recycling capabilities that are always
used when we are running in 'XDP mode'. So let's tweak the API and the
kernel network stack slightly and allow the recycling to happen even
during the standard operation.
The API doesn't take into account 'split page' policies used by those
drivers currently, but can be extended once we have users for that.
The idea is to be able to intercept the packet on skb_release_data().
If it's a buffer coming from our page_pool API recycle it back to the
pool for further usage or just release the packet entirely.
To achieve that we introduce a bit in struct sk_buff (pp_recycle:1) and
a field in struct page (page->pp) to store the page_pool pointer.
Storing the information in page->pp allows us to recycle both SKBs and
their fragments.
We could have skipped the skb bit entirely, since identical information
can bederived from struct page. However, in an effort to affect the free path
as less as possible, reading a single bit in the skb which is already
in cache, is better that trying to derive identical information for the
page stored data.
The driver or page_pool has to take care of the sync operations on it's own
during the buffer recycling since the buffer is, after opting-in to the
recycling, never unmapped.
Since the gain on the drivers depends on the architecture, we are not
enabling recycling by default if the page_pool API is used on a driver.
In order to enable recycling the driver must call skb_mark_for_recycle()
to store the information we need for recycling in page->pp and
enabling the recycling bit, or page_pool_store_mem_info() for a fragment.
Co-developed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Co-developed-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently the aging period for tcp/udp connections is hard coded to
30 seconds. Aged tcp/udp connections configure a hard coded 120/30
seconds pickup timeout for conntrack.
This configuration may be too aggressive or permissive for some users.
Dynamically configure the nf flow table GC timeout intervals according
to the user defined values.
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
UDP connections may be offloaded from nf conntrack to nf flow table.
Offloaded connections are aged after 30 seconds of inactivity.
Once aged, ownership is returned to conntrack with a hard coded pickup
time of 30 seconds, after which the connection may be deleted.
eted. The current aging intervals may be too aggressive for some users.
Provide users with the ability to control the nf flow table offload
aging and pickup time intervals via sysctl parameter as a pre-step for
configuring the nf flow table GC timeout intervals.
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
TCP connections may be offloaded from nf conntrack to nf flow table.
Offloaded connections are aged after 30 seconds of inactivity.
Once aged, ownership is returned to conntrack with a hard coded pickup
time of 120 seconds, after which the connection may be deleted.
eted. The current aging intervals may be too aggressive for some users.
Provide users with the ability to control the nf flow table offload
aging and pickup time intervals via sysctl parameter as a pre-step for
configuring the nf flow table GC timeout intervals.
Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
Reviewed-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
MPTCP is builtin, so no need to add EXPORT_SYMBOL()s.
It will be used to support SO_TIMESTAMP(NS) ancillary
messages in the mptcp receive path.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Similar to previous patch: expose SO_TIMESTAMPING helper so we do not
have to copy & paste this into the mptcp core.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This exports SO_TIMESTAMP_* function for re-use by MPTCP.
Without this there is too much copy & paste needed to support
this from mptcp setsockopt path.
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
caif_enroll_dev() can fail in some cases. Ingnoring
these cases can lead to memory leak due to not assigning
link_support pointer to anywhere.
Fixes: 7c18d2205e ("caif: Restructure how link caif link layer enroll")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added cfserl_release() function.
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Refactor DEVLINK_CMD_RATE_{GET|SET} command handlers to support setting
a node as a parent for another rate object (leaf or node) by means of
new attribute DEVLINK_ATTR_RATE_PARENT_NODE_NAME. Extend devlink ops
with new callbacks rate_{leaf|node}_parent_set() to set node as a parent
for rate object to allow supporting drivers to implement rate grouping
through devlink. Driver implementations are allowed to support leafs
or node children only. Invoking callback with NULL as parent should be
threated by the driver as unset parent action.
Extend rate object struct with reference counter to disallow deleting a
node with any child pointing to it. User should unset parent for the
child explicitly.
Example:
$ devlink port function rate add netdevsim/netdevsim10/group1
$ devlink port function rate add netdevsim/netdevsim10/group2
$ devlink port function rate set netdevsim/netdevsim10/group1 parent group2
$ devlink port function rate show netdevsim/netdevsim10/group1
netdevsim/netdevsim10/group1: type node parent group2
$ devlink port function rate set netdevsim/netdevsim10/group1 noparent
Co-developed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement support for DEVLINK_CMD_RATE_{NEW|DEL} commands that are used
to create and delete devlink rate nodes. Add new attribute
DEVLINK_ATTR_RATE_NODE_NAME that specify node name string. The node name
is an alphanumeric identifier. No valid node name can be a devlink port
index, eg. decimal number. Extend devlink ops with new callbacks
rate_node_{new|del}() and rate_node_tx_{share|max}_set() to allow
supporting drivers to implement ports rate grouping and setting tx rate
of rate nodes through devlink.
Expose devlink_rate_nodes_destroy() function to allow vendor driver do
proper cleanup of internally allocated resources for the nodes if the
driver goes down or due to any other reasons which requires nodes to be
destroyed.
Disallow moving device from switchdev to legacy mode if any node exists
on that device. User must explicitly delete nodes before switching mode.
Example:
$ devlink port function rate add netdevsim/netdevsim10/group1
$ devlink port function rate set netdevsim/netdevsim10/group1 \
tx_share 10mbit tx_max 100mbit
Add + set command can be combined:
$ devlink port function rate add netdevsim/netdevsim10/group1 \
tx_share 10mbit tx_max 100mbit
$ devlink port function rate show netdevsim/netdevsim10/group1
netdevsim/netdevsim10/group1: type node tx_share 10mbit tx_max 100mbit
$ devlink port function rate del netdevsim/netdevsim10/group1
Co-developed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implement support for DEVLINK_CMD_RATE_SET command with new attributes
DEVLINK_ATTR_RATE_TX_{SHARE|MAX} that are used to set devlink rate
shared/max tx rate values. Extend devlink ops with new callbacks
rate_leaf_tx_{share|max}_set() to allow supporting drivers to implement
rate control through devlink.
New attributes are optional. Driver implementations are allowed to
support either or both of them.
Shared rate example:
$ devlink port function rate set netdevsim/netdevsim10/0 tx_share 10mbit
$ devlink port function rate show netdevsim/netdevsim10/0
netdevsim/netdevsim10/0: type leaf tx_share 10mbit
Max rate example:
$ devlink port function rate set netdevsim/netdevsim10/0 tx_max 100mbit
$ devlink port function rate show netdevsim/netdevsim10/0
netdevsim/netdevsim10/0: type leaf tx_max 100mbit
Co-developed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow registering rate object for devlink ports with dedicated
devlink_rate_leaf_{create|destroy}() API. Implement new netlink
DEVLINK_CMD_RATE_GET command that is used to retrieve rate object info.
Add new DEVLINK_CMD_RATE_{NEW|DEL} commands that are used for
notifications when creating/deleting leaf rate object.
Rate API is intended to be used for rate limiting of individual
devlink ports (leafs) and their aggregates (nodes).
Example:
$ devlink port show
pci/0000:03:00.0/0
pci/0000:03:00.0/1
$ devlink port function rate show
pci/0000:03:00.0/0: type leaf
pci/0000:03:00.0/1: type leaf
Co-developed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Support for SCTP chunks matching on nf_tables, from Phil Sutter.
2) Skip LDMXCSR, we don't need a valid MXCSR state. From Stefano Brivio.
3) CONFIG_RETPOLINE for nf_tables set lookups, from Florian Westphal.
4) A few Kconfig leading spaces removal, from Juerg Haefliger.
5) Remove spinlock from xt_limit, from Jason Baron.
6) Remove useless initialization in xt_CT, oneliner from Yang Li.
7) Tree-wide replacement of netlink_unicast() by nfnetlink_unicast().
8) Reduce footprint of several structures: xt_action_param,
nft_pktinfo and nf_hook_state, from Florian.
10) Add nft_thoff() and nft_sk() helpers and use them, also from Florian.
11) Fix documentation in nf_tables pipapo avx2, from Florian Westphal.
12) Fix clang-12 fmt string warnings, also from Florian.
====================
Currently vlan modification action checks existence of vlan priority by
comparing it to 0. Therefore it is impossible to modify existing vlan
tag to have priority 0.
For example, the following tc command will change the vlan id but will
not affect vlan priority:
tc filter add dev eth1 ingress matchall action vlan modify id 300 \
priority 0 pipe mirred egress redirect dev eth2
The incoming packet on eth1:
ethertype 802.1Q (0x8100), vlan 200, p 4, ethertype IPv4
will be changed to:
ethertype 802.1Q (0x8100), vlan 300, p 4, ethertype IPv4
although the user has intended to have p == 0.
The fix is to add tcfv_push_prio_exists flag to struct tcf_vlan_params
and rely on it when deciding to set the priority.
Fixes: 45a497f2d1 (net/sched: act_vlan: Introduce TCA_VLAN_ACT_MODIFY vlan action)
Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When a netdev with active TLS offload goes down, tls_device_down is
called to stop the offload and tear down the TLS context. However, the
socket stays alive, and it still points to the TLS context, which is now
deallocated. If a netdev goes up, while the connection is still active,
and the data flow resumes after a number of TCP retransmissions, it will
lead to a use-after-free of the TLS context.
This commit addresses this bug by keeping the context alive until its
normal destruction, and implements the necessary fallbacks, so that the
connection can resume in software (non-offloaded) kTLS mode.
On the TX side tls_sw_fallback is used to encrypt all packets. The RX
side already has all the necessary fallbacks, because receiving
non-decrypted packets is supported. The thing needed on the RX side is
to block resync requests, which are normally produced after receiving
non-decrypted packets.
The necessary synchronization is implemented for a graceful teardown:
first the fallbacks are deployed, then the driver resources are released
(it used to be possible to have a tls_dev_resync after tls_dev_del).
A new flag called TLS_RX_DEV_DEGRADED is added to indicate the fallback
mode. It's used to skip the RX resync logic completely, as it becomes
useless, and some objects may be released (for example, resync_async,
which is allocated and freed by the driver).
Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RCU synchronization is guaranteed to finish in finite time, unlike a
busy loop that polls a flag. This patch is a preparation for the bugfix
in the next patch, where the same synchronize_net() call will also be
used to sync with the TX datapath.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit cb17ed29a7 ("mac80211: parse radiotap header when selecting Tx
queue") moved the code to validate the radiotap header from
ieee80211_monitor_start_xmit to ieee80211_parse_tx_radiotap. This made is
possible to share more code with the new Tx queue selection code for
injected frames. But at the same time, it now required the call of
ieee80211_parse_tx_radiotap at the beginning of functions which wanted to
handle the radiotap header. And this broke the rate parser for radiotap
header parser.
The radiotap parser for rates is operating most of the time only on the
data in the actual radiotap header. But for the 802.11a/b/g rates, it must
also know the selected band from the chandef information. But this
information is only written to the ieee80211_tx_info at the end of the
ieee80211_monitor_start_xmit - long after ieee80211_parse_tx_radiotap was
already called. The info->band information was therefore always 0
(NL80211_BAND_2GHZ) when the parser code tried to access it.
For a 5GHz only device, injecting a frame with 802.11a rates would cause a
NULL pointer dereference because local->hw.wiphy->bands[NL80211_BAND_2GHZ]
would most likely have been NULL when the radiotap parser searched for the
correct rate index of the driver.
Cc: stable@vger.kernel.org
Reported-by: Ben Greear <greearb@candelatech.com>
Fixes: cb17ed29a7 ("mac80211: parse radiotap header when selecting Tx queue")
Signed-off-by: Mathy Vanhoef <Mathy.Vanhoef@kuleuven.be>
[sven@narfation.org: added commit message]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Link: https://lore.kernel.org/r/20210530133226.40587-1-sven@narfation.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Commit a05829a722 ("cfg80211: avoid holding the RTNL when calling the
driver") dropped usage of RTNL here and replaced it with
hw->wiphy->mutex. But we didn't update the comments.
Fixes: a05829a722 ("cfg80211: avoid holding the RTNL when calling the driver")
Signed-off-by: Brian Norris <briannorris@chromium.org>
Link: https://lore.kernel.org/r/20210505202829.1039400-1-briannorris@chromium.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Init it on demand in the nft_compat expression. This reduces size
of nft_pktinfo from 48 to 24 bytes on x86_64.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The functions pass extra skb arg, but either its not used or the helpers
can already access it via pkt->skb.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This allows to change storage placement later on without changing readers.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This allows to change storage placement later on without changing readers.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Extend nft_set_do_lookup() to use direct calls when retpoline feature
is enabled.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Followup patch will add a CONFIG_RETPOLINE wrapper to avoid
the ops->lookup() indirection cost for retpoline builds.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for net:
1) Fix incorrect sockopts unregistration from error path,
from Florian Westphal.
2) A few patches to provide better error reporting when missing kernel
netfilter options are missing in .config.
3) Fix dormant table flag updates.
4) Memleak in IPVS when adding service with IP_VS_SVC_F_HASHED flag.
* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf:
ipvs: ignore IP_VS_SVC_F_HASHED flag when adding service
netfilter: nf_tables: fix table flag updates
netfilter: nf_tables: extended netlink error reporting for chain type
netfilter: nf_tables: missing error reporting for not selected expressions
netfilter: conntrack: unregister ipv4 sockopts on error unwind
====================
Link: https://lore.kernel.org/r/20210527190115.98503-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
can and wireless trees. Notably including fixes for the recently
announced "FragAttacks" WiFi vulnerabilities. Rather large batch,
touching some core parts of the stack, too, but nothing hair-raising.
Current release - regressions:
- tipc: make node link identity publish thread safe
- dsa: felix: re-enable TAS guard band mode
- stmmac: correct clocks enabled in stmmac_vlan_rx_kill_vid()
- stmmac: fix system hang if change mac address after interface ifdown
Current release - new code bugs:
- mptcp: avoid OOB access in setsockopt()
- bpf: Fix nested bpf_bprintf_prepare with more per-cpu buffers
- ethtool: stats: fix a copy-paste error - init correct array size
Previous releases - regressions:
- sched: fix packet stuck problem for lockless qdisc
- net: really orphan skbs tied to closing sk
- mlx4: fix EEPROM dump support
- bpf: fix alu32 const subreg bound tracking on bitwise operations
- bpf: fix mask direction swap upon off reg sign change
- bpf, offload: reorder offload callback 'prepare' in verifier
- stmmac: Fix MAC WoL not working if PHY does not support WoL
- packetmmap: fix only tx timestamp on request
- tipc: skb_linearize the head skb when reassembling msgs
Previous releases - always broken:
- mac80211: address recent "FragAttacks" vulnerabilities
- mac80211: do not accept/forward invalid EAPOL frames
- mptcp: avoid potential error message floods
- bpf, ringbuf: deny reserve of buffers larger than ringbuf to prevent
out of buffer writes
- bpf: forbid trampoline attach for functions with variable arguments
- bpf: add deny list of functions to prevent inf recursion of tracing
programs
- tls splice: check SPLICE_F_NONBLOCK instead of MSG_DONTWAIT
- can: isotp: prevent race between isotp_bind() and isotp_setsockopt()
- netfilter: nft_set_pipapo_avx2: Add irq_fpu_usable() check,
fallback to non-AVX2 version
Misc:
- bpf: add kconfig knob for disabling unpriv bpf by default
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmCuy2gACgkQMUZtbf5S
IruE5BAAhihia5EaiV71Bz/Cqr/d+osv5u283riKT8kBft0bWFVFFnT3iweWyR0/
5X+bB6zmr80Cuqh45ZeYyq+zJtiAAlsbD5hqBIGdMriSWLxciNKjVJRzuEjuqnek
USMW/LqGyf4NhmLogmQKpx8XcKSG7VYuK7vPrsH8us1dL5vIssceIXn8R9Dzj9NN
P77K5Z+Oka8XQJgetNLxR3tDAM/92RwIshotkhJbRwgiUvzb+wbnrnSOAZCIPgku
ydJyOxOklln1Sx07SejgzEl33ri0CkioDPThBWpOn7Mu0JrYKukXPKludoZcRYuJ
2jNLYfbH0ZS5EkOfk89h7j7MDoAJMUK72M+S1w5DEYz6eH2EjhAq9noZ6E1iQH+U
9vfoIvQjPh6Zhyk5QeM4dpt0cvR7rSElXkLVxo/x0dSBAi2rIng1bKeCUtv2J689
CsoD0oghtEzvUTYVxY6iNr15OFGl6KsZv4tVQ709gGA36sDlK8ozGbJH5WReobBl
f8H2WJlj2tVW5V75yUoio8TumDw34yk/5xlJFzm9GOwkqBrUcqOraHtHdUIsa4qr
KbELQQ9QVt4zYdLAiWy5BL/QLycp0ibmA1IB8W1bxEVSK1JXzREHzPxv85KOfZkn
8+vzNHmk2PEZYYsExiEykc5jXKOCPs8L0rJ6p4OverlbpDZcwIg=
=peMK
-----END PGP SIGNATURE-----
Merge tag 'net-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.13-rc4, including fixes from bpf, netfilter,
can and wireless trees. Notably including fixes for the recently
announced "FragAttacks" WiFi vulnerabilities. Rather large batch,
touching some core parts of the stack, too, but nothing hair-raising.
Current release - regressions:
- tipc: make node link identity publish thread safe
- dsa: felix: re-enable TAS guard band mode
- stmmac: correct clocks enabled in stmmac_vlan_rx_kill_vid()
- stmmac: fix system hang if change mac address after interface
ifdown
Current release - new code bugs:
- mptcp: avoid OOB access in setsockopt()
- bpf: Fix nested bpf_bprintf_prepare with more per-cpu buffers
- ethtool: stats: fix a copy-paste error - init correct array size
Previous releases - regressions:
- sched: fix packet stuck problem for lockless qdisc
- net: really orphan skbs tied to closing sk
- mlx4: fix EEPROM dump support
- bpf: fix alu32 const subreg bound tracking on bitwise operations
- bpf: fix mask direction swap upon off reg sign change
- bpf, offload: reorder offload callback 'prepare' in verifier
- stmmac: Fix MAC WoL not working if PHY does not support WoL
- packetmmap: fix only tx timestamp on request
- tipc: skb_linearize the head skb when reassembling msgs
Previous releases - always broken:
- mac80211: address recent "FragAttacks" vulnerabilities
- mac80211: do not accept/forward invalid EAPOL frames
- mptcp: avoid potential error message floods
- bpf, ringbuf: deny reserve of buffers larger than ringbuf to
prevent out of buffer writes
- bpf: forbid trampoline attach for functions with variable arguments
- bpf: add deny list of functions to prevent inf recursion of tracing
programs
- tls splice: check SPLICE_F_NONBLOCK instead of MSG_DONTWAIT
- can: isotp: prevent race between isotp_bind() and
isotp_setsockopt()
- netfilter: nft_set_pipapo_avx2: Add irq_fpu_usable() check,
fallback to non-AVX2 version
Misc:
- bpf: add kconfig knob for disabling unpriv bpf by default"
* tag 'net-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (172 commits)
net: phy: Document phydev::dev_flags bits allocation
mptcp: validate 'id' when stopping the ADD_ADDR retransmit timer
mptcp: avoid error message on infinite mapping
mptcp: drop unconditional pr_warn on bad opt
mptcp: avoid OOB access in setsockopt()
nfp: update maintainer and mailing list addresses
net: mvpp2: add buffer header handling in RX
bnx2x: Fix missing error code in bnx2x_iov_init_one()
net: zero-initialize tc skb extension on allocation
net: hns: Fix kernel-doc
sctp: fix the proc_handler for sysctl encap_port
sctp: add the missing setting for asoc encap_port
bpf, selftests: Adjust few selftest result_unpriv outcomes
bpf: No need to simulate speculative domain for immediates
bpf: Fix mask direction swap upon off reg sign change
bpf: Wrap aux data inside bpf_sanitize_info container
bpf: Fix BPF_LSM kconfig symbol dependency
selftests/bpf: Add test for l3 use of bpf_redirect_peer
bpftool: Add sock_release help info for cgroup attach/prog load command
net: dsa: microchip: enable phy errata workaround on 9567
...
This patch adds two flags BPF_F_BROADCAST and BPF_F_EXCLUDE_INGRESS to
extend xdp_redirect_map for broadcast support.
With BPF_F_BROADCAST the packet will be broadcasted to all the interfaces
in the map. with BPF_F_EXCLUDE_INGRESS the ingress interface will be
excluded when do broadcasting.
When getting the devices in dev hash map via dev_map_hash_get_next_key(),
there is a possibility that we fall back to the first key when a device
was removed. This will duplicate packets on some interfaces. So just walk
the whole buckets to avoid this issue. For dev array map, we also walk the
whole map to find valid interfaces.
Function bpf_clear_redirect_map() was removed in
commit ee75aef23a ("bpf, xdp: Restructure redirect actions").
Add it back as we need to use ri->map again.
With test topology:
+-------------------+ +-------------------+
| Host A (i40e 10G) | ---------- | eno1(i40e 10G) |
+-------------------+ | |
| Host B |
+-------------------+ | |
| Host C (i40e 10G) | ---------- | eno2(i40e 10G) |
+-------------------+ | |
| +------+ |
| veth0 -- | Peer | |
| veth1 -- | | |
| veth2 -- | NS | |
| +------+ |
+-------------------+
On Host A:
# pktgen/pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -s 64
On Host B(Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 128G Memory):
Use xdp_redirect_map and xdp_redirect_map_multi in samples/bpf for testing.
All the veth peers in the NS have a XDP_DROP program loaded. The
forward_map max_entries in xdp_redirect_map_multi is modify to 4.
Testing the performance impact on the regular xdp_redirect path with and
without patch (to check impact of additional check for broadcast mode):
5.12 rc4 | redirect_map i40e->i40e | 2.0M | 9.7M
5.12 rc4 | redirect_map i40e->veth | 1.7M | 11.8M
5.12 rc4 + patch | redirect_map i40e->i40e | 2.0M | 9.6M
5.12 rc4 + patch | redirect_map i40e->veth | 1.7M | 11.7M
Testing the performance when cloning packets with the redirect_map_multi
test, using a redirect map size of 4, filled with 1-3 devices:
5.12 rc4 + patch | redirect_map multi i40e->veth (x1) | 1.7M | 11.4M
5.12 rc4 + patch | redirect_map multi i40e->veth (x2) | 1.1M | 4.3M
5.12 rc4 + patch | redirect_map multi i40e->veth (x3) | 0.8M | 2.6M
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/20210519090747.1655268-3-liuhangbin@gmail.com
The dormant flag need to be updated from the preparation phase,
otherwise, two consecutive requests to dorm a table in the same batch
might try to remove the same hooks twice, resulting in the following
warning:
hook not found, pf 3 num 0
WARNING: CPU: 0 PID: 334 at net/netfilter/core.c:480 __nf_unregister_net_hook+0x1eb/0x610 net/netfilter/core.c:480
Modules linked in:
CPU: 0 PID: 334 Comm: kworker/u4:5 Not tainted 5.12.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: netns cleanup_net
RIP: 0010:__nf_unregister_net_hook+0x1eb/0x610 net/netfilter/core.c:480
This patch is a partial revert of 0ce7cf4127 ("netfilter: nftables:
update table flags from the commit phase") to restore the previous
behaviour.
However, there is still another problem: A batch containing a series of
dorm-wakeup-dorm table and vice-versa also trigger the warning above
since hook unregistration happens from the preparation phase, while hook
registration occurs from the commit phase.
To fix this problem, this patch adds two internal flags to annotate the
original dormant flag status which are __NFT_TABLE_F_WAS_DORMANT and
__NFT_TABLE_F_WAS_AWAKEN, to restore it from the abort path.
The __NFT_TABLE_F_UPDATE bitmask allows to handle the dormant flag update
with one single transaction.
Reported-by: syzbot+7ad5cd1615f2d89c6e7e@syzkaller.appspotmail.com
Fixes: 0ce7cf4127 ("netfilter: nftables: update table flags from the commit phase")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
A subsequent patch will add a new multipath hash policy where the packet
fields used for multipath hash calculation are determined by user space.
This patch adds a sysctl that allows user space to set these fields.
The packet fields are represented using a bitmask and are common between
IPv4 and IPv6 to allow user space to use the same numbering across both
protocols. For example, to hash based on standard 5-tuple:
# sysctl -w net.ipv6.fib_multipath_hash_fields=0x0037
net.ipv6.fib_multipath_hash_fields = 0x0037
To avoid introducing holes in 'struct netns_sysctl_ipv6', move the
'bindv6only' field after the multipath hash fields.
The kernel rejects unknown fields, for example:
# sysctl -w net.ipv6.fib_multipath_hash_fields=0x1000
sysctl: setting key "net.ipv6.fib_multipath_hash_fields": Invalid argument
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
A subsequent patch will add a new multipath hash policy where the packet
fields used for multipath hash calculation are determined by user space.
This patch adds a sysctl that allows user space to set these fields.
The packet fields are represented using a bitmask and are common between
IPv4 and IPv6 to allow user space to use the same numbering across both
protocols. For example, to hash based on standard 5-tuple:
# sysctl -w net.ipv4.fib_multipath_hash_fields=0x0037
net.ipv4.fib_multipath_hash_fields = 0x0037
The kernel rejects unknown fields, for example:
# sysctl -w net.ipv4.fib_multipath_hash_fields=0x1000
sysctl: setting key "net.ipv4.fib_multipath_hash_fields": Invalid argument
More fields can be added in the future, if needed.
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Every protocol has the 'netns_ok' member and it is euqal to 1. The
'if (!prot->netns_ok)' always false in inet_add_protocol().
Signed-off-by: Yejune Deng <yejunedeng@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
32-bit architectures which expect 8-byte alignment for 8-byte integers and
need 64-bit DMA addresses (arm, mips, ppc) had their struct page
inadvertently expanded in 2019. When the dma_addr_t was added, it forced
the alignment of the union to 8 bytes, which inserted a 4 byte gap between
'flags' and the union.
Fix this by storing the dma_addr_t in one or two adjacent unsigned longs.
This restores the alignment to that of an unsigned long. We always
store the low bits in the first word to prevent the PageTail bit from
being inadvertently set on a big endian platform. If that happened,
get_user_pages_fast() racing against a page which was freed and
reallocated to the page_pool could dereference a bogus compound_head(),
which would be hard to trace back to this cause.
Link: https://lkml.kernel.org/r/20210510153211.1504886-1-willy@infradead.org
Fixes: c25fff7171 ("mm: add dma_addr_t to struct page")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Matteo Croce <mcroce@linux.microsoft.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently qdisc_run() checks the STATE_DEACTIVATED of lockless
qdisc before calling __qdisc_run(), which ultimately clear the
STATE_MISSED when all the skb is dequeued. If STATE_DEACTIVATED
is set before clearing STATE_MISSED, there may be rescheduling
of net_tx_action() at the end of qdisc_run_end(), see below:
CPU0(net_tx_atcion) CPU1(__dev_xmit_skb) CPU2(dev_deactivate)
. . .
. set STATE_MISSED .
. __netif_schedule() .
. . set STATE_DEACTIVATED
. . qdisc_reset()
. . .
.<--------------- . synchronize_net()
clear __QDISC_STATE_SCHED | . .
. | . .
. | . some_qdisc_is_busy()
. | . return *false*
. | . .
test STATE_DEACTIVATED | . .
__qdisc_run() *not* called | . .
. | . .
test STATE_MISS | . .
__netif_schedule()--------| . .
. . .
. . .
__qdisc_run() is not called by net_tx_atcion() in CPU0 because
CPU2 has set STATE_DEACTIVATED flag during dev_deactivate(), and
STATE_MISSED is only cleared in __qdisc_run(), __netif_schedule
is called at the end of qdisc_run_end(), causing tx action
rescheduling problem.
qdisc_run() called by net_tx_action() runs in the softirq context,
which should has the same semantic as the qdisc_run() called by
__dev_xmit_skb() protected by rcu_read_lock_bh(). And there is a
synchronize_net() between STATE_DEACTIVATED flag being set and
qdisc_reset()/some_qdisc_is_busy in dev_deactivate(), we can safely
bail out for the deactived lockless qdisc in net_tx_action(), and
qdisc_reset() will reset all skb not dequeued yet.
So add the rcu_read_lock() explicitly to protect the qdisc_run()
and do the STATE_DEACTIVATED checking in net_tx_action() before
calling qdisc_run_begin(). Another option is to do the checking in
the qdisc_run_end(), but it will add unnecessary overhead for
non-tx_action case, because __dev_queue_xmit() will not see qdisc
with STATE_DEACTIVATED after synchronize_net(), the qdisc with
STATE_DEACTIVATED can only be seen by net_tx_action() because of
__netif_schedule().
The STATE_DEACTIVATED checking in qdisc_run() is to avoid race
between net_tx_action() and qdisc_reset(), see:
commit d518d2ed86 ("net/sched: fix race between deactivation
and dequeue for NOLOCK qdisc"). As the bailout added above for
deactived lockless qdisc in net_tx_action() provides better
protection for the race without calling qdisc_run() at all, so
remove the STATE_DEACTIVATED checking in qdisc_run().
After qdisc_reset(), there is no skb in qdisc to be dequeued, so
clear the STATE_MISSED in dev_reset_queue() too.
Fixes: 6b3ba9146f ("net: sched: allow qdiscs to handle locking")
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
V8: Clearing STATE_MISSED before calling __netif_schedule() has
avoid the endless rescheduling problem, but there may still
be a unnecessary rescheduling, so adjust the commit log.
Signed-off-by: David S. Miller <davem@davemloft.net>
Lockless qdisc has below concurrent problem:
cpu0 cpu1
. .
q->enqueue .
. .
qdisc_run_begin() .
. .
dequeue_skb() .
. .
sch_direct_xmit() .
. .
. q->enqueue
. qdisc_run_begin()
. return and do nothing
. .
qdisc_run_end() .
cpu1 enqueue a skb without calling __qdisc_run() because cpu0
has not released the lock yet and spin_trylock() return false
for cpu1 in qdisc_run_begin(), and cpu0 do not see the skb
enqueued by cpu1 when calling dequeue_skb() because cpu1 may
enqueue the skb after cpu0 calling dequeue_skb() and before
cpu0 calling qdisc_run_end().
Lockless qdisc has below another concurrent problem when
tx_action is involved:
cpu0(serving tx_action) cpu1 cpu2
. . .
. q->enqueue .
. qdisc_run_begin() .
. dequeue_skb() .
. . q->enqueue
. . .
. sch_direct_xmit() .
. . qdisc_run_begin()
. . return and do nothing
. . .
clear __QDISC_STATE_SCHED . .
qdisc_run_begin() . .
return and do nothing . .
. . .
. qdisc_run_end() .
This patch fixes the above data race by:
1. If the first spin_trylock() return false and STATE_MISSED is
not set, set STATE_MISSED and retry another spin_trylock() in
case other CPU may not see STATE_MISSED after it releases the
lock.
2. reschedule if STATE_MISSED is set after the lock is released
at the end of qdisc_run_end().
For tx_action case, STATE_MISSED is also set when cpu1 is at the
end if qdisc_run_end(), so tx_action will be rescheduled again
to dequeue the skb enqueued by cpu2.
Clear STATE_MISSED before retrying a dequeuing when dequeuing
returns NULL in order to reduce the overhead of the second
spin_trylock() and __netif_schedule() calling.
Also clear the STATE_MISSED before calling __netif_schedule()
at the end of qdisc_run_end() to avoid doing another round of
dequeuing in the pfifo_fast_dequeue().
The performance impact of this patch, tested using pktgen and
dummy netdev with pfifo_fast qdisc attached:
threads without+this_patch with+this_patch delta
1 2.61Mpps 2.60Mpps -0.3%
2 3.97Mpps 3.82Mpps -3.7%
4 5.62Mpps 5.59Mpps -0.5%
8 2.78Mpps 2.77Mpps -0.3%
16 2.22Mpps 2.22Mpps -0.0%
Fixes: 6b3ba9146f ("net: sched: allow qdiscs to handle locking")
Acked-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Offloading conns could fail for multiple reasons and a hw refresh bit is
set to try to reoffload it in next sw packet.
But it could be in some cases and future points that the hw refresh bit
is not set but a refresh could succeed.
Remove the hw refresh bit and do offload refresh if requested.
There won't be a new work entry if a work is already pending
anyway as there is the hw pending bit.
Fixes: 8b3646d6e0 ("net/sched: act_ct: Support refreshing the flow table entries")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
If the owing socket is shutting down - e.g. the sock reference
count already dropped to 0 and only sk_wmem_alloc is keeping
the sock alive, skb_orphan_partial() becomes a no-op.
When forwarding packets over veth with GRO enabled, the above
causes refcount errors.
This change addresses the issue with a plain skb_orphan() call
in the critical scenario.
Fixes: 9adc89af72 ("net: let skb_orphan_partial wake-up waiters.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
and netfilter trees. Self-contained fixes, nothing risky.
Current release - new code bugs:
- dsa: ksz: fix a few bugs found by static-checker in the new driver
- stmmac: fix frame preemption handshake not triggering after
interface restart
Previous releases - regressions:
- make nla_strcmp handle more then one trailing null character
- fix stack OOB reads while fragmenting IPv4 packets in openvswitch
and net/sched
- sctp: do asoc update earlier in sctp_sf_do_dupcook_a
- sctp: delay auto_asconf init until binding the first addr
- stmmac: clear receive all(RA) bit when promiscuous mode is off
- can: mcp251x: fix resume from sleep before interface was brought up
Previous releases - always broken:
- bpf: fix leakage of uninitialized bpf stack under speculation
- bpf: fix masking negation logic upon negative dst register
- netfilter: don't assume that skb_header_pointer() will never fail
- only allow init netns to set default tcp cong to a restricted algo
- xsk: fix xp_aligned_validate_desc() when len == chunk_size to
avoid false positive errors
- ethtool: fix missing NLM_F_MULTI flag when dumping
- can: m_can: m_can_tx_work_queue(): fix tx_skb race condition
- sctp: fix a SCTP_MIB_CURRESTAB leak in sctp_sf_do_dupcook_b
- bridge: fix NULL-deref caused by a races between assigning
rx_handler_data and setting the IFF_BRIDGE_PORT bit
Latecomer:
- seg6: add counters support for SRv6 Behaviors
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmCV3YoACgkQMUZtbf5S
IrsQ2w//Q8/qbl6wGTKUfu6DZHYUU5j5sTwiHR823PKKSgXI+okWMN0KUlZszOsz
qnPkH6GuojRooOE1s8PFLSlt9axKhQ0y7uzMTrWYafQ+JZTtgg9/MiPxQ8fdiE5i
uOG1ngttZ+1jlE5tMPL4GAOSegg3rWVDclzqnJTdsPPOco3MWj6SL9xN0LDPxCEL
BDysRqL/UiOIoh4v6IXQRx2UWjsNGu4biM1po+Jfumnd9T0zKoEpzu6UN6yPShbx
284LihZSQtughCbhGqkErBOxfjZcvpFOQrqmjEvI+Z/eYg4InfWZemt8Sa92/alE
yAFjK76MUTaUxaAO/gk8XauhvkYOzJJwKpqhbOmlaM7oj55QdzT5/8JxMxVoA6hV
pscHOixk15GVse49PdPV8v47cyTLc/Xi69i+/uUdNVVfuORL1wft1w1xbd0S6Pbe
7Gqax21S7zxcDsrUli7cFheYiqtbQAL0anlIUz8tUOZFz0VQ/zPuFd4rUYZ/o38V
Mrevdk3t6CXNxS4CRXyUW4UejYB1O6Qw12sUue31e3h73d6LiN3NAiN5Qp7SEk1/
fvk+jfOf8vvmtimYvcUK2i0D+vqj4Ec/qRIE/XXuUDBcp22tPL9uWMfWavwTdAj1
Se4SzksTWF+NM0lO0ItonMyPh3ZXcSLhIv/gHrZwEKuWkXCGO4M=
=JmWS
-----END PGP SIGNATURE-----
Merge tag 'net-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.13-rc1, including fixes from bpf, can and
netfilter trees. Self-contained fixes, nothing risky.
Current release - new code bugs:
- dsa: ksz: fix a few bugs found by static-checker in the new driver
- stmmac: fix frame preemption handshake not triggering after
interface restart
Previous releases - regressions:
- make nla_strcmp handle more then one trailing null character
- fix stack OOB reads while fragmenting IPv4 packets in openvswitch
and net/sched
- sctp: do asoc update earlier in sctp_sf_do_dupcook_a
- sctp: delay auto_asconf init until binding the first addr
- stmmac: clear receive all(RA) bit when promiscuous mode is off
- can: mcp251x: fix resume from sleep before interface was brought up
Previous releases - always broken:
- bpf: fix leakage of uninitialized bpf stack under speculation
- bpf: fix masking negation logic upon negative dst register
- netfilter: don't assume that skb_header_pointer() will never fail
- only allow init netns to set default tcp cong to a restricted algo
- xsk: fix xp_aligned_validate_desc() when len == chunk_size to avoid
false positive errors
- ethtool: fix missing NLM_F_MULTI flag when dumping
- can: m_can: m_can_tx_work_queue(): fix tx_skb race condition
- sctp: fix a SCTP_MIB_CURRESTAB leak in sctp_sf_do_dupcook_b
- bridge: fix NULL-deref caused by a races between assigning
rx_handler_data and setting the IFF_BRIDGE_PORT bit
Latecomer:
- seg6: add counters support for SRv6 Behaviors"
* tag 'net-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (73 commits)
atm: firestream: Use fallthrough pseudo-keyword
net: stmmac: Do not enable RX FIFO overflow interrupts
mptcp: fix splat when closing unaccepted socket
i40e: Remove LLDP frame filters
i40e: Fix PHY type identifiers for 2.5G and 5G adapters
i40e: fix the restart auto-negotiation after FEC modified
i40e: Fix use-after-free in i40e_client_subtask()
i40e: fix broken XDP support
netfilter: nftables: avoid potential overflows on 32bit arches
netfilter: nftables: avoid overflows in nft_hash_buckets()
tcp: Specify cmsgbuf is user pointer for receive zerocopy.
mlxsw: spectrum_mr: Update egress RIF list before route's action
net: ipa: fix inter-EE IRQ register definitions
can: m_can: m_can_tx_work_queue(): fix tx_skb race condition
can: mcp251x: fix resume from sleep before interface was brought up
can: mcp251xfd: mcp251xfd_probe(): add missing can_rx_offload_del() in error path
can: mcp251xfd: mcp251xfd_probe(): fix an error pointer dereference in probe
netfilter: nftables: Fix a memleak from userdata error path in new objects
netfilter: remove BUG_ON() after skb_header_pointer()
netfilter: nfnetlink_osf: Fix a missing skb_header_pointer() NULL check
...
The same thing should be done for sctp_sf_do_dupcook_b().
Meanwhile, SCTP_CMD_UPDATE_ASSOC cmd can be removed.
v1->v2:
- Fix the return value in sctp_sf_do_assoc_update().
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There are cases where the page_pool need to refill with pages from the
page allocator. Some workloads cause the page_pool to release pages
instead of recycling these pages.
For these workload it can improve performance to bulk alloc pages from the
page-allocator to refill the alloc cache.
For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
redirecting xdp_frame packets into a veth, that does XDP_PASS to create an
SKB from the xdp_frame, which then cannot return the page to the
page_pool.
Performance results under GitHub xdp-project[1]:
[1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org
Mel: The patch "net: page_pool: convert to use alloc_pages_bulk_array
variant" was squashed with this patch. From the test page, the array
variant was superior with one of the test results as follows.
Kernel XDP stats CPU pps Delta
Baseline XDP-RX CPU total 3,771,046 n/a
List XDP-RX CPU total 3,940,242 +4.49%
Array XDP-RX CPU total 4,249,224 +12.68%
Link: https://lkml.kernel.org/r/20210325114228.27719-10-mgorman@techsingularity.net
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Alexander Lobakin <alobakin@pm.me>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: David Miller <davem@davemloft.net>
Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Core:
- bpf:
- allow bpf programs calling kernel functions (initially to
reuse TCP congestion control implementations)
- enable task local storage for tracing programs - remove the
need to store per-task state in hash maps, and allow tracing
programs access to task local storage previously added for
BPF_LSM
- add bpf_for_each_map_elem() helper, allowing programs to
walk all map elements in a more robust and easier to verify
fashion
- sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
redirection
- lpm: add support for batched ops in LPM trie
- add BTF_KIND_FLOAT support - mostly to allow use of BTF
on s390 which has floats in its headers files
- improve BPF syscall documentation and extend the use of kdoc
parsing scripts we already employ for bpf-helpers
- libbpf, bpftool: support static linking of BPF ELF files
- improve support for encapsulation of L2 packets
- xdp: restructure redirect actions to avoid a runtime lookup,
improving performance by 4-8% in microbenchmarks
- xsk: build skb by page (aka generic zerocopy xmit) - improve
performance of software AF_XDP path by 33% for devices
which don't need headers in the linear skb part (e.g. virtio)
- nexthop: resilient next-hop groups - improve path stability
on next-hops group changes (incl. offload for mlxsw)
- ipv6: segment routing: add support for IPv4 decapsulation
- icmp: add support for RFC 8335 extended PROBE messages
- inet: use bigger hash table for IP ID generation
- tcp: deal better with delayed TX completions - make sure we don't
give up on fast TCP retransmissions only because driver is
slow in reporting that it completed transmitting the original
- tcp: reorder tcp_congestion_ops for better cache locality
- mptcp:
- add sockopt support for common TCP options
- add support for common TCP msg flags
- include multiple address ids in RM_ADDR
- add reset option support for resetting one subflow
- udp: GRO L4 improvements - improve 'forward' / 'frag_list'
co-existence with UDP tunnel GRO, allowing the first to take
place correctly even for encapsulated UDP traffic
- micro-optimize dev_gro_receive() and flow dissection, avoid
retpoline overhead on VLAN and TEB GRO
- use less memory for sysctls, add a new sysctl type, to allow using
u8 instead of "int" and "long" and shrink networking sysctls
- veth: allow GRO without XDP - this allows aggregating UDP
packets before handing them off to routing, bridge, OvS, etc.
- allow specifing ifindex when device is moved to another namespace
- netfilter:
- nft_socket: add support for cgroupsv2
- nftables: add catch-all set element - special element used
to define a default action in case normal lookup missed
- use net_generic infra in many modules to avoid allocating
per-ns memory unnecessarily
- xps: improve the xps handling to avoid potential out-of-bound
accesses and use-after-free when XPS change race with other
re-configuration under traffic
- add a config knob to turn off per-cpu netdev refcnt to catch
underflows in testing
Device APIs:
- add WWAN subsystem to organize the WWAN interfaces better and
hopefully start driving towards more unified and vendor-
-independent APIs
- ethtool:
- add interface for reading IEEE MIB stats (incl. mlx5 and
bnxt support)
- allow network drivers to dump arbitrary SFP EEPROM data,
current offset+length API was a poor fit for modern SFP
which define EEPROM in terms of pages (incl. mlx5 support)
- act_police, flow_offload: add support for packet-per-second
policing (incl. offload for nfp)
- psample: add additional metadata attributes like transit delay
for packets sampled from switch HW (and corresponding egress
and policy-based sampling in the mlxsw driver)
- dsa: improve support for sandwiched LAGs with bridge and DSA
- netfilter:
- flowtable: use direct xmit in topologies with IP
forwarding, bridging, vlans etc.
- nftables: counter hardware offload support
- Bluetooth:
- improvements for firmware download w/ Intel devices
- add support for reading AOSP vendor capabilities
- add support for virtio transport driver
- mac80211:
- allow concurrent monitor iface and ethernet rx decap
- set priority and queue mapping for injected frames
- phy: add support for Clause-45 PHY Loopback
- pci/iov: add sysfs MSI-X vector assignment interface
to distribute MSI-X resources to VFs (incl. mlx5 support)
New hardware/drivers:
- dsa: mv88e6xxx: add support for Marvell mv88e6393x -
11-port Ethernet switch with 8x 1-Gigabit Ethernet
and 3x 10-Gigabit interfaces.
- dsa: support for legacy Broadcom tags used on BCM5325, BCM5365
and BCM63xx switches
- Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches
- ath11k: support for QCN9074 a 802.11ax device
- Bluetooth: Broadcom BCM4330 and BMC4334
- phy: Marvell 88X2222 transceiver support
- mdio: add BCM6368 MDIO mux bus controller
- r8152: support RTL8153 and RTL8156 (USB Ethernet) chips
- mana: driver for Microsoft Azure Network Adapter (MANA)
- Actions Semi Owl Ethernet MAC
- can: driver for ETAS ES58X CAN/USB interfaces
Pure driver changes:
- add XDP support to: enetc, igc, stmmac
- add AF_XDP support to: stmmac
- virtio:
- page_to_skb() use build_skb when there's sufficient tailroom
(21% improvement for 1000B UDP frames)
- support XDP even without dedicated Tx queues - share the Tx
queues with the stack when necessary
- mlx5:
- flow rules: add support for mirroring with conntrack,
matching on ICMP, GTP, flex filters and more
- support packet sampling with flow offloads
- persist uplink representor netdev across eswitch mode
changes
- allow coexistence of CQE compression and HW time-stamping
- add ethtool extended link error state reporting
- ice, iavf: support flow filters, UDP Segmentation Offload
- dpaa2-switch:
- move the driver out of staging
- add spanning tree (STP) support
- add rx copybreak support
- add tc flower hardware offload on ingress traffic
- ionic:
- implement Rx page reuse
- support HW PTP time-stamping
- octeon: support TC hardware offloads - flower matching on ingress
and egress ratelimitting.
- stmmac:
- add RX frame steering based on VLAN priority in tc flower
- support frame preemption (FPE)
- intel: add cross time-stamping freq difference adjustment
- ocelot:
- support forwarding of MRP frames in HW
- support multiple bridges
- support PTP Sync one-step timestamping
- dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
learning, flooding etc.
- ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
SC7280 SoCs)
- mt7601u: enable TDLS support
- mt76:
- add support for 802.3 rx frames (mt7915/mt7615)
- mt7915 flash pre-calibration support
- mt7921/mt7663 runtime power management fixes
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmCKFPIACgkQMUZtbf5S
Irtw0g/+NA8bWdHNgG4H5rya0pv2z3IieLRmSdDfKRQQXcJpklawc5MKVVaTee/Q
5/QqgPdCsu1LAU6JXBKsKmyDDaMlQKdWuKbOqDSiAQKoMesZStTEHf9d851ZzgxA
Cdb6O7BD3lBl/IN+oxNG+KcmD1LKquTPKGySq2mQtEdLO12ekAsranzmj4voKffd
q9tBShpXQ7Dq77DLYfiQXVCvsizNcbbJFuxX0o9Lpb9+61ZyYAbogZSa9ypiZZwR
I/9azRBtJg7UV1aD/cLuAfy66Qh7t63+rCxVazs5Os8jVO26P/jQdisnnOe/x+p9
wYEmKm3GSu0V4SAPxkWW+ooKusflCeqDoMIuooKt6kbP6BRj540veGw3Ww/m5YFr
7pLQkTSP/tSjuGQIdBE1LOP5LBO8DZeC8Kiop9V0fzAW9hFSZbEq25WW0bPj8QQO
zA4Z7yWlslvxcfY2BdJX3wD8klaINkl/8fDWZFFsBdfFX2VeLtm7Xfduw34BJpvU
rYT3oWr6PhtkPAKR32SUcemSfeWgIVU41eSshzRz3kez1NngBUuLlSGGSEaKbes5
pZVt6pYFFVByyf6MTHFEoQvafZfEw04JILZpo4R5V8iTHzom0kD3Py064sBiXEw2
B6t+OW4qgcxGblpFkK2lD4kR2s1TPUs0ckVO6sAy1x8q60KKKjY=
=vcbA
-----END PGP SIGNATURE-----
Merge tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core:
- bpf:
- allow bpf programs calling kernel functions (initially to
reuse TCP congestion control implementations)
- enable task local storage for tracing programs - remove the
need to store per-task state in hash maps, and allow tracing
programs access to task local storage previously added for
BPF_LSM
- add bpf_for_each_map_elem() helper, allowing programs to walk
all map elements in a more robust and easier to verify fashion
- sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT
redirection
- lpm: add support for batched ops in LPM trie
- add BTF_KIND_FLOAT support - mostly to allow use of BTF on
s390 which has floats in its headers files
- improve BPF syscall documentation and extend the use of kdoc
parsing scripts we already employ for bpf-helpers
- libbpf, bpftool: support static linking of BPF ELF files
- improve support for encapsulation of L2 packets
- xdp: restructure redirect actions to avoid a runtime lookup,
improving performance by 4-8% in microbenchmarks
- xsk: build skb by page (aka generic zerocopy xmit) - improve
performance of software AF_XDP path by 33% for devices which don't
need headers in the linear skb part (e.g. virtio)
- nexthop: resilient next-hop groups - improve path stability on
next-hops group changes (incl. offload for mlxsw)
- ipv6: segment routing: add support for IPv4 decapsulation
- icmp: add support for RFC 8335 extended PROBE messages
- inet: use bigger hash table for IP ID generation
- tcp: deal better with delayed TX completions - make sure we don't
give up on fast TCP retransmissions only because driver is slow in
reporting that it completed transmitting the original
- tcp: reorder tcp_congestion_ops for better cache locality
- mptcp:
- add sockopt support for common TCP options
- add support for common TCP msg flags
- include multiple address ids in RM_ADDR
- add reset option support for resetting one subflow
- udp: GRO L4 improvements - improve 'forward' / 'frag_list'
co-existence with UDP tunnel GRO, allowing the first to take place
correctly even for encapsulated UDP traffic
- micro-optimize dev_gro_receive() and flow dissection, avoid
retpoline overhead on VLAN and TEB GRO
- use less memory for sysctls, add a new sysctl type, to allow using
u8 instead of "int" and "long" and shrink networking sysctls
- veth: allow GRO without XDP - this allows aggregating UDP packets
before handing them off to routing, bridge, OvS, etc.
- allow specifing ifindex when device is moved to another namespace
- netfilter:
- nft_socket: add support for cgroupsv2
- nftables: add catch-all set element - special element used to
define a default action in case normal lookup missed
- use net_generic infra in many modules to avoid allocating
per-ns memory unnecessarily
- xps: improve the xps handling to avoid potential out-of-bound
accesses and use-after-free when XPS change race with other
re-configuration under traffic
- add a config knob to turn off per-cpu netdev refcnt to catch
underflows in testing
Device APIs:
- add WWAN subsystem to organize the WWAN interfaces better and
hopefully start driving towards more unified and vendor-
independent APIs
- ethtool:
- add interface for reading IEEE MIB stats (incl. mlx5 and bnxt
support)
- allow network drivers to dump arbitrary SFP EEPROM data,
current offset+length API was a poor fit for modern SFP which
define EEPROM in terms of pages (incl. mlx5 support)
- act_police, flow_offload: add support for packet-per-second
policing (incl. offload for nfp)
- psample: add additional metadata attributes like transit delay for
packets sampled from switch HW (and corresponding egress and
policy-based sampling in the mlxsw driver)
- dsa: improve support for sandwiched LAGs with bridge and DSA
- netfilter:
- flowtable: use direct xmit in topologies with IP forwarding,
bridging, vlans etc.
- nftables: counter hardware offload support
- Bluetooth:
- improvements for firmware download w/ Intel devices
- add support for reading AOSP vendor capabilities
- add support for virtio transport driver
- mac80211:
- allow concurrent monitor iface and ethernet rx decap
- set priority and queue mapping for injected frames
- phy: add support for Clause-45 PHY Loopback
- pci/iov: add sysfs MSI-X vector assignment interface to distribute
MSI-X resources to VFs (incl. mlx5 support)
New hardware/drivers:
- dsa: mv88e6xxx: add support for Marvell mv88e6393x - 11-port
Ethernet switch with 8x 1-Gigabit Ethernet and 3x 10-Gigabit
interfaces.
- dsa: support for legacy Broadcom tags used on BCM5325, BCM5365 and
BCM63xx switches
- Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches
- ath11k: support for QCN9074 a 802.11ax device
- Bluetooth: Broadcom BCM4330 and BMC4334
- phy: Marvell 88X2222 transceiver support
- mdio: add BCM6368 MDIO mux bus controller
- r8152: support RTL8153 and RTL8156 (USB Ethernet) chips
- mana: driver for Microsoft Azure Network Adapter (MANA)
- Actions Semi Owl Ethernet MAC
- can: driver for ETAS ES58X CAN/USB interfaces
Pure driver changes:
- add XDP support to: enetc, igc, stmmac
- add AF_XDP support to: stmmac
- virtio:
- page_to_skb() use build_skb when there's sufficient tailroom
(21% improvement for 1000B UDP frames)
- support XDP even without dedicated Tx queues - share the Tx
queues with the stack when necessary
- mlx5:
- flow rules: add support for mirroring with conntrack, matching
on ICMP, GTP, flex filters and more
- support packet sampling with flow offloads
- persist uplink representor netdev across eswitch mode changes
- allow coexistence of CQE compression and HW time-stamping
- add ethtool extended link error state reporting
- ice, iavf: support flow filters, UDP Segmentation Offload
- dpaa2-switch:
- move the driver out of staging
- add spanning tree (STP) support
- add rx copybreak support
- add tc flower hardware offload on ingress traffic
- ionic:
- implement Rx page reuse
- support HW PTP time-stamping
- octeon: support TC hardware offloads - flower matching on ingress
and egress ratelimitting.
- stmmac:
- add RX frame steering based on VLAN priority in tc flower
- support frame preemption (FPE)
- intel: add cross time-stamping freq difference adjustment
- ocelot:
- support forwarding of MRP frames in HW
- support multiple bridges
- support PTP Sync one-step timestamping
- dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like
learning, flooding etc.
- ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350,
SC7280 SoCs)
- mt7601u: enable TDLS support
- mt76:
- add support for 802.3 rx frames (mt7915/mt7615)
- mt7915 flash pre-calibration support
- mt7921/mt7663 runtime power management fixes"
* tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2451 commits)
net: selftest: fix build issue if INET is disabled
net: netrom: nr_in: Remove redundant assignment to ns
net: tun: Remove redundant assignment to ret
net: phy: marvell: add downshift support for M88E1240
net: dsa: ksz: Make reg_mib_cnt a u8 as it never exceeds 255
net/sched: act_ct: Remove redundant ct get and check
icmp: standardize naming of RFC 8335 PROBE constants
bpf, selftests: Update array map tests for per-cpu batched ops
bpf: Add batched ops support for percpu array
bpf: Implement formatted output helpers with bstr_printf
seq_file: Add a seq_bprintf function
sfc: adjust efx->xdp_tx_queue_count with the real number of initialized queues
net:nfc:digital: Fix a double free in digital_tg_recv_dep_req
net: fix a concurrency bug in l2tp_tunnel_register()
net/smc: Remove redundant assignment to rc
mpls: Remove redundant assignment to err
llc2: Remove redundant assignment to rc
net/tls: Remove redundant initialization of record
rds: Remove redundant assignment to nr_sig
dt-bindings: net: mdio-gpio: add compatible for microchip,mdio-smi0
...
In case ethernet driver is enabled and INET is disabled, selftest will
fail to build.
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Fixes: 3e1e58d64c ("net: add generic selftest support")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210428130947.29649-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Add support for the catch-all set element. This special element
can be used to define a default action to be applied in case that
the set lookup returns no matching element.
2) Fix incorrect #ifdef dependencies in the nftables cgroupsv2
support, from Arnd Bergmann.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Free skb->cb usage in core driver and let device drivers decide to
use or not. The reason having a DSA_SKB_CB(skb)->clone was because
dsa_skb_tx_timestamp() which may set the clone pointer was called
before p->xmit() which would use the clone if any, and the device
driver has no way to initialize the clone pointer.
This patch just put memset(skb->cb, 0, sizeof(skb->cb)) at beginning
of dsa_slave_xmit(). Some new features in the future, like one-step
timestamp may need more bytes of skb->cb to use in
dsa_skb_tx_timestamp(), and p->xmit().
Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It was a waste to clone skb directly in dsa_skb_tx_timestamp().
For one-step timestamping, a clone was not needed. For any failure of
port_txtstamp (this may usually happen), the skb clone had to be freed.
So this patch moves skb cloning for tx timestamp out of dsa core, and
let drivers clone skb in port_txtstamp if they really need.
Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
Tested-by: Kurt Kanzenbach <kurt@linutronix.de>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move ptp_classify_raw out of dsa core driver for handling tx
timestamp request. Let device drivers do this if they want.
Not all drivers want to limit tx timestamping for only PTP
packet.
Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
Tested-by: Kurt Kanzenbach <kurt@linutronix.de>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPv6 Multicast Router Advertisements parsing has the following two
issues:
For one thing, ICMPv6 MRD Advertisements are smaller than ICMPv6 MLD
messages (ICMPv6 MRD Adv.: 8 bytes vs. ICMPv6 MLDv1/2: >= 24 bytes,
assuming MLDv2 Reports with at least one multicast address entry).
When ipv6_mc_check_mld_msg() tries to parse an Multicast Router
Advertisement its MLD length check will fail - and it will wrongly
return -EINVAL, even if we have a valid MRD Advertisement. With the
returned -EINVAL the bridge code will assume a broken packet and will
wrongly discard it, potentially leading to multicast packet loss towards
multicast routers.
The second issue is the MRD header parsing in
br_ip6_multicast_mrd_rcv(): It wrongly checks for an ICMPv6 header
immediately after the IPv6 header (IPv6 next header type). However
according to RFC4286, section 2 all MRD messages contain a Router Alert
option (just like MLD). So instead there is an IPv6 Hop-by-Hop option
for the Router Alert between the IPv6 and ICMPv6 header, again leading
to the bridge wrongly discarding Multicast Router Advertisements.
To fix these two issues, introduce a new return value -ENODATA to
ipv6_mc_check_mld() to indicate a valid ICMPv6 packet with a hop-by-hop
option which is not an MLD but potentially an MRD packet. This also
simplifies further parsing in the bridge code, as ipv6_mc_check_mld()
already fully checks the ICMPv6 header and hop-by-hop option.
These issues were found and fixed with the help of the mrdisc tool
(https://github.com/troglobit/mrdisc).
Fixes: 4b3087c7e3 ("bridge: Snoop Multicast Router Advertisements")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAmCHJJAACgkQ+7dXa6fL
C2uv0A//S/sJyToPtj3xbzmRVmSGGWFYNRMaxBD2gYAq7swbDNiX4ZbBCe8A4FBY
zedeMfoNztHIRB2M9vvnhG4HJWXPKq2BaT0xzeteCcmZ65b5zBOrAXue0PQPqE20
xmK1RDls/y5Y2FaF92Ay0VZzXW7+y/M+RRSo+FCFzrIgpJrPprTnlZigrECYauGJ
Qdsv26rQ0flK6tyi6GVuWZIMvpINCt3WwpwQTkAUewz2VewA1tZ1xFe70sP0vF7R
MJNaS6A4uJmvoJJzb8rqdnBGiu76+TxmPaXn0IZKJBECZjBVJyk/duce0jgqbQ7C
PZz5j4C2xrPyu3Y98joj37HPEAHCy0DPRx2Es1mz5cHPzI1TDRClHzPrxyycz9gr
D9WnMiPj9ff9aDaV6XpWKyuHhPxaHpoOD3VGdrhx6bU19Jd3/mLHB3lSt1kJzWdg
QrSAk3KzMWAZigz/+I5xetOpbygKTPLEYgpdmdOSTrtACcm1wjnhIougu0FUIWXK
arPNFOIV9liN0qCQyDOcLx4UEcxXrb2W0AYeHHJDBFxJ7sT2WWUCjPZFW5bh3G+Y
goKv/XJRVWJxFlTXLZLZ5siclzzIlAAmSylh661ji836yRhqTQ3NJTB8QfnrGGsZ
QlD1hjpyqC8uwIGUvoh56KdLRTxj9Gj70gpVe/Lk3Z16mivqDUE=
=fSr0
-----END PGP SIGNATURE-----
Merge tag 'afs-netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS updates from David Howells:
"Use the new netfs lib.
Begin the process of overhauling the use of the fscache API by AFS and
the introduction of support for features such as Transparent Huge
Pages (THPs).
- Add some support for THPs, including using core VM helper functions
to find details of pages.
- Use the ITER_XARRAY I/O iterator to mediate access to the pagecache
as this handles THPs and doesn't require allocation of large bvec
arrays.
- Delegate address_space read/pre-write I/O methods for AFS to the
netfs helper library. A method is provided to the library that
allows it to issue a read against the server.
This includes a change in use for PG_fscache (it now indicates a
DIO write in progress from the marked page), so a number of waits
need to be deployed for it.
- Split the core AFS writeback function to make it easier to modify
in future patches to handle writing to the cache. [This might
feasibly make more sense moved out into my fscache-iter branch].
I've tested these with "xfstests -g quick" against an AFS volume
(xfstests needs patching to make it work). With this, AFS without a
cache passes all expected xfstests; with a cache, there's an extra
failure, but that's also there before these patches. Fixing that
probably requires a greater overhaul (as can be found on my
fscache-iter branch, but that's for a later time).
Thanks should go to Marc Dionne and Jeff Altman of AuriStor for
exercising the patches in their test farm also"
Link: https://lore.kernel.org/lkml/3785063.1619482429@warthog.procyon.org.uk/
* tag 'afs-netfs-lib-20210426' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Use the netfs_write_begin() helper
afs: Use new netfs lib read helper API
afs: Use the fs operation ops to handle FetchData completion
afs: Prepare for use of THPs
afs: Extract writeback extension into its own function
afs: Wait on PG_fscache before modifying/releasing a page
afs: Use ITER_XARRAY for writing
afs: Set up the iov_iter before calling afs_extract_data()
afs: Log remote unmarshalling errors
afs: Don't truncate iter during data fetch
afs: Move key to afs_read struct
afs: Print the operation debug_id when logging an unexpected data version
afs: Pass page into dirty region helpers to provide THP size
afs: Disable use of the fscache I/O routines
This patch extends the set infrastructure to add a special catch-all set
element. If the lookup fails to find an element (or range) in the set,
then the catch-all element is selected. Users can specify a mapping,
expression(s) and timeout to be attached to the catch-all element.
This patch adds a catchall list to the set, this list might contain more
than one single catch-all element (e.g. in case that the catch-all
element is removed and a new one is added in the same transaction).
However, most of the time, there will be either one element or no
elements at all in this list.
The catch-all element is identified via NFT_SET_ELEM_CATCHALL flag and
such special element has no NFTA_SET_ELEM_KEY attribute. There is a new
nft_set_elem_catchall object that stores a reference to the dummy
catch-all element (catchall->elem) whose layout is the same of the set
element type to reuse the existing set element codebase.
The set size does not apply to the catch-all element, users can define a
catch-all element even if the set is full.
The check for valid set element flags hava been updates to report
EOPNOTSUPP in case userspace requests flags that are not supported when
using new userspace nftables and old kernel.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) The various ip(6)table_foo incarnations are updated to expect
that the table is passed as 'void *priv' argument that netfilter core
passes to the hook functions. This reduces the struct net size by 2
cachelines on x86_64. From Florian Westphal.
2) Add cgroupsv2 support for nftables.
3) Fix bridge log family merge into nf_log_syslog: Missing
unregistration from netns exit path, from Phil Sutter.
4) Add nft_pernet() helper to access nftables pernet area.
5) Add struct nfnl_info to reduce nfnetlink callback footprint and
to facilite future updates. Consolidate nfnetlink callbacks.
6) Add CONFIG_NETFILTER_XTABLES_COMPAT Kconfig knob, also from Florian.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Here is the big set of tty and serial driver updates for 5.13-rc1.
Actually busy this release, with a number of cleanups happening:
- much needed core tty cleanups by Jiri Slaby
- removal of unused and orphaned old-style serial drivers. If
anyone shows up with this hardware, it is trivial to restore
these but we really do not think they are in use anymore.
- fixes and cleanups from Johan Hovold on a number of termios
setting corner cases that loads of drivers got wrong as well
as removing unneeded code due to tty core changes from long
ago that were never propagated out to the drivers
- loads of platform-specific serial port driver updates and
fixes
- coding style cleanups and other small fixes and updates all
over the tty/serial tree.
All of these have been in linux-next for a while now with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCYIa3NQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykMXgCfX3FZgKveI4l94ChXSy4OyKwycHUAn00BzrMC
/7BwA1FnjQnC4zSzuHnm
=bAas
-----END PGP SIGNATURE-----
Merge tag 'tty-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty and serial driver updates from Greg KH:
"Here is the big set of tty and serial driver updates for 5.13-rc1.
Actually busy this release, with a number of cleanups happening:
- much needed core tty cleanups by Jiri Slaby
- removal of unused and orphaned old-style serial drivers. If anyone
shows up with this hardware, it is trivial to restore these but we
really do not think they are in use anymore.
- fixes and cleanups from Johan Hovold on a number of termios setting
corner cases that loads of drivers got wrong as well as removing
unneeded code due to tty core changes from long ago that were never
propagated out to the drivers
- loads of platform-specific serial port driver updates and fixes
- coding style cleanups and other small fixes and updates all over
the tty/serial tree.
All of these have been in linux-next for a while now with no reported
issues"
* tag 'tty-5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (186 commits)
serial: extend compile-test coverage
serial: stm32: add FIFO threshold configuration
dt-bindings: serial: 8250: update TX FIFO trigger level
dt-bindings: serial: stm32: override FIFO threshold properties
dt-bindings: serial: add RX and TX FIFO properties
serial: xilinx_uartps: drop low-latency workaround
serial: vt8500: drop low-latency workaround
serial: timbuart: drop low-latency workaround
serial: sunsu: drop low-latency workaround
serial: sifive: drop low-latency workaround
serial: txx9: drop low-latency workaround
serial: sa1100: drop low-latency workaround
serial: rp2: drop low-latency workaround
serial: rda: drop low-latency workaround
serial: owl: drop low-latency workaround
serial: msm_serial: drop low-latency workaround
serial: mpc52xx_uart: drop low-latency workaround
serial: meson: drop low-latency workaround
serial: mcf: drop low-latency workaround
serial: lpc32xx_hs: drop low-latency workaround
...
devlink external port attribute for SF (Sub-Function) port flavour
This adds the support to instantiate Sub-Functions on external hosts
E.g when Eswitch manager is enabled on the ARM SmarNic SoC CPU, users
are now able to spawn new Sub-Functions on the Host server CPU.
Parav Pandit Says:
==================
This series introduces and uses external attribute for the SF port to
indicate that a SF port belongs to an external controller.
This is needed to generate unique phys_port_name when PF and SF numbers
are overlapping between local and external controllers.
For example two controllers 0 and 1, both of these controller have a SF.
having PF number 0, SF number 77. Here, phys_port_name has duplicate
entry which doesn't have controller number in it.
Hence, add controller number optionally when a SF port is for an
external controller. This extension is similar to existing PF and VF
eswitch ports of the external controller.
When a SF is for external controller an example view of external SF
port and config sequence:
On eswitch system:
$ devlink dev eswitch set pci/0033:01:00.0 mode switchdev
$ devlink port show
pci/0033:01:00.0/196607: type eth netdev enP51p1s0f0np0 flavour physical port 0 splittable false
pci/0033:01:00.0/131072: type eth netdev eth0 flavour pcipf controller 1 pfnum 0 external true splittable false
function:
hw_addr 00:00:00:00:00:00
$ devlink port add pci/0033:01:00.0 flavour pcisf pfnum 0 sfnum 77 controller 1
pci/0033:01:00.0/163840: type eth netdev eth1 flavour pcisf controller 1 pfnum 0 sfnum 77 splittable false
function:
hw_addr 00:00:00:00:00:00 state inactive opstate detached
phys_port_name construction:
$ cat /sys/class/net/eth1/phys_port_name
c1pf0sf77
Patch summary:
First 3 patches prepares the eswitch to handle vports in more generic
way using xarray to lookup vport from its unique vport number.
Patch-1 returns maximum eswitch ports only when eswitch is enabled
Patch-2 prepares eswitch to return eswitch max ports from a struct
Patch-3 uses xarray for vport and representor lookup
Patch-4 considers SF for an additioanl range of SF vports
Patch-5 relies on SF hw table to check SF support
Patch-6 extends SF devlink port attribute for external flag
Patch-7 stores the per controller SF allocation attributes
Patch-8 uses SF function id for filtering events
Patch-9 uses helper for allocation and free
Patch-10 splits hw table into per controller table and generic one
Patch-11 extends sf table for additional range
==================
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAmCDz80ACgkQSD+KveBX
+j4LmggAwS9otYoo639Kmow/wlMZ6yyLsH02zVMFLEJ2AE4VbL73i4iiQ67ZWygL
yQ8HawPAnythx4RsN/M6/WjSKRpdqTC27C9CpdM78zhXb1vnOrlzba7rYngqmo7N
5fIkGyjsUGHNqq+15SftK7JYbXFTe1b5RdWawXkQoyBlXTTBamyxD7C5NMpoDots
/e88Bs8Zy5nVPZqPchIId8TZEKKuO/heTz8ks6q6s/t1MGj7QP+ddxVMgNg00NR5
OpNTr7YYdpHxpfLSUZgdHaptwwKOx+nou8LdJkIKWPs7SHX6HDggyZJjGBOEWtE7
qG7oSip4olOTM0w9PZrAewLwSYhq7Q==
=oqBr
-----END PGP SIGNATURE-----
Merge tag 'mlx5-updates-2021-04-21' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2021-04-21
devlink external port attribute for SF (Sub-Function) port flavour
This adds the support to instantiate Sub-Functions on external hosts
E.g when Eswitch manager is enabled on the ARM SmarNic SoC CPU, users
are now able to spawn new Sub-Functions on the Host server CPU.
Parav Pandit Says:
==================
This series introduces and uses external attribute for the SF port to
indicate that a SF port belongs to an external controller.
This is needed to generate unique phys_port_name when PF and SF numbers
are overlapping between local and external controllers.
For example two controllers 0 and 1, both of these controller have a SF.
having PF number 0, SF number 77. Here, phys_port_name has duplicate
entry which doesn't have controller number in it.
Hence, add controller number optionally when a SF port is for an
external controller. This extension is similar to existing PF and VF
eswitch ports of the external controller.
When a SF is for external controller an example view of external SF
port and config sequence:
On eswitch system:
$ devlink dev eswitch set pci/0033:01:00.0 mode switchdev
$ devlink port show
pci/0033:01:00.0/196607: type eth netdev enP51p1s0f0np0 flavour physical port 0 splittable false
pci/0033:01:00.0/131072: type eth netdev eth0 flavour pcipf controller 1 pfnum 0 external true splittable false
function:
hw_addr 00:00:00:00:00:00
$ devlink port add pci/0033:01:00.0 flavour pcisf pfnum 0 sfnum 77 controller 1
pci/0033:01:00.0/163840: type eth netdev eth1 flavour pcisf controller 1 pfnum 0 sfnum 77 splittable false
function:
hw_addr 00:00:00:00:00:00 state inactive opstate detached
phys_port_name construction:
$ cat /sys/class/net/eth1/phys_port_name
c1pf0sf77
Patch summary:
First 3 patches prepares the eswitch to handle vports in more generic
way using xarray to lookup vport from its unique vport number.
Patch-1 returns maximum eswitch ports only when eswitch is enabled
Patch-2 prepares eswitch to return eswitch max ports from a struct
Patch-3 uses xarray for vport and representor lookup
Patch-4 considers SF for an additioanl range of SF vports
Patch-5 relies on SF hw table to check SF support
Patch-6 extends SF devlink port attribute for external flag
Patch-7 stores the per controller SF allocation attributes
Patch-8 uses SF function id for filtering events
Patch-9 uses helper for allocation and free
Patch-10 splits hw table into per controller table and generic one
Patch-11 extends sf table for additional range
==================
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
No longer needed, table pointer arg is now passed via netfilter core.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ebtables stores the table internal data (what gets passed to the
ebt_do_table() interpreter) in struct net.
nftables keeps the internal interpreter format in pernet lists
and passes it via the netfilter core infrastructure (priv pointer).
Do the same for ebtables: the nf_hook_ops are duplicated via kmemdup,
then the ops->priv pointer is set to the table that is being registered.
After that, the netfilter core passes this table info to the hookfn.
This allows to remove the pointers from struct net.
Same pattern can be applied to ip/ip6/arptables.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
When I changed defrag hooks to no longer get registered by default I
intentionally made it so that registration can only be un-done by unloading
the nf_defrag_ipv4/6 module.
In hindsight this was too conservative; there is no reason to keep defrag
on while there is no feature dependency anymore.
Moreover, this won't work if user isn't allowed to remove nf_defrag module.
This adds the disable() functions for both ipv4 and ipv6 and calls them
from conntrack, TPROXY and the xtables socket module.
ipvs isn't converted here, it will behave as before this patch and
will need module removal.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-04-23
The following pull-request contains BPF updates for your *net-next* tree.
We've added 69 non-merge commits during the last 22 day(s) which contain
a total of 69 files changed, 3141 insertions(+), 866 deletions(-).
The main changes are:
1) Add BPF static linker support for extern resolution of global, from Andrii.
2) Refine retval for bpf_get_task_stack helper, from Dave.
3) Add a bpf_snprintf helper, from Florent.
4) A bunch of miscellaneous improvements from many developers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Extended SF port attributes to have optional external flag similar to
PCI PF and VF port attributes.
External atttibute is required to generate unique phys_port_name when PF number
and SF number are overlapping between two controllers similar to SR-IOV
VFs.
When a SF is for external controller an example view of external SF
port and config sequence.
On eswitch system:
$ devlink dev eswitch set pci/0033:01:00.0 mode switchdev
$ devlink port show
pci/0033:01:00.0/196607: type eth netdev enP51p1s0f0np0 flavour physical port 0 splittable false
pci/0033:01:00.0/131072: type eth netdev eth0 flavour pcipf controller 1 pfnum 0 external true splittable false
function:
hw_addr 00:00:00:00:00:00
$ devlink port add pci/0033:01:00.0 flavour pcisf pfnum 0 sfnum 77 controller 1
pci/0033:01:00.0/163840: type eth netdev eth1 flavour pcisf controller 1 pfnum 0 sfnum 77 splittable false
function:
hw_addr 00:00:00:00:00:00 state inactive opstate detached
phys_port_name construction:
$ cat /sys/class/net/eth1/phys_port_name
c1pf0sf77
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Steffen Klassert says:
====================
pull request (net-next): ipsec-next 2021-04-23
1) The SPI flow key in struct flowi has no consumers,
so remove it. From Florian Westphal.
2) Remove stray synchronize_rcu from xfrm_init.
From Florian Westphal.
3) Use the new exit_pre hook to reset the netlink socket
on net namespace destruction. From Florian Westphal.
4) Remove an unnecessary get_cpu() in ipcomp, that
code is always called with BHs off.
From Sabrina Dubroca.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Some combinations of tag protocols and Ethernet controllers are
incompatible, and it is hard for the driver to keep track of these.
Therefore, allow the device tree author (typically the board vendor)
to inform the driver of this fact by selecting an alternate protocol
that is known to work.
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* set sk_pacing_shift for 802.3->802.11 encap offload
* some monitor support for 802.11->802.3 decap offload
* HE (802.11ax) spec updates
* userspace API for TDLS HE support
* along with various other small features, cleanups and
fixups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAmB+4y8ACgkQB8qZga/f
l8SiGw/9Fz3XETnNDYMvyY7ppmWzZ6vofRq307YJiCz1fszEKqwyyzMQOrHA9tg2
Nasl711egWlVyHTBCN+VCSaTQUjkODsK/5t4XWoxdJ0J3lZkgryVGBJljpl+k4A6
11qpvwUnO1WCmt0s49V2yU/jWgZ9itHfu9dosu/YIq+NfXUVA7ylKmP3gqfmcCeV
631z5AnM8/9N8QVMpnk5F2fE57WUXbA+KdVsw0LXMmjXYSsQ9MyTBX/lRDVcaMWV
7cOtHekkzD0MVfsOoBVvsJl+bybBgEPOfZn2Kt22Rh4JzAch/uUhwRQGzsGxcR3p
D8W9BABXCU8C5mhP8gcKlOSuH3h7ydKKqrXXNeRO+y5hymOtUSGJxia93m+uQ8qC
97wootP3cb97/dEzv5cWqw5Pa39uEsny6mQqueD5WcMI9imL98HEo3hrZElbctx8
s9ZE37WAlZ0zw+cGIsmElZfE2qMqEhjxF3mGFcpXLkk9/Y/1jmypYopkBLJh6KcS
mIfwk9qWgADbPT5df1A/1388lMkjBRcQGc1SriYxy/olvb70mD8IPPiDSD2kULDt
Sq2frnOdvjW0Q5DB6jBKzdMudAxY3WP5MlcGDy1iYwEbY6s4lPfQXG48joJpRQFG
I3zPM6Z+Pimx7vcTd5a+IUyKvDoF+DtxiOu8DGKYT2M5tv3/tpI=
=b0NQ
-----END PGP SIGNATURE-----
Merge tag 'mac80211-next-for-net-next-2021-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
Another set of updates, all over the map:
* set sk_pacing_shift for 802.3->802.11 encap offload
* some monitor support for 802.11->802.3 decap offload
* HE (802.11ax) spec updates
* userspace API for TDLS HE support
* along with various other small features, cleanups and
fixups
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Most of generic selftest should be able to work with probably all ethernet
controllers. The DSA switches are not exception, so enable it by default at
least for DSA.
This patch was tested with SJA1105 and AR9331.
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Port some parts of the stmmac selftest and reuse it as basic generic selftest
library. This patch was tested with following combinations:
- iMX6DL FEC -> AT8035
- iMX6DL FEC -> SJA1105Q switch -> KSZ8081
- iMX6DL FEC -> SJA1105Q switch -> KSZ9031
- AR9331 ag71xx -> AR9331 PHY
- AR9331 ag71xx -> AR9331 switch -> AR9331 PHY
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Add vlan match and pop actions to the flowtable offload,
patches from wenxu.
2) Reduce size of the netns_ct structure, which itself is
embedded in struct net Make netns_ct a read-mostly structure.
Patches from Florian Westphal.
3) Add FLOW_OFFLOAD_XMIT_UNSPEC to skip dst check from garbage
collector path, as required by the tc CT action. From Roi Dayan.
4) VLAN offload fixes for nftables: Allow for matching on both s-vlan
and c-vlan selectors. Fix match of VLAN id due to incorrect
byteorder. Add a new routine to properly populate flow dissector
ethertypes.
5) Missing keys in ip{6}_route_me_harder() results in incorrect
routes. This includes an update for selftest infra. Patches
from Ido Schimmel.
6) Add counter hardware offload support through FLOW_CLS_STATS.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
xfrm session decode ipv4 path (but not ipv6) sets this, but there are no
consumers. Remove it.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
This will allow the low level driver to get the wdev during
the add_interface flow.
In order to do that, remove a few checks from there and do
not return NULL for vifs that were not yet added to the
driver. Note that all the current callers of this helper
function assume that the vif already exists:
- The callers from the drivers already have a vif pointer.
Before this change, ieee80211_vif_to_wdev would return NULL
in some cases, but those callers don't even check they
get a non-NULL pointer from ieee80211_vif_to_wdev.
- The callers from net/mac80211/cfg.c assume the vif is
already added to the driver as well.
So, this change has no impact on existing callers of this
helper function.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20210409123755.6078d3517095.I1907a45f267a62dab052bcc44428aa7a2005ffc9@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The htmldoc produces this warning which was introduced
bu the commit below.
include/net/cfg80211.h:6643: warning: expecting prototype for wiphy_rfkill_set_hw_state().
Prototype was for wiphy_rfkill_set_hw_state_reason() instead
Fixes: 6f779a66dc ("cfg80211: allow specifying a reason for hw_rfkill")
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Link: https://lore.kernel.org/r/20210413113850.59098-1-emmanuel.grumbach@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
This patch adds the .offload_stats operation to synchronize hardware
stats with the expression data. Update the counter expression to use
this new interface. The hardware stats are retrieved from the netlink
dump path via FLOW_CLS_STATS command to the driver.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The flow dissector representation expects the VLAN id in host byteorder.
Add the NFT_OFFLOAD_F_NETWORK2HOST flag to swap the bytes from nft_cmp.
Fixes: a82055af59 ("netfilter: nft_payload: add VLAN offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
- add another struct flow_dissector_key_vlan for C-VLAN
- update layer 3 dependency to allow to match on IPv4/IPv6
Fixes: 89d8fd44ab ("netfilter: nft_payload: add C-VLAN offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
As explained in bugfix commit 6ab4c3117a ("net: bridge: don't notify
switchdev for local FDB addresses") as well as in this discussion:
https://lore.kernel.org/netdev/20210117193009.io3nungdwuzmo5f7@skbuf/
the switchdev notifiers for FDB entries managed to have a zero-day bug,
which was that drivers would not know what to do with local FDB entries,
because they were not told that they are local. The bug fix was to
simply not notify them of those addresses.
Let us now add the 'is_local' bit to bridge FDB entries, and make all
drivers ignore these entries by their own choice.
Co-developed-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
of_get_mac_address() returns a "const void*" pointer to a MAC address.
Lately, support to fetch the MAC address by an NVMEM provider was added.
But this will only work with platform devices. It will not work with
PCI devices (e.g. of an integrated root complex) and esp. not with DSA
ports.
There is an of_* variant of the nvmem binding which works without
devices. The returned data of a nvmem_cell_read() has to be freed after
use. On the other hand the return of_get_mac_address() points to some
static data without a lifetime. The trick for now, was to allocate a
device resource managed buffer which is then returned. This will only
work if we have an actual device.
Change it, so that the caller of of_get_mac_address() has to supply a
buffer where the MAC address is written to. Unfortunately, this will
touch all drivers which use the of_get_mac_address().
Usually the code looks like:
const char *addr;
addr = of_get_mac_address(np);
if (!IS_ERR(addr))
ether_addr_copy(ndev->dev_addr, addr);
This can then be simply rewritten as:
of_get_mac_address(np, ndev->dev_addr);
Sometimes is_valid_ether_addr() is used to test the MAC address.
of_get_mac_address() already makes sure, it just returns a valid MAC
address. Thus we can just test its return code. But we have to be
careful if there are still other sources for the MAC address before the
of_get_mac_address(). In this case we have to keep the
is_valid_ether_addr() call.
The following coccinelle patch was used to convert common cases to the
new style. Afterwards, I've manually gone over the drivers and fixed the
return code variable: either used a new one or if one was already
available use that. Mansour Moufid, thanks for that coccinelle patch!
<spml>
@a@
identifier x;
expression y, z;
@@
- x = of_get_mac_address(y);
+ x = of_get_mac_address(y, z);
<...
- ether_addr_copy(z, x);
...>
@@
identifier a.x;
@@
- if (<+... x ...+>) {}
@@
identifier a.x;
@@
if (<+... x ...+>) {
...
}
- else {}
@@
identifier a.x;
expression e;
@@
- if (<+... x ...+>@e)
- {}
- else
+ if (!(e))
{...}
@@
expression x, y, z;
@@
- x = of_get_mac_address(y, z);
+ of_get_mac_address(y, z);
... when != x
</spml>
All drivers, except drivers/net/ethernet/aeroflex/greth.c, were
compile-time tested.
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
It could be xmit type was not set and would default to FLOW_OFFLOAD_XMIT_NEIGH
and in this type the gc expect to have a route info.
Fix that by adding FLOW_OFFLOAD_XMIT_UNSPEC which defaults to 0.
Fixes: 8b9229d158 ("netfilter: flowtable: dst_check() from garbage collector path")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
log_invalid sysctl allows values of 0 to 255 inclusive so we no longer
need a range check: the min/max values can be removed.
This also removes all member variables that were moved to net_generic
data in previous patches.
This reduces size of netns_ct struct by one cache line.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Its only needed from slowpath (sysctl, ctnetlink, gc worker) and
when a new conntrack object is allocated.
Furthermore, each write dirties the otherwise read-mostly pernet
data in struct net.ct, which are accessed from packet path.
Move it to the net_generic data. This makes struct netns_ct
read-mostly.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Creation of a new conntrack entry isn't a frequent operation (compared
to 'ct entry already exists'). Creation of a new entry that is also an
expected (related) connection even less so.
Place this counter in net_generic data.
A followup patch will also move the conntrack count -- this will make
netns_ct a read-mostly structure.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
While at it, make it an u8, no need to use an integer for a boolean.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Not accessed in fast path, place this is generic_net data instead.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch adds support for vlan_id, vlan_priority and vlan_proto match
for flowtable offload.
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Conflicts:
MAINTAINERS
- keep Chandrasekar
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
- simple fix + trust the code re-added to param.c in -next is fine
include/linux/bpf.h
- trivial
include/linux/ethtool.h
- trivial, fix kdoc while at it
include/linux/skmsg.h
- move to relevant place in tcp.c, comment re-wrapped
net/core/skmsg.c
- add the sk = sk // sk = NULL around calls
net/tipc/crypto.c
- trivial
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
DCCP is virtually never used, so no need to use space in struct net for it.
Put the pernet ipv4/v6 socket in the dccp ipv4/ipv6 modules instead.
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://lore.kernel.org/r/20210408174502.1625-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- Proper support for BCM4330 and BMC4334
- Various improvements for firmware download of Intel controllers
- Update management interface revision to 20
- Support for AOSP HCI vendor commands
- Initial Virtio support
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
-----BEGIN PGP SIGNATURE-----
iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmBvL7UZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKU+eD/4qGiwnwQVia4hjnQ5PZ0c/
sBmZvzzZE43RNRvy8gAwcVLypLaiIkEYml4ruVdF6RGoqaDA6kRB7D0tdk8EtZc4
6h72LX0zefnyZNj/Fcl6+dKuq7bSHjdOK0G+oXFmpvYfScPsI6Oq0ntpYOFYwUE0
cDdeZyb/sW4f6L5AXUgLjD9AIjzlBOJEU5HcRL/uY+E/fH1oe757/IhUDerXjTIh
9EhY28mrI7KeoEcJ1NIgvjPnRCxPRLAKgrzc8NXkUsLeNZSIywrwjgRfTvCZ+qQf
g/MmNv+n7yFlUzA7Fu2d8YAgRtgF9nZc45geqOSHqcSZURORSs0CzEoYinqwT7f0
xdMBi/UoSzJsLksgv+OfXOyoQ4lPNJ47pcH/edmT/ZJ/Ai2yTqUUSODHQ3MLC4Dp
kQj34thQvfTf5b6+9HwKmBjlfMK0QGPqWjp6dth2bzos/ugurDP+XimAv0urvFz+
NkES5kCMUs/1z7Yh+Nj1MyU3Wfdwol0wOh+PXBTAw2fxSy6kmKYGyLw7HEdWegB0
K2w/ZS58g0EGK89ZSiqlZKmXtAd22dQqS6JlZje+9YTtEtRWu+rAF1dddS8vJDjl
Rq3VSQ1B6Qw0rdGpyfxdgm/sWwq5S1YgQQcAZgVaeo9ZiIhfscVD5v5L0Ol6k9ms
3eCK/2brsX0IaV61eVnoOw==
=jiCz
-----END PGP SIGNATURE-----
Merge tag 'for-net-next-2021-04-08' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Luiz Augusto von Dentz says:
====================
bluetooth-next pull request for net-next:
- Proper support for BCM4330 and BMC4334
- Various improvements for firmware download of Intel controllers
- Update management interface revision to 20
- Support for AOSP HCI vendor commands
- Initial Virtio support
====================
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Setting iftoken can fail for several different reasons but there
and there was no report to user as to the cause. Add netlink
extended errors to the processing of the request.
This requires adding additional argument through rtnl_af_ops
set_link_af callback.
Reported-by: Hongren Zheng <li@zenithal.me>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
With recent changes that separated action module load from action
initialization tcf_action_init() function error handling code was modified
to manually release the loaded modules if loading/initialization of any
further action in same batch failed. For the case when all modules
successfully loaded and some of the actions were initialized before one of
them failed in init handler. In this case for all previous actions the
module will be released twice by the error handler: First time by the loop
that manually calls module_put() for all ops, and second time by the action
destroy code that puts the module after destroying the action.
Reproduction:
$ sudo tc actions add action simple sdata \"2\" index 2
$ sudo tc actions add action simple sdata \"1\" index 1 \
action simple sdata \"2\" index 2
RTNETLINK answers: File exists
We have an error talking to the kernel
$ sudo tc actions ls action simple
total acts 1
action order 0: Simple <"2">
index 2 ref 1 bind 0
$ sudo tc actions flush action simple
$ sudo tc actions ls action simple
$ sudo tc actions add action simple sdata \"2\" index 2
Error: Failed to load TC action module.
We have an error talking to the kernel
$ lsmod | grep simple
act_simple 20480 -1
Fix the issue by modifying module reference counting handling in action
initialization code:
- Get module reference in tcf_idr_create() and put it in tcf_idr_release()
instead of taking over the reference held by the caller.
- Modify users of tcf_action_init_1() to always release the module
reference which they obtain before calling init function instead of
assuming that created action takes over the reference.
- Finally, modify tcf_action_init_1() to not release the module reference
when overwriting existing action as this is no longer necessary since both
upper and lower layers obtain and manage their own module references
independently.
Fixes: d349f99768 ("net_sched: fix RTNL deadlock again caused by request_module()")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Action init code increments reference counter when it changes an action.
This is the desired behavior for cls API which needs to obtain action
reference for every classifier that points to action. However, act API just
needs to change the action and releases the reference before returning.
This sequence breaks when the requested action doesn't exist, which causes
act API init code to create new action with specified index, but action is
still released before returning and is deleted (unless it was referenced
concurrently by cls API).
Reproduction:
$ sudo tc actions ls action gact
$ sudo tc actions change action gact drop index 1
$ sudo tc actions ls action gact
Extend tcf_action_init() to accept 'init_res' array and initialize it with
action->ops->init() result. In tcf_action_add() remove pointers to created
actions from actions array before passing it to tcf_action_put_many().
Fixes: cae422f379 ("net: sched: use reference counting action init")
Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some HW/driver can support passing ethernet rx decap frames and
raw 802.11 frames for the monitor interface concurrently and
via separate RX calls to mac80211. Packets going to the monitor
interface(s) would be in 802.11 format and thus not have the
RX_FLAG_8023 set, and 802.11 format monitoring frames should have
RX_FLAG_ONLY_MONITOR set.
Drivers doing such can enable the SUPPORTS_CONC_MON_RX_DECAP to
allow using ethernet decap offload while a monitor interface is
active, currently RX decapsulation offload gets disabled when a
monitor interface is added.
Signed-off-by: Sriram R <srirrama@codeaurora.org>
Link: https://lore.kernel.org/r/1617068116-32253-1-git-send-email-srirrama@codeaurora.org
[add proper documentation, rewrite commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
rfkill now allows to report a reason for the hw_rfkill state.
Allow cfg80211 drivers to specify this reason.
Keep the current API to use the default reason
(RFKILL_HARD_BLOCK_SIGNAL).
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Link: https://lore.kernel.org/r/20210322204633.102581-4-emmanuel.grumbach@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Some controllers don't support the Simple Pairing Options feature that
can indicate the support for P-192 and P-256 public key validation.
However they might support the Microsoft vendor extension that can
indicate the validiation capability as well.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>