Almost all nat helpers reserve an expecation port the same way:
Try the port inidcated by the peer, then move to next port if that
port is already in use.
We can squash this into a helper.
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Follow the advice of the below link and prefer 'strscpy' in this
subsystem. Conversion is 1:1 because the return value is not used.
Generated by a coccinelle script.
Link: https://lore.kernel.org/r/CAHk-=wgfRnXz0W3D37d01q3JFkr_i_uTL=V6A6G1oUZcprmknw@mail.gmail.com/
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Florian Westphal <fw@strlen.de>
In case the endpoints and conntrack go out-of-sync, i.e. there is
disagreement wrt. validy of sequence/ack numbers between conntracks
internal state and those of the endpoints, connections can hang for a
long time (until ESTABLISHED timeout).
This adds a check to detect a fin/fin exchange even if those are
invalid. The timeout is then lowered to UNACKED (default 300s).
Signed-off-by: Florian Westphal <fw@strlen.de>
After previous patch, the conditional branch is obsolete, reformat it.
gcc generates same code as before this change.
Signed-off-by: Florian Westphal <fw@strlen.de>
If 'nf_conntrack_tcp_loose' is off (the default), tcp packets that are
outside of the current window are marked as INVALID.
nf/iptables rulesets often drop such packets via 'ct state invalid' or
similar checks.
For overly delayed acks, this can be a nuisance if such 'invalid' packets
are also logged.
Since they are not invalid in a strict sense, just ignore them, i.e.
conntrack won't extend timeout or change state so that they do not match
invalid state rules anymore.
This also avoids unwantend connection stalls in case conntrack considers
retransmission (of data that did not reach the peer) as too old.
The else branch of the conditional becomes obsolete.
Next patch will reformant the now always-true if condition.
The existing workaround for data that exceeds the calculated receive
window is adjusted to use the 'ignore' state so that these packets do
not refresh the timeout or change state other than updating ->td_end.
Signed-off-by: Florian Westphal <fw@strlen.de>
tcp_in_window returns true if the packet is in window and false if it is
not.
If its outside of window, packet will be treated as INVALID.
There are corner cases where the packet should still be tracked, because
rulesets may drop or log such packets, even though they can occur during
normal operation, such as overly delayed acks.
In extreme cases, connection may hang forever because conntrack state
differs from real state.
There is no retransmission for ACKs.
In case of ACK loss after conntrack processing, its possible that a
connection can be stuck because the actual retransmits are considered
stale ("SEQ is under the lower bound (already ACKed data
retransmitted)".
The problem is made worse by carrier-grade-nat which can also result
in stale packets from old connections to get treated as 'recent' packets
in conntrack (it doesn't support tcp timestamps at this time).
Prepare tcp_in_window() to return an enum that tells the desired
action (in-window/accept, bogus/drop).
A third action (accept the packet as in-window, but do not change
state) is added in a followup patch.
Signed-off-by: Florian Westphal <fw@strlen.de>
Saeed Mahameed says:
====================
Introduce MACsec skb_metadata_dst and mlx5 macsec offload
v1->v2:
- attach mlx5 implementation patches.
This patchset introduces MACsec skb_metadata_dst to lay the ground
for MACsec HW offload.
MACsec is an IEEE standard (IEEE 802.1AE) for MAC security.
It defines a way to establish a protocol independent connection
between two hosts with data confidentiality, authenticity and/or
integrity, using GCM-AES. MACsec operates on the Ethernet layer and
as such is a layer 2 protocol, which means it’s designed to secure
traffic within a layer 2 network, including DHCP or ARP requests.
Linux has a software implementation of the MACsec standard and
HW offloading support.
The offloading is re-using the logic, netlink API and data
structures of the existing MACsec software implementation.
For Tx:
In the current MACsec offload implementation, MACsec interfaces shares
the same MAC address by default.
Therefore, HW can't distinguish from which MACsec interface the traffic
originated from.
MACsec stack will use skb_metadata_dst to store the SCI value, which is
unique per MACsec interface, skb_metadat_dst will be used later by the
offloading device driver to associate the SKB with the corresponding
offloaded interface (SCI) to facilitate HW MACsec offload.
For Rx:
Like in the Tx changes, if there are more than one MACsec device with
the same MAC address as in the packet's destination MAC, the packet will
be forward only to one of the devices and not neccessarly to the desired one.
Offloading device driver sets the MACsec skb_metadata_dst sci
field with the appropriaate Rx SCI for each SKB so the MACsec rx handler
will know to which port to divert those skbs, instead of wrongly solely
relaying on dst MAC address comparison.
1) patch 1,2, Add support to skb_metadata_dst in MACsec code:
net/macsec: Add MACsec skb_metadata_dst Tx Data path support
net/macsec: Add MACsec skb_metadata_dst Rx Data path support
2) patch 3, Move some MACsec driver code for sharing with various
drivers that implements offload:
net/macsec: Move some code for sharing with various drivers that
implements offload
3) The rest of the patches introduce mlx5 implementation for macsec
offloads TX and RX via steering tables.
a) TX, intercept skbs with macsec offlad mark in skb_metadata_dst and mark
the descriptor for offload.
b) RX, intercept offloaded frames and prepare the proper
skb_metadata_dst to mark offloaded rx frames.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the ability to add up to 16 MACsec offload interfaces
over the same physical interface
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the following statistics:
RX successfully decrypted MACsec packets:
macsec_rx_pkts : Number of packets decrypted successfully
macsec_rx_bytes : Number of bytes decrypted successfully
Rx dropped MACsec packets:
macsec_rx_pkts_drop : Number of MACsec packets dropped
macsec_rx_bytes_drop : Number of MACsec bytes dropped
TX successfully encrypted MACsec packets:
macsec_tx_pkts : Number of packets encrypted/authenticated successfully
macsec_tx_bytes : Number of bytes encrypted/authenticated successfully
Tx dropped MACsec packets:
macsec_tx_pkts_drop : Number of MACsec packets dropped
macsec_tx_bytes_drop : Number of MACsec bytes dropped
The above can be seen using:
ethtool -S <ifc> |grep macsec
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add offload support for MACsec SecY callbacks - add/update/delete.
add_secy is called when need to create a new MACsec interface.
upd_secy is called when source MAC address or tx SC was changed.
del_secy is called when need to destroy the MACsec interface.
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
MACsec driver need to distinguish to which offload device the MACsec
is target to, in order to handle them correctly.
This can be done by attaching a metadata_dst to a SKB with a SCI,
when there is a match on MACsec rule.
To achieve that, there is a map between fs_id to SCI, so for each RX SC,
there is a unique fs_id allocated when creating RX SC.
fs_id passed to device driver as metadata for packets that passed Rx
MACsec offload to aid the driver to retrieve the matching SCI.
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rx flow steering consists of two flow tables (FTs).
The first FT (crypto table) have one default miss rule so non MACsec
offloaded packets bypass the MACSec tables.
All others flow table entries (FTEs) are divided to two equal groups
size, both of them are for MACsec packets:
The first group is for MACsec packets which contains SCI field in the
SecTAG header.
The second group is for MACsec packets which doesn't contain SCI,
where need to match on the source MAC address (only if the SCI
is built from default MACsec port).
Destination MAC address, ethertype and some of SecTAG fields
are also matched for both groups.
In case of match, invoke decrypt action on the packet.
For each MACsec Rx offloaded SA two rules are created: one with SCI
and one without SCI.
The second FT (check table) has two fixed rules:
One rule is for verifying that the previous offload actions were
finished successfully.
In this case, need to decap the SecTAG header and forward the packet
for further processing.
Another default rule for dropping packets that failed in the previous
decrypt actions.
The MACsec FTs are created on demand when the first MACsec rule is
added and destroyed when the last MACsec rule is deleted.
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add new namespace for MACsec RX flows.
Encrypted MACsec packets should be first decrypted and stripped
from MACsec header and then continues with the kernel's steering
pipeline.
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a support for Connect-X MACsec offload Rx SA & SC commands:
add, update and delete.
SCs are created on demend and aren't limited by number and unique by SCI.
Each Rx SA must be associated with Rx SC according to SCI.
Follow-up patches will implement the Rx steering.
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
MACsec driver marks Tx packets for device offload using a dedicated
skb_metadata_dst which holds a 64 bits SCI number.
A previously set rule will match on this number so the correct SA is used
for the MACsec operation.
As device driver can only provide 32 bits of metadata to flow tables,
need to used a mapping from 64 bit to 32 bits marker or id,
which is can be achieved by provide a 32 bit unique flow id in the
control path, and used a hash table to map 64 bit to the unique id in the
datapath.
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tx flow steering consists of two flow tables (FTs).
The first FT (crypto table) has two fixed rules:
One default miss rule so non MACsec offloaded packets bypass the MACSec
tables, another rule to make sure that MACsec key exchange (MKE) traffic
passes unencrypted as expected (matched of ethertype).
On each new MACsec offload flow, a new MACsec rule is added.
This rule is matched on metadata_reg_a (which contains the id of the
flow) and invokes the MACsec offload action on match.
The second FT (check table) has two fixed rules:
One rule for verifying that the previous offload actions were
finished successfully and packet need to be transmitted.
Another default rule for dropping packets that were failed in the
offload actions.
The MACsec FTs should be created on demand when the first MACsec rule is
added and destroyed when the last MACsec rule is deleted.
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changed EGRESS_KERNEL namespace to EGRESS_IPSEC and add new
namespace for MACsec TX.
This namespace should be the last namespace for transmitted packets.
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for Connect-X MACsec offload Tx SA commands:
add, update and delete.
In Connect-X MACsec, a Security Association (SA) is added or deleted
via allocating a HW context of an encryption/decryption key and
a HW context of a matching SA (MACsec object).
When new SA is added:
- Use a separate crypto key HW context.
- Create a separate MACsec context in HW to include the SA properties.
Introduce a new compilation flag MLX5_EN_MACSEC for it.
Follow-up patches will implement the Tx steering.
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add MACsec offload related IFC structs, layouts and enumerations.
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to support MACsec offload (and maybe some other crypto features
in the future), generalize flow action parameters / defines to be used by
crypto offlaods other than IPsec.
The following changes made:
ipsec_obj_id field at flow action context was changed to crypto_obj_id,
intreduced a new crypto_type field where IPsec is the default zero type
for backward compatibility.
Action ipsec_decrypt was changed to crypto_decrypt.
Action ipsec_encrypt was changed to crypto_encrypt.
IPsec offload code was updated accordingly for backward compatibility.
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
esp_id is no longer in used
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move some MACsec infrastructure like defines and functions,
in order to avoid code duplication for future drivers which
implements MACsec offload.
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Ben Ben-Ishay <benishay@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Like in the Tx changes, if there are more than one MACsec device with
the same MAC address as in the packet's destination MAC, the packet will
be forward only to this device and not neccessarly to the desired one.
Offloading device drivers will mark offloaded MACsec SKBs with the
corresponding SCI in the skb_metadata_dst so the macsec rx handler will
know to which port to divert those skbs, instead of wrongly solely
relaying on dst MAC address comparison.
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the current MACsec offload implementation, MACsec interfaces shares
the same MAC address by default.
Therefore, HW can't distinguish from which MACsec interface the traffic
originated from.
MACsec stack will use skb_metadata_dst to store the SCI value, which is
unique per Macsec interface, skb_metadat_dst will be used by the
offloading device driver to associate the SKB with the corresponding
offloaded interface (SCI).
Signed-off-by: Lior Nahmanson <liorna@nvidia.com>
Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that nla_policy allows range checks for bigendian data make use of
this to reject such attributes. At this time, reject happens later
from the init or select_ops callbacks, but its prone to errors.
In the future, new attributes can be handled via NLA_POLICY_MAX_BE
and exiting ones can be converted one by one.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
netlink allows to specify allowed ranges for integer types.
Unfortunately, nfnetlink passes integers in big endian, so the existing
NLA_POLICY_MAX() cannot be used.
At the moment, nfnetlink users, such as nf_tables, need to resort to
programmatic checking via helpers such as nft_parse_u32_check().
This is both cumbersome and error prone. This adds NLA_POLICY_MAX_BE
which adds range check support for BE16, BE32 and BE64 integers.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree says:
====================
sfc: add support for PTP over IPv6 and 802.3
Most recent cards (8000 series and newer) had enough hardware support
for this, but it was not enabled in the driver. The transmission of PTP
packets over these protocols was already added in commit bd4a2697e5
("sfc: use hardware tx timestamps for more than PTP"), but receiving
them was already unsupported so synchronization didn't happen.
These patches add support for timestamping received packets over
IPv6/UPD and IEEE802.3.
v2: fixed weird indentation in efx_ptp_init_filter
v3: fixed bug caused by usage of htons in PTP_EVENT_PORT definition.
It was used in more places, where htons was used too, so using it
2 times leave it again in host order. I didn't detected it in my
tests because it only affected if timestamping through the MC, but
the model I used do it through the MAC. Detected by kernel test
robot <lkp@intel.com>
v4: removed `inline` specifiers from 2 local functions
v5: restored deleted comment with useful explanation about packets
reordering. Deleted useless whitespaces.
====================
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous patch add support for PTP over IPv6/UDP (only for 8000
series and newer) and this one add support for PTP over 802.3.
Tested: sync as master and as slave is correct with ptp4l. PTP over IPv4
and IPv6 still works fine.
Suggested-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit bd4a2697e5 ("sfc: use hardware tx timestamps for more than
PTP") added support for hardware timestamping on TX for cards of the
8000 series and newer, in an effort to provide support for other
transports other than IPv4/UDP.
However, timestamping was still not working on RX for these other
transports. This patch add support for PTP over IPv6/UDP.
Tested: sync as master and as slave is correct using ptp4l from linuxptp
package, both with IPv4 and IPv6.
Suggested-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for the support of PTP over IPv6/UDP and Ethernet in next
patches, allow a more flexible way of adding and removing RX filters for
PTP. Right now, only 2 filters are allowed, which are the ones needed
for PTP over IPv4/UDP.
Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding support for the LAN9354 device by allowing it to use
the LAN9303 DSA driver. These devices have the same underlying
access and control methods and from a feature set point of view
the LAN9354 is a superset of the LAN9303.
The MDIO access method has been tested on a SAMA5D3-EDS board
with a LAN9354 RMII daughter card.
While the SPI access method should also be the same, it has not
been tested and as such is not included at this time.
Signed-off-by: Jerry Ray <jerry.ray@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add initial BYTE_ORDER read to sync the 32-bit accesses over the 16-bit
mdio bus to improve driver robustness.
The lan9303 expects two mdio read transactions back-to-back to read a
32-bit register. The first read transaction causes the other half of the
32-bit register to get latched. The subsequent read returns the latched
second half of the 32-bit read. The BYTE_ORDER register is an exception to
this rule. As it is a constant value, there is no need to latch the second
half. We read this register first in case there were reads during the boot
loader process that might have occurred prior to this driver taking over
ownership of accessing this device.
This patch has been tested on the SAMA5D3-EDS with a LAN9303 RMII daughter
card.
Signed-off-by: Jerry Ray <jerry.ray@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to the KSZ9477S datasheet, there is no global register
at 0x033C and 0x033D addresses.
Signed-off-by: Romain Naour <romain.naour@skf.com>
Cc: Oleksij Rempel <o.rempel@pengutronix.de>
Tested-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for the KSZ9896 6-port Gigabit Ethernet Switch to the
ksz9477 driver. The KSZ9896 supports both SPI (already in) and I2C.
Signed-off-by: Romain Naour <romain.naour@skf.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for the KSZ9896 6-port Gigabit Ethernet Switch to the
ksz9477 driver.
Although the KSZ9896 is already listed in the device tree binding
documentation since a1c0ed24fe (dt-bindings: net: dsa: document
additional Microchip KSZ9477 family switches) the chip id
(0x00989600) is not recognized by ksz_switch_detect() and rejected
by the driver.
The KSZ9896 is similar to KSZ9897 but has only one configurable
MII/RMII/RGMII/GMII cpu port.
Signed-off-by: Romain Naour <romain.naour@skf.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-09-05
The following pull-request contains BPF updates for your *net-next* tree.
We've added 106 non-merge commits during the last 18 day(s) which contain
a total of 159 files changed, 5225 insertions(+), 1358 deletions(-).
There are two small merge conflicts, resolve them as follows:
1) tools/testing/selftests/bpf/DENYLIST.s390x
Commit 27e23836ce ("selftests/bpf: Add lru_bug to s390x deny list") in
bpf tree was needed to get BPF CI green on s390x, but it conflicted with
newly added tests on bpf-next. Resolve by adding both hunks, result:
[...]
lru_bug # prog 'printk': failed to auto-attach: -524
setget_sockopt # attach unexpected error: -524 (trampoline)
cb_refs # expected error message unexpected error: -524 (trampoline)
cgroup_hierarchical_stats # JIT does not support calling kernel function (kfunc)
htab_update # failed to attach: ERROR: strerror_r(-524)=22 (trampoline)
[...]
2) net/core/filter.c
Commit 1227c1771d ("net: Fix data-races around sysctl_[rw]mem_(max|default).")
from net tree conflicts with commit 29003875bd ("bpf: Change bpf_setsockopt(SOL_SOCKET)
to reuse sk_setsockopt()") from bpf-next tree. Take the code as it is from
bpf-next tree, result:
[...]
if (getopt) {
if (optname == SO_BINDTODEVICE)
return -EINVAL;
return sk_getsockopt(sk, SOL_SOCKET, optname,
KERNEL_SOCKPTR(optval),
KERNEL_SOCKPTR(optlen));
}
return sk_setsockopt(sk, SOL_SOCKET, optname,
KERNEL_SOCKPTR(optval), *optlen);
[...]
The main changes are:
1) Add any-context BPF specific memory allocator which is useful in particular for BPF
tracing with bonus of performance equal to full prealloc, from Alexei Starovoitov.
2) Big batch to remove duplicated code from bpf_{get,set}sockopt() helpers as an effort
to reuse the existing core socket code as much as possible, from Martin KaFai Lau.
3) Extend BPF flow dissector for BPF programs to just augment the in-kernel dissector
with custom logic. In other words, allow for partial replacement, from Shmulik Ladkani.
4) Add a new cgroup iterator to BPF with different traversal options, from Hao Luo.
5) Support for BPF to collect hierarchical cgroup statistics efficiently through BPF
integration with the rstat framework, from Yosry Ahmed.
6) Support bpf_{g,s}et_retval() under more BPF cgroup hooks, from Stanislav Fomichev.
7) BPF hash table and local storages fixes under fully preemptible kernel, from Hou Tao.
8) Add various improvements to BPF selftests and libbpf for compilation with gcc BPF
backend, from James Hilliard.
9) Fix verifier helper permissions and reference state management for synchronous
callbacks, from Kumar Kartikeya Dwivedi.
10) Add support for BPF selftest's xskxceiver to also be used against real devices that
support MAC loopback, from Maciej Fijalkowski.
11) Various fixes to the bpf-helpers(7) man page generation script, from Quentin Monnet.
12) Document BPF verifier's tnum_in(tnum_range(), ...) gotchas, from Shung-Hsi Yu.
13) Various minor misc improvements all over the place.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (106 commits)
bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.
bpf: Remove usage of kmem_cache from bpf_mem_cache.
bpf: Remove prealloc-only restriction for sleepable bpf programs.
bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
bpf: Remove tracing program restriction on map types
bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
bpf: Add percpu allocation support to bpf_mem_alloc.
bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
bpf: Adjust low/high watermarks in bpf_mem_cache
bpf: Optimize call_rcu in non-preallocated hash map.
bpf: Optimize element count in non-preallocated hash map.
bpf: Relax the requirement to use preallocated hash maps in tracing progs.
samples/bpf: Reduce syscall overhead in map_perf_test.
selftests/bpf: Improve test coverage of test_maps
bpf: Convert hash map to bpf_mem_alloc.
bpf: Introduce any context BPF specific memory allocator.
selftest/bpf: Add test for bpf_getsockopt()
bpf: Change bpf_getsockopt(SOL_IPV6) to reuse do_ipv6_getsockopt()
bpf: Change bpf_getsockopt(SOL_IP) to reuse do_ip_getsockopt()
bpf: Change bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt()
...
====================
Link: https://lore.kernel.org/r/20220905161136.9150-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Sparse checker found two endianness-related issues:
.../moxart_ether.c:34:15: warning: incorrect type in assignment (different base types)
.../moxart_ether.c:34:15: expected unsigned int [usertype]
.../moxart_ether.c:34:15: got restricted __le32 [usertype]
.../moxart_ether.c:39:16: warning: cast to restricted __le32
Fix them by using __le32 type instead of u32.
Signed-off-by: Sergei Antonov <saproj@gmail.com>
Link: https://lore.kernel.org/r/20220902125037.1480268-1-saproj@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Sparse found a number of endianness-related issues of these kinds:
.../ftmac100.c:192:32: warning: restricted __le32 degrades to integer
.../ftmac100.c:208:23: warning: incorrect type in assignment (different base types)
.../ftmac100.c:208:23: expected unsigned int rxdes0
.../ftmac100.c:208:23: got restricted __le32 [usertype]
.../ftmac100.c:249:23: warning: invalid assignment: &=
.../ftmac100.c:249:23: left side has type unsigned int
.../ftmac100.c:249:23: right side has type restricted __le32
.../ftmac100.c:527:16: warning: cast to restricted __le32
Change type of some fields from 'unsigned int' to '__le32' to fix it.
Signed-off-by: Sergei Antonov <saproj@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20220902113749.1408562-1-saproj@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
We're not in a hot path and don't want to miss this message,
therefore remove the net_ratelimit() check.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for FORTIFY_SOURCE doing bounds-check on memcpy(),
switch from __nlmsg_put to nlmsg_put(), and explain the bounds check
for dealing with the memcpy() across a composite flexible array struct.
Avoids this future run-time warning:
memcpy: detected field-spanning write (size 32) of single field "&errmsg->msg" at net/netlink/af_netlink.c:2447 (size 16)
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: syzbot <syzkaller@googlegroups.com>
Cc: netfilter-devel@vger.kernel.org
Cc: coreteam@netfilter.org
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220901071336.1418572-1-keescook@chromium.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexei Starovoitov says:
====================
Introduce any context BPF specific memory allocator.
Tracing BPF programs can attach to kprobe and fentry. Hence they run in
unknown context where calling plain kmalloc() might not be safe. Front-end
kmalloc() with per-cpu cache of free elements. Refill this cache asynchronously
from irq_work.
Major achievements enabled by bpf_mem_alloc:
- Dynamically allocated hash maps used to be 10 times slower than fully
preallocated. With bpf_mem_alloc and subsequent optimizations the speed
of dynamic maps is equal to full prealloc.
- Tracing bpf programs can use dynamically allocated hash maps. Potentially
saving lots of memory. Typical hash map is sparsely populated.
- Sleepable bpf programs can used dynamically allocated hash maps.
Future work:
- Expose bpf_mem_alloc as uapi FD to be used in dynptr_alloc, kptr_alloc
- Convert lru map to bpf_mem_alloc
- Further cleanup htab code. Example: htab_use_raw_lock can be removed.
Changelog:
v5->v6:
- Debugged the reason for selftests/bpf/test_maps ooming in a small VM that BPF CI is using.
Added patch 16 that optimizes the usage of rcu_barrier-s between bpf_mem_alloc and
hash map. It drastically improved the speed of htab destruction.
v4->v5:
- Fixed missing migrate_disable in hash tab free path (Daniel)
- Replaced impossible "memory leak" with WARN_ON_ONCE (Martin)
- Dropped sysctl kernel.bpf_force_dyn_alloc patch (Daniel)
- Added Andrii's ack
- Added new patch 15 that removes kmem_cache usage from bpf_mem_alloc.
It saves memory, speeds up map create/destroy operations
while maintains hash map update/delete performance.
v3->v4:
- fix build issue due to missing local.h on 32-bit arch
- add Kumar's ack
- proposal for next steps from Delyan:
https://lore.kernel.org/bpf/d3f76b27f4e55ec9e400ae8dcaecbb702a4932e8.camel@fb.com/
v2->v3:
- Rewrote the free_list algorithm based on discussions with Kumar. Patch 1.
- Allowed sleepable bpf progs use dynamically allocated maps. Patches 13 and 14.
- Added sysctl to force bpf_mem_alloc in hash map even if pre-alloc is
requested to reduce memory consumption. Patch 15.
- Fix: zero-fill percpu allocation
- Single rcu_barrier at the end instead of each cpu during bpf_mem_alloc destruction
v2 thread:
https://lore.kernel.org/bpf/20220817210419.95560-1-alexei.starovoitov@gmail.com/
v1->v2:
- Moved unsafe direct call_rcu() from hash map into safe place inside bpf_mem_alloc. Patches 7 and 9.
- Optimized atomic_inc/dec in hash map with percpu_counter. Patch 6.
- Tuned watermarks per allocation size. Patch 8
- Adopted this approach to per-cpu allocation. Patch 10.
- Fully converted hash map to bpf_mem_alloc. Patch 11.
- Removed tracing prog restriction on map types. Combination of all patches and final patch 12.
v1 thread:
https://lore.kernel.org/bpf/20220623003230.37497-1-alexei.starovoitov@gmail.com/
LWN article:
https://lwn.net/Articles/899274/
====================
Link: https://lore.kernel.org/r/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
User space might be creating and destroying a lot of hash maps. Synchronous
rcu_barrier-s in a destruction path of hash map delay freeing of hash buckets
and other map memory and may cause artificial OOM situation under stress.
Optimize rcu_barrier usage between bpf hash map and bpf_mem_alloc:
- remove rcu_barrier from hash map, since htab doesn't use call_rcu
directly and there are no callback to wait for.
- bpf_mem_alloc has call_rcu_in_progress flag that indicates pending callbacks.
Use it to avoid barriers in fast path.
- When barriers are needed copy bpf_mem_alloc into temp structure
and wait for rcu barrier-s in the worker to let the rest of
hash map freeing to proceed.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220902211058.60789-17-alexei.starovoitov@gmail.com
For bpf_mem_cache based hash maps the following stress test:
for (i = 1; i <= 512; i <<= 1)
for (j = 1; j <= 1 << 18; j <<= 1)
fd = bpf_map_create(BPF_MAP_TYPE_HASH, NULL, i, j, 2, 0);
creates many kmem_cache-s that are not mergeable in debug kernels
and consume unnecessary amount of memory.
Turned out bpf_mem_cache's free_list logic does batching well,
so usage of kmem_cache for fixes size allocations doesn't bring
any performance benefits vs normal kmalloc.
Hence get rid of kmem_cache in bpf_mem_cache.
That saves memory, speeds up map create/destroy operations,
while maintains hash map update/delete performance.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220902211058.60789-16-alexei.starovoitov@gmail.com
Since hash map is now converted to bpf_mem_alloc and it's waiting for rcu and
rcu_tasks_trace GPs before freeing elements into global memory slabs it's safe
to use dynamically allocated hash maps in sleepable bpf programs.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-15-alexei.starovoitov@gmail.com
Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
Then use call_rcu() to wait for normal progs to finish
and finally do free_one() on each element when freeing objects
into global memory pool.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-14-alexei.starovoitov@gmail.com
The hash map is now fully converted to bpf_mem_alloc. Its implementation is not
allocating synchronously and not calling call_rcu() directly. It's now safe to
use non-preallocated hash maps in all types of tracing programs including
BPF_PROG_TYPE_PERF_EVENT that runs out of NMI context.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220902211058.60789-13-alexei.starovoitov@gmail.com