linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-22 04:02:20 +00:00

Author	SHA1	Message	Date
Donald Hunter	294f37fc87	doc/netlink: Update genetlink-legacy documentation Add documentation for recently added genetlink-legacy schema attributes. Remove statements about 'work in progress' and 'todo'. Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20230825122756.7603-4-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-27 17:17:09 -07:00
Donald Hunter	ed68c58c0e	doc/netlink: Add a schema for netlink-raw families This schema is largely a copy of the genetlink-legacy schema with the following modifications: - change the schema id to netlink-raw - add a top-level protonum property, e.g. 0 (for NETLINK_ROUTE) - change the protocol enumeration to netlink-raw, removing the genetlink options. - replace doc references to generic netlink with raw netlink - add a value property to mcast-group definitions Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20230825122756.7603-3-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-27 17:17:09 -07:00
Donald Hunter	c4e1ab07b5	doc/netlink: Fix typo in genetlink-* schemas Fix typo verion -> version in genetlink-c and genetlink-legacy. Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20230825122756.7603-2-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-27 17:17:09 -07:00
Jakub Kicinski	75d6d8b5c1	Merge branch 'devlink-mlx5-add-port-function-attributes-for-ipsec' Saeed Mahameed says: ==================== {devlink,mlx5}: Add port function attributes for ipsec From Dima: Introduce hypervisor-level control knobs to set the functionality of PCI VF devices passed through to guests. The administrator of a hypervisor host may choose to change the settings of a port function from the defaults configured by the device firmware. The software stack has two types of IPsec offload - crypto and packet. Specifically, the ip xfrm command has sub-commands for "state" and "policy" that have an "offload" parameter. With ip xfrm state, both crypto and packet offload types are supported, while ip xfrm policy can only be offloaded in packet mode. The series introduces two new boolean attributes of a port function: ipsec_crypto and ipsec_packet. The goal is to provide a similar level of granularity for controlling VF IPsec offload capabilities, which would be aligned with the software model. This will allow users to decide if they want both types of offload enabled for a VF, just one of them, or none at all (which is the default). At a high level, the difference between the two knobs is that with ipsec_crypto, only XFRM state can be offloaded. Specifically, only the crypto operation (Encrypt/Decrypt) is offloaded. With ipsec_packet, both XFRM state and policy can be offloaded. Furthermore, in addition to crypto operation offload, IPsec encapsulation is also offloaded. For XFRM state, choosing between crypto and packet offload types is possible. From the HW perspective, different resources may be required for each offload type. Examples of when a user prefers to enable IPsec packet offload for a VF when using switchdev mode: $ devlink port show pci/0000:06:00.0/1 pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0 function: hw_addr 00:00:00:00:00:00 roce enable migratable disable ipsec_crypto disable ipsec_packet disable $ devlink port function set pci/0000:06:00.0/1 ipsec_packet enable $ devlink port show pci/0000:06:00.0/1 pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0 function: hw_addr 00:00:00:00:00:00 roce enable migratable disable ipsec_crypto disable ipsec_packet enable This enables the corresponding IPsec capability of the function before it's enumerated, so when the driver reads the capability from the device firmware, it is enabled. The driver is then able to configure corresponding features and ops of the VF net device to support IPsec state and policy offloading. v2: https://lore.kernel.org/netdev/20230421104901.897946-1-dchumak@nvidia.com/ ==================== Link: https://lore.kernel.org/r/20230825062836.103744-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-27 17:08:47 -07:00
Dima Chumak	b691b1116e	net/mlx5: Implement devlink port function cmds to control ipsec_packet Implement devlink port function commands to enable / disable IPsec packet offloads. This is used to control the IPsec capability of the device. When ipsec_offload is enabled for a VF, it prevents adding IPsec packet offloads on the PF, because the two cannot be active simultaneously due to HW constraints. Conversely, if there are any active IPsec packet offloads on the PF, it's not allowed to enable ipsec_packet on a VF, until PF IPsec offloads are cleared. Signed-off-by: Dima Chumak <dchumak@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20230825062836.103744-9-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-27 17:08:45 -07:00
Dima Chumak	06bab69658	net/mlx5: Implement devlink port function cmds to control ipsec_crypto Implement devlink port function commands to enable / disable IPsec crypto offloads. This is used to control the IPsec capability of the device. When ipsec_crypto is enabled for a VF, it prevents adding IPsec crypto offloads on the PF, because the two cannot be active simultaneously due to HW constraints. Conversely, if there are any active IPsec crypto offloads on the PF, it's not allowed to enable ipsec_crypto on a VF, until PF IPsec offloads are cleared. Signed-off-by: Dima Chumak <dchumak@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20230825062836.103744-8-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-27 17:08:45 -07:00
Leon Romanovsky	8efd7b17a3	net/mlx5: Provide an interface to block change of IPsec capabilities mlx5 HW can't perform IPsec offload operation simultaneously both on PF and VFs at the same time. While the previous patches added devlink knobs to change IPsec capabilities dynamically, there is a need to add a logic to block such IPsec capabilities for the cases when IPsec is already configured. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20230825062836.103744-7-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-27 17:08:45 -07:00
Leon Romanovsky	17c8da5a34	net/mlx5: Add IFC bits to support IPsec enable/disable Add hardware definitions to allow to control IPSec capabilities. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20230825062836.103744-6-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-27 17:08:45 -07:00
Leon Romanovsky	e253734166	net/mlx5e: Rewrite IPsec vs. TC block interface In the commit `366e46242b` ("net/mlx5e: Make IPsec offload work together with eswitch and TC"), new API to block IPsec vs. TC creation was introduced. Internally, that API used devlink lock to avoid races with userspace, but it is not really needed as dev->priv.eswitch is stable and can't be changed. So remove dependency on devlink lock and move block encap code back to its original place. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20230825062836.103744-5-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-27 17:08:45 -07:00
Leon Romanovsky	c46fb77383	net/mlx5: Drop extra layer of locks in IPsec There is no need in holding devlink lock as it gives nothing compared to already used write mode_lock. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Link: https://lore.kernel.org/r/20230825062836.103744-4-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-27 17:08:45 -07:00
Dima Chumak	390a24cbc3	devlink: Expose port function commands to control IPsec packet offloads Expose port function commands to enable / disable IPsec packet offloads, this is used to control the port IPsec capabilities. When IPsec packet is disabled for a function of the port (default), function cannot offload IPsec packet operations (encapsulation and XFRM policy offload). When enabled, IPsec packet operations can be offloaded by the function of the port, which includes crypto operation (Encrypt/Decrypt), IPsec encapsulation and XFRM state and policy offload. Example of a PCI VF port which supports IPsec packet offloads: $ devlink port show pci/0000:06:00.0/1 pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0 function: hw_addr 00:00:00:00:00:00 roce enable ipsec_packet disable $ devlink port function set pci/0000:06:00.0/1 ipsec_packet enable $ devlink port show pci/0000:06:00.0/1 pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0 function: hw_addr 00:00:00:00:00:00 roce enable ipsec_packet enable Signed-off-by: Dima Chumak <dchumak@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230825062836.103744-3-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-27 17:08:45 -07:00
Dima Chumak	62b6442c58	devlink: Expose port function commands to control IPsec crypto offloads Expose port function commands to enable / disable IPsec crypto offloads, this is used to control the port IPsec capabilities. When IPsec crypto is disabled for a function of the port (default), function cannot offload any IPsec crypto operations (Encrypt/Decrypt and XFRM state offloading). When enabled, IPsec crypto operations can be offloaded by the function of the port. Example of a PCI VF port which supports IPsec crypto offloads: $ devlink port show pci/0000:06:00.0/1 pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0 function: hw_addr 00:00:00:00:00:00 roce enable ipsec_crypto disable $ devlink port function set pci/0000:06:00.0/1 ipsec_crypto enable $ devlink port show pci/0000:06:00.0/1 pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0 function: hw_addr 00:00:00:00:00:00 roce enable ipsec_crypto enable Signed-off-by: Dima Chumak <dchumak@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230825062836.103744-2-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-27 17:08:44 -07:00
David S. Miller	aa05346dad	Merge branch 'iep-drver-timestamping-support' MD Danish Anwar says: ==================== Introduce IEP driver and packet timestamping support This series introduces Industrial Ethernet Peripheral (IEP) driver to support timestamping of ethernet packets and thus support PTP and PPS for PRU ICSSG ethernet ports. This series also adds 10M full duplex support for ICSSG ethernet driver. There are two IEP instances. IEP0 is used for packet timestamping while IEP1 is used for 10M full duplex support. This is v7 of the series [v1]. It addresses comments made on [v6]. This series is based on linux-next(#next-20230823). Changes from v6 to v7: ) Dropped blank line in example section of patch 1. ) Patch 1 previously had three examples, removed two examples and kept only one example as asked by Krzysztof. ) Added Jacob Keller's RB tag in patch 5. ) Dropped Roger's RB tags from the patches that he has authored (Patch 3 and 4) Changes from v5 to v6: ) Added description of IEP in commit messages of patch 2 as asked by Rob. ) Described the items constraints properly for iep property in patch 2 as asked by Rob. ) Added Roger and Simon's RB tags. Changes from v4 to v5: ) Added comments on why we are using readl / writel instead of regmap_read() / write() in icss_iep_gettime() / settime() APIs as asked by Roger. ) Added Conor's RB tag in patch 1 and 2. Change from v3 to v4: ) Changed compatible in iep dt bindings. Now each SoC has their own compatible in the binding with "ti,am654-icss-iep" as a fallback as asked by Conor. ) Addressed Andew's comments and removed helper APIs icss_iep_readl() / writel(). Now the settime/gettime APIs directly use readl() / writel(). ) Moved selecting TI_ICSS_IEP in Kconfig from patch 3 to patch 4. ) Removed forward declaration of icss_iep_of_match in patch 3. ) Replaced use of of_device_get_match_data() to device_get_match_data() in patch 3. ) Removed of_match_ptr() from patch 3 as it is not needed. Changes from v2 to v3: ) Addressed Roger's comment and moved IEP1 related changes in patch 5. ) Addressed Roger's comment and moved icss_iep.c / .h changes from patch 4 to patch 3. ) Added support for multiple timestamping in patch 4 as asked by Roger. ) Addressed Andrew's comment and added comment in case SPEED_10 in icssg_config_ipg() API. ) Kept compatible as "ti,am654-icss-iep" for all TI K3 SoCs Changes from v1 to v2: ) Addressed Simon's comment to fix reverse xmas tree declaration. Some APIs in patch 3 and 4 were not following reverse xmas tree variable declaration. Fixed it in this version. ) Addressed Conor's comments and removed unsupported SoCs from compatible comment in patch 1. *) Addded patch 2 which was not part of v1. Patch 2, adds IEP node to dt bindings for ICSSG. [v1] https://lore.kernel.org/all/20230803110153.3309577-1-danishanwar@ti.com/ [v2] https://lore.kernel.org/all/20230807110048.2611456-1-danishanwar@ti.com/ [v3] https://lore.kernel.org/all/20230809114906.21866-1-danishanwar@ti.com/ [v4] https://lore.kernel.org/all/20230814100847.3531480-1-danishanwar@ti.com/ [v5] https://lore.kernel.org/all/20230817114527.1585631-1-danishanwar@ti.com/ [v6] https://lore.kernel.org/all/20230823113254.292603-1-danishanwar@ti.com/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-27 07:13:24 +01:00
Grygorii Strashko	443a2367ba	net: ti: icssg-prueth: am65x SR2.0 add 10M full duplex support For AM65x SR2.0 it's required to enable IEP1 in raw 64bit mode which is used by PRU FW to monitor the link and apply w/a for 10M link issue. Note. No public errata available yet. Without this w/a the PRU FW will stuck if link state changes under TX traffic pressure. Hence, add support for 10M full duplex for AM65x SR2.0: - add new IEP API to enable IEP, but without PTP support - add pdata quirk_10m_link_issue to enable 10M link issue w/a. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com> Reviewed-by: Roger Quadros <rogerq@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: MD Danish Anwar <danishanwar@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-27 07:13:24 +01:00
Roger Quadros	186734c158	net: ti: icssg-prueth: add packet timestamping and ptp support Add packet timestamping TS and PTP PHC clock support. For AM65x and AM64x: - IEP1 is not used - IEP0 is configured in shadow mode with 1ms cycle and shared between Linux and FW. It provides time and TS in number cycles, so special conversation in ns is required. - IEP0 shared between PRUeth ports. - IEP0 supports PPS, periodic output. - IEP0 settime() and enabling PPS required FW interraction. - RX TS provided with each packet in CPPI5 descriptor. - TX TS returned through separate ICSSG hw queues for each port. TX TS readiness is signaled by INTC IRQ. Only one packet at time can be requested for TX TS. Signed-off-by: Roger Quadros <rogerq@ti.com> Co-developed-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: MD Danish Anwar <danishanwar@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-27 07:13:23 +01:00
Roger Quadros	c1e0230eea	net: ti: icss-iep: Add IEP driver Add a driver for Industrial Ethernet Peripheral (IEP) block of PRUSS to support timestamping of ethernet packets and thus support PTP and PPS for PRU ethernet ports. Signed-off-by: Roger Quadros <rogerq@ti.com> Signed-off-by: Lokesh Vutla <lokeshvutla@ti.com> Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: Vignesh Raghavendra <vigneshr@ti.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: MD Danish Anwar <danishanwar@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-27 07:13:23 +01:00
MD Danish Anwar	b120562783	dt-bindings: net: Add IEP property in ICSSG Add IEP property in ICSSG hardware DT binding document. ICSSG uses IEP (Industrial Ethernet Peripheral) to support timestamping of ethernet packets, PTP and PPS. Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Reviewed-by: Roger Quadros <rogerq@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: MD Danish Anwar <danishanwar@ti.com> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-27 07:13:23 +01:00
MD Danish Anwar	f0035689c0	dt-bindings: net: Add ICSS IEP Add a DT binding document for the ICSS Industrial Ethernet Peripheral(IEP) hardware. IEP supports packet timestamping, PTP and PPS. Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Reviewed-by: Roger Quadros <rogerq@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: MD Danish Anwar <danishanwar@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-27 07:13:23 +01:00
David S. Miller	2cc88bbcbb	Merge branch 'sfc-pedit-offloads' Pieter Jansen van Vuuren says: ==================== sfc: introduce eth, ipv4 and ipv6 pedit offloads This set introduces mac source and destination pedit set action offloads. It also adds offload for ipv4 ttl and ipv6 hop limit pedit set action as well pedit add actions that would result in the same semantics as decrementing the ttl and hop limit. v2: - fix 'efx_tc_mangle' kdoc which was orphaned when adding 'efx_tc_pedit_add'. - add description of 'match' in 'efx_tc_mangle' kdoc. - correct some inconsistent kdoc indentation. v1: https://lore.kernel.org/netdev/20230823111725.28090-1-pieter.jansen-van-vuuren@amd.com/ ==================== Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-27 06:56:54 +01:00
Pieter Jansen van Vuuren	e8e0bd60e4	sfc: extend pedit add action to handle decrement ipv6 hop limit Extend the pedit add actions to handle this case for ipv6. Similar to ipv4 dec ttl, decrementing ipv6 hop limit can be achieved by adding 0xff to the hop limit field. Co-developed-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-27 06:56:54 +01:00
Pieter Jansen van Vuuren	64848f062e	sfc: introduce pedit add actions on the ipv4 ttl field Introduce pedit add actions and use it to achieve decrement ttl offload. Decrement ttl can be achieved by adding 0xff to the ttl field. Co-developed-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-27 06:56:54 +01:00
Pieter Jansen van Vuuren	9dbc8d2b9a	sfc: add decrement ipv6 hop limit by offloading set hop limit actions Offload pedit set ipv6 hop limit, where the hop limit has already been matched and the new value is one less, by translating it to a decrement. Co-developed-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-27 06:56:54 +01:00
Pieter Jansen van Vuuren	66f7288726	sfc: add decrement ttl by offloading set ipv4 ttl actions Offload pedit set ipv4 ttl field, where the ttl field has already been matched and the new value is one less, by translating it to a decrement. Co-developed-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-27 06:56:54 +01:00
Pieter Jansen van Vuuren	0c676503bd	sfc: add mac source and destination pedit action offload Introduce the first pedit set offload functionality for the sfc driver. In addition to this, add offload functionality for both mac source and destination pedit set actions. Co-developed-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-27 06:56:54 +01:00
Pieter Jansen van Vuuren	439c4be983	sfc: introduce ethernet pedit set action infrastructure Introduce the initial ethernet pedit set action infrastructure in preparation for adding mac src and dst pedit action offloads. Co-developed-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: Pieter Jansen van Vuuren <pieter.jansen-van-vuuren@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-08-27 06:56:54 +01:00
Jakub Kicinski	b32add2d20	Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2023-08-24 (igc, e1000e) This series contains updates to igc and e1000e drivers. Vinicius adds support for utilizing multiple PTP registers on igc. Sasha reduces interval time for PTM on igc and adds new device support on e1000e. * '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: e1000e: Add support for the next LOM generation igc: Decrease PTM short interval from 10 us to 1 us igc: Add support for multiple in-flight TX timestamps ==================== Link: https://lore.kernel.org/r/20230824204418.1551093-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-25 19:09:46 -07:00
Donald Hunter	52d08fda35	doc/netlink: Add delete operation to ovs_vport spec Add del operation to the spec to help with testing. Signed-off-by: Donald Hunter <donald.hunter@gmail.com> Link: https://lore.kernel.org/r/20230824142221.71339-1-donald.hunter@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-25 18:58:34 -07:00
Jakub Kicinski	a02430c06f	tools: ynl-gen: fix uAPI generation after tempfile changes We use a tempfile for code generation, to avoid wiping the target file out if the code generator crashes. File contents are copied from tempfile to actual destination at the end of main(). uAPI generation is relatively simple so when generating the uAPI header we return from main() early, and never reach the "copy code over" stage. Since commit under Fixes uAPI headers are not updated by ynl-gen. Move the copy/commit of the code into CodeWriter, to make it easier to call at any point in time. Hook it into the destructor to make sure we don't miss calling it. Fixes: `f65f305ae0` ("tools: ynl-gen: use temporary file for rendering") Link: https://lore.kernel.org/r/20230824212431.1683612-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-25 18:56:48 -07:00
Jakub Kicinski	f5e17b471f	Merge branch 'stmmac-cleanups' Russell King says: ==================== stmmac cleanups One of the comments I had on Feiyang Chen's series was concerning the initialisation of phylink... and so I've decided to do something about it, cleaning it up a bit. This series: 1) adds a new phylink function to limit the MAC capabilities according to a maximum speed. This allows us to greatly simplify stmmac's initialisation of phylink's mac capabilities. 2) everywhere that uses priv->plat->phylink_node first converts this to a fwnode before doing anything with it. This is silly. Let's instead store it as a fwnode to eliminate these conversions in multiple places. 3) clean up passing the fwnode to phylink - it might as well happen at the phylink_create() callsite, rather than being scattered throughout the entire function. 4) same for mdio_bus_data 5) use phylink_limit_mac_speed() to handle the priv->plat->max_speed restriction. 6) add a method to get the MAC-specific capabilities from the code dealing with the MACs, and arrange to call it at an appropriate time. 7) convert the gmac4 users to use the MAC specific method. 8) same for xgmac. 9) group all the simple phylink_config initialisations together. 10) convert half-duplex logic to being positive logic. While looking into all of this, this raised eyebrows: if (priv->plat->tx_queues_to_use > 1) priv->phylink_config.mac_capabilities &= ~(MAC_10HD \| MAC_100HD \| MAC_1000HD); priv->plat->tx_queues_to_use is initialised by platforms to either 1, 4 or 8, and can be controlled from userspace via the --set-channels ethtool op. The implementation of this op in this driver limits the number of channels to priv->dma_cap.number_tx_queues, which is derived from the DMA hwcap. So, the obvious questions are: 1) what guarantees that the static initialisation of tx_queues_to_use will always be less than or equal to number_tx_queues from the DMA hw cap? 2) tx_queues_to_use starts off as 1, but number_tx_queues is larger, we will leave the half-duplex capabilities in place, but userspace can increase tx_queues_to_use above 1. Does that mean half-duplex is then not supported? 3) Should we be basing the decision whether half-duplex is supported off the DMA capabilities? 4) What about priv->dma_cap.half_duplex? Doesn't that get a say in whether half-duplex is supported or not? Why isn't this used? Why is it only reported via debugfs? If it's not being used by the driver, what's the point of reporting it via debugfs? ==================== Link: https://lore.kernel.org/r/ZOddFH22PWmOmbT5@shell.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-25 18:55:22 -07:00
Russell King (Oracle)	76649fc93f	net: stmmac: convert half-duplex support to positive logic Rather than detecting when half-duplex is not supported, and clearing the MAC capabilities, reverse the if() condition and use it to set the capabilities instead. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/E1qZAXn-005pUb-SP@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-25 18:55:19 -07:00
Russell King (Oracle)	64961f1b8c	net: stmmac: move priv->phylink_config.mac_managed_pm Move priv->phylink_config.mac_managed_pm to be along side the other phylink initialisations. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/E1qZAXi-005pUV-Nq@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-25 18:55:19 -07:00
Russell King (Oracle)	bedf9b8123	net: stmmac: move xgmac specific phylink caps to dwxgmac2 core Move the xgmac specific phylink capabilities to the dwxgmac2 support core. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/E1qZAXd-005pUP-JL@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-25 18:55:19 -07:00
Russell King (Oracle)	f1dae3d222	net: stmmac: move gmac4 specific phylink capabilities to gmac4 Move the setup of gmac4 speicifc phylink capabilities into gmac4 code. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/E1qZAXY-005pUJ-Ez@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-25 18:55:19 -07:00
Russell King (Oracle)	d42ca04e04	net: stmmac: provide stmmac_mac_phylink_get_caps() Allow MACs to provide their own capabilities via the MAC operations struct. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/E1qZAXT-005pUD-Aj@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-25 18:55:19 -07:00
Russell King (Oracle)	a4ac612bd3	net: stmmac: use phylink_limit_mac_speed() Use phylink_limit_mac_speed() to limit the MAC capabilities rather than coding this for each speed. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/E1qZAXO-005pU7-61@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-25 18:55:19 -07:00
Russell King (Oracle)	2b070cdd3a	net: stmmac: use "mdio_bus_data" local variable We have a local variable for priv->plat->mdio_bus_data, which we use later in the conditional if() block, but we evaluate the above within the conditional expression. Use mdio_bus_data instead. Since these will be the only two users of this local variable, move its assignment just before the if(). Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/E1qZAXJ-005pU1-1z@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-25 18:55:18 -07:00
Russell King (Oracle)	1a37c1c198	net: stmmac: clean up passing fwnode to phylink Move the initialisation of the fwnode variable closer to its use site, rather than scattered throughout stmmac_phy_setup(). Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/E1qZAXD-005pTv-TN@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-25 18:55:18 -07:00
Russell King (Oracle)	e80af2acde	net: stmmac: convert plat->phylink_node to fwnode All users of plat->phylink_node first convert it to a fwnode. Rather than repeatedly convert to a fwnode, store it as a fwnode. To reflect this change, call it plat->port_node instead - it is used for more than just phylink. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/E1qZAX8-005pTo-OT@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-25 18:55:18 -07:00
Russell King (Oracle)	70934c7c99	net: phylink: add phylink_limit_mac_speed() Add a function which can be used to limit the phylink MAC capabilities to an upper speed limit. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/E1qZAX3-005pTi-K1@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-25 18:55:18 -07:00
Liang Chen	215eb9f962	veth: Avoid NAPI scheduling on failed SKB forwarding When an skb fails to be forwarded to the peer(e.g., skb data buffer length exceeds MTU), it will not be added to the peer's receive queue. Therefore, we should schedule the peer's NAPI poll function only when skb forwarding is successful to avoid unnecessary overhead. Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Link: https://lore.kernel.org/r/20230824123131.7673-1-liangchen.linux@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-25 18:52:11 -07:00
Jakub Kicinski	bebfbf07c7	bpf-next-for-netdev -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZOjkTAAKCRDbK58LschI gx32AP9gaaHFBtOYBfoenKTJfMgv1WhtQHIBas+WN9ItmBx9MAEA4gm/VyQ6oD7O EBjJKJQ2CZ/QKw7cNacXw+l5jF7/+Q0= =8P7g -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-08-25 We've added 87 non-merge commits during the last 8 day(s) which contain a total of 104 files changed, 3719 insertions(+), 4212 deletions(-). The main changes are: 1) Add multi uprobe BPF links for attaching multiple uprobes and usdt probes, which is significantly faster and saves extra fds, from Jiri Olsa. 2) Add support BPF cpu v4 instructions for arm64 JIT compiler, from Xu Kuohai. 3) Add support BPF cpu v4 instructions for riscv64 JIT compiler, from Pu Lehui. 4) Fix LWT BPF xmit hooks wrt their return values where propagating the result from skb_do_redirect() would trigger a use-after-free, from Yan Zhai. 5) Fix a BPF verifier issue related to bpf_kptr_xchg() with local kptr where the map's value kptr type and locally allocated obj type mismatch, from Yonghong Song. 6) Fix BPF verifier's check_func_arg_reg_off() function wrt graph root/node which bypassed reg->off == 0 enforcement, from Kumar Kartikeya Dwivedi. 7) Lift BPF verifier restriction in networking BPF programs to treat comparison of packet pointers not as a pointer leak, from Yafang Shao. 8) Remove unmaintained XDP BPF samples as they are maintained in xdp-tools repository out of tree, from Toke Høiland-Jørgensen. 9) Batch of fixes for the tracing programs from BPF samples in order to make them more libbpf-aware, from Daniel T. Lee. 10) Fix a libbpf signedness determination bug in the CO-RE relocation handling logic, from Andrii Nakryiko. 11) Extend libbpf to support CO-RE kfunc relocations. Also follow-up fixes for bpf_refcount shared ownership implementation, both from Dave Marchevsky. 12) Add a new bpf_object__unpin() API function to libbpf, from Daniel Xu. 13) Fix a memory leak in libbpf to also free btf_vmlinux when the bpf_object gets closed, from Hao Luo. 14) Small error output improvements to test_bpf module, from Helge Deller. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (87 commits) selftests/bpf: Add tests for rbtree API interaction in sleepable progs bpf: Allow bpf_spin_{lock,unlock} in sleepable progs bpf: Consider non-owning refs to refcounted nodes RCU protected bpf: Reenable bpf_refcount_acquire bpf: Use bpf_mem_free_rcu when bpf_obj_dropping refcounted nodes bpf: Consider non-owning refs trusted bpf: Ensure kptr_struct_meta is non-NULL for collection insert and refcount_acquire selftests/bpf: Enable cpu v4 tests for RV64 riscv, bpf: Support unconditional bswap insn riscv, bpf: Support signed div/mod insns riscv, bpf: Support 32-bit offset jmp insn riscv, bpf: Support sign-extension mov insns riscv, bpf: Support sign-extension load insns riscv, bpf: Fix missing exception handling and redundant zext for LDX_B/H/W samples/bpf: Add note to README about the XDP utilities moved to xdp-tools samples/bpf: Cleanup .gitignore samples/bpf: Remove the xdp_sample_pkts utility samples/bpf: Remove the xdp1 and xdp2 utilities samples/bpf: Remove the xdp_rxq_info utility samples/bpf: Remove the xdp_redirect* utilities ... ==================== Link: https://lore.kernel.org/r/20230825194319.12727-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-25 18:40:15 -07:00
Jakub Kicinski	1fa6ffad12	wireless-next patches for v6.6 The second pull request for v6.6, this time with both stack and driver changes. Unusually we have only one major new feature but lots of small cleanup all over, I guess this is due to people have been on vacation the last month. Major changes: rtw89 * Introduce Time Averaged SAR (TAS) support -----BEGIN PGP SIGNATURE----- iQFFBAABCgAvFiEEiBjanGPFTz4PRfLobhckVSbrbZsFAmToqosRHGt2YWxvQGtl cm5lbC5vcmcACgkQbhckVSbrbZv9XQf9HDq9smbuWLvwzNjbbS31hHFLmnfhN8Zp +Zzn47gpMCle9ahGLQyw8lcfNPWCMyqOu4sGQ6hyyuH+YXoxZryuq9QDwWo9L/b1 5Cpm4IaBYBMm0ZoOkWw2lQSzGyNrXgvCEKRVC+pYQMvr5V2aEWxT/kT4guiou9D5 OXPRFN2iqZP0Q3TKcfKWRnWn3S0Ok3kZCFuXcWkL0sgwjqP/wbAPO1XNI1IImKNM xUd0zT4vK/layYq7i20y8blglI5kcp/aKCFEwYpQC2WPeZ3Wtl1G9PQ8eze5Gc2Q NTw3xfr6tENIcAmYoLdBdKbUq6e6pwLwXlojlZ2beR6s7LHM30AinQ== =2Hja -----END PGP SIGNATURE----- Merge tag 'wireless-next-2023-08-25' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v6.6 The second pull request for v6.6, this time with both stack and driver changes. Unusually we have only one major new feature but lots of small cleanup all over, I guess this is due to people have been on vacation the last month. Major changes: rtw89 - Introduce Time Averaged SAR (TAS) support * tag 'wireless-next-2023-08-25' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (114 commits) wifi: rtlwifi: rtl8723: Remove unused function rtl8723_cmd_send_packet() wifi: rtw88: usb: kill and free rx urbs on probe failure wifi: rtw89: Fix clang -Wimplicit-fallthrough in rtw89_query_sar() wifi: rtw89: phy: modify register setting of ENV_MNTR, PHYSTS and DIG wifi: rtw89: phy: add phy_gen_def::cr_base to support WiFi 7 chips wifi: rtw89: mac: define register address of rx_filter to generalize code wifi: rtw89: mac: define internal memory address for WiFi 7 chip wifi: rtw89: mac: generalize code to indirectly access WiFi internal memory wifi: rtw89: mac: add mac_gen_def::band1_offset to map MAC band1 register address wifi: wlcore: sdio: Use module_sdio_driver macro to simplify the code wifi: rtw89: initialize multi-channel handling wifi: rtw89: provide functions to configure NoA for beacon update wifi: rtw89: call rtw89_chan_get() by vif chanctx if aware of vif wifi: rtw89: sar: let caller decide the center frequency to query wifi: rtw89: refine rtw89_correct_cck_chan() by rtw89_hw_to_nl80211_band() wifi: rtw89: add function prototype for coex request duration Fix nomenclature for USB and PCI wireless devices wifi: ath: Use is_multicast_ether_addr() to check multicast Ether address wifi: ath12k: Remove unused declarations wifi: ath12k: add check max message length while scanning with extraie ... ==================== Link: https://lore.kernel.org/r/20230825132230.A0833C433C8@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-25 18:35:09 -07:00
Jakub Kicinski	3db3474763	bluetooth-next pull request for net-next: - Introduce HCI_QUIRK_BROKEN_LE_CODED - Add support for PA/BIG sync - Add support for NXP IW624 chipset - Add support for Qualcomm WCN7850 -----BEGIN PGP SIGNATURE----- iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmTnua4ZHGx1aXoudm9u LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKZtPD/4i+sYW1vmJi1VRr13Im1D+ 8RTcipgiM3Wehyc+aGx5sli2L/uHm5JXQBPb2SpMtLa7HoXdMxD/g6cxPGT8I4nZ 8JmPQzklEYw3xCs/6AUm/LmIl5XAmmpvP6Ky9DifYoGgZxbY2iY3eGLp9qY5MQ0P rZTN4hqy0887F8iG/jC4dkID0h7m2WjjIp6GbT+klpwZQCO8PTaGk8NWYxVbkdrD 4sH+X4/DSzsYOokI2zZSXgrHtkA+OEowNu9T0ItyzcG7X5f9lz4154KeeFFUoZ6/ 83A7c5YcQY6Kg+xQDtHb1Im7St8JEHgMjTGjQEme9HZkdf/KMxWQVcFvCVIpgU/B CQmaxaUIUTcMyZ8UDwpxHvJf2DT3IYHjqP3kNjUNPgrt4V1eBIpqod39M126lV37 CSLqALpLCXRez5GuCDX2NcoPjPZqu7gkARZrIz5q9v83QaYhQiohpYAkrjdkrIWT rg3iNbjYx+0okOPe81hhAKjV8iARorGUty8aOjyfDLW0F5GMowp/UYefXs3GVXvI LmY7uVoDXHLriuzcMJ1LbjaxEzpgVr3FuDEhnCx050l1TU0kg7ks6dSxdf4dpt3l zFR4YR6zBDOyxRJay4srhZYTLfq6ENxDHRpJR/TuO7gsZyOXnHRlSMiCra+xnZnD LNnXwZBJdxpo7ZBQG/25iA== =bt6N -----END PGP SIGNATURE----- Merge tag 'for-net-next-2023-08-24' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: - Introduce HCI_QUIRK_BROKEN_LE_CODED - Add support for PA/BIG sync - Add support for NXP IW624 chipset - Add support for Qualcomm WCN7850 * tag 'for-net-next-2023-08-24' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: Bluetooth: btusb: Do not call kfree_skb() under spin_lock_irqsave() Bluetooth: btusb: Fix quirks table naming Bluetooth: HCI: Introduce HCI_QUIRK_BROKEN_LE_CODED Bluetooth: btintel: Send new command for PPAG Bluetooth: ISO: Add support for periodic adv reports processing Bluetooth: hci_conn: fail SCO/ISO via hci_conn_failed if ACL gone early Bluetooth: hci_core: Fix missing instances using HCI_MAX_AD_LENGTH Bluetooth: ISO: Use defer setup to separate PA sync and BIG sync Bluetooth: qca: add support for WCN7850 Bluetooth: qca: use switch case for soc type behavior dt-bindings: net: bluetooth: qualcomm: document WCN7850 chipset Bluetooth: hci_conn: Fix sending BT_HCI_CMD_LE_CREATE_CONN_CANCEL Bluetooth: hci_sync: Fix UAF in hci_disconnect_all_sync Bluetooth: btnxpuart: Improve inband Independent Reset handling Bluetooth: btnxpuart: Add support for IW624 chipset Bluetooth: btnxpuart: Remove check for CTS low after FW download ==================== Link: https://lore.kernel.org/r/20230824201458.2577-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-08-25 18:30:59 -07:00
Alexei Starovoitov	ec0ded2e02	Merge branch 'bpf-refcount-followups-3-bpf_mem_free_rcu-refcounted-nodes' Dave Marchevsky says: ==================== BPF Refcount followups 3: bpf_mem_free_rcu refcounted nodes This series is the third of three (or more) followups to address issues in the bpf_refcount shared ownership implementation discovered by Kumar. This series addresses the use-after-free scenario described in [0]. The first followup series ([1]) also attempted to address the same use-after-free, but only got rid of the splat without addressing the underlying issue. After this series the underyling issue is fixed and bpf_refcount_acquire can be re-enabled. The main fix here is migration of bpf_obj_drop to use bpf_mem_free_rcu. To understand why this fixes the issue, let us consider the example interleaving provided by Kumar in [0]: CPU 0 CPU 1 n = bpf_obj_new lock(lock1) bpf_rbtree_add(rbtree1, n) m = bpf_rbtree_acquire(n) unlock(lock1) kptr_xchg(map, m) // move to map // at this point, refcount = 2 m = kptr_xchg(map, NULL) lock(lock2) lock(lock1) bpf_rbtree_add(rbtree2, m) p = bpf_rbtree_first(rbtree1) if (!RB_EMPTY_NODE) bpf_obj_drop_impl(m) // A bpf_rbtree_remove(rbtree1, p) unlock(lock1) bpf_obj_drop(p) // B bpf_refcount_acquire(m) // use-after-free ... Before this series, bpf_obj_drop returns memory to the allocator using bpf_mem_free. At this point (B in the example) there might be some non-owning references to that memory which the verifier believes are valid, but where the underlying memory was reused for some other allocation. Commit `7793fc3bab` ("bpf: Make bpf_refcount_acquire fallible for non-owning refs") attempted to fix this by doing refcount_inc_non_zero on refcount_acquire in instead of refcount_inc under the assumption that preventing erroneous incr-on-0 would be sufficient. This isn't true, though: refcount_inc_non_zero must check if the refcount is zero, and the memory it's checking could have been reused, so the check may look at and incr random reused bytes. If we wait to reuse this memory until all non-owning refs that could point to it are gone, there is no possibility of this scenario happening. Migrating bpf_obj_drop to use bpf_mem_free_rcu for refcounted nodes accomplishes this. For such nodes, the validity of their underlying memory is now tied to RCU critical section. This matches MEM_RCU trustedness expectations, so the series takes the opportunity to more explicitly mark this trustedness state. The functional effects of trustedness changes here are rather small. This is largely due to local kptrs having separate verifier handling - with implicit trustedness assumptions - than arbitrary kptrs. Regardless, let's take the opportunity to move towards a world where trustedness is more explicitly handled. Changelog: v1 -> v2: https://lore.kernel.org/bpf/20230801203630.3581291-1-davemarchevsky@fb.com/ Patch 1 ("bpf: Ensure kptr_struct_meta is non-NULL for collection insert and refcount_acquire") * Spent some time experimenting with a better approach as per convo w/ Yonghong on v1's patch. It started getting too complex, so left unchanged for now. Yonghong was fine with this approach being shipped. Patch 2 ("bpf: Consider non-owning refs trusted") * Add Yonghong ack Patch 3 ("bpf: Use bpf_mem_free_rcu when bpf_obj_dropping refcounted nodes") * Add Yonghong ack Patch 4 ("bpf: Reenable bpf_refcount_acquire") * Add Yonghong ack Patch 5 ("bpf: Consider non-owning refs to refcounted nodes RCU protected") * Undo a nonfunctional whitespace change that shouldn't have been included (Yonghong) * Better logging message when complaining about rcu_read_{lock,unlock} in rbtree cb (Alexei) * Don't invalidate_non_owning_refs when processing bpf_rcu_read_unlock (Yonghong, Alexei) Patch 6 ("[RFC] bpf: Allow bpf_spin_{lock,unlock} in sleepable prog's RCU CS") * preempt_{disable,enable} in __bpf_spin_{lock,unlock} (Alexei) * Due to this we can consider spin_lock CS an RCU-sched read-side CS (per RCU/Design/Requirements/Requirements.rst). Modify in_rcu_cs accordingly. * no need to check for !in_rcu_cs before allowing bpf_spin_{lock,unlock} (Alexei) * RFC tag removed and renamed to "bpf: Allow bpf_spin_{lock,unlock} in sleepable progs" Patch 7 ("selftests/bpf: Add tests for rbtree API interaction in sleepable progs") * Remove "no explicit bpf_rcu_read_lock" failure test, add similar success test (Alexei) Summary of patch contents, with sub-bullets being leading questions and comments I think are worth reviewer attention: * Patches 1 and 2 are moreso documententation - and enforcement, in patch 1's case - of existing semantics / expectations * Patch 3 changes bpf_obj_drop behavior for refcounted nodes such that their underlying memory is not reused until RCU grace period elapses * Perhaps it makes sense to move to mem_free_rcu for _all_ non-owning refs in the future, not just refcounted. This might allow custom non-owning ref lifetime + invalidation logic to be entirely subsumed by MEM_RCU handling. IMO this needs a bit more thought and should be tackled outside of a fix series, so it's not attempted here. * Patch 4 re-enables bpf_refcount_acquire as changes in patch 3 fix the remaining use-after-free * One might expect this patch to be last in the series, or last before selftest changes. Patches 5 and 6 don't change verification or runtime behavior for existing BPF progs, though. * Patch 5 brings the verifier's understanding of refcounted node trustedness in line with Patch 4's changes * Patch 6 allows some bpf_spin_{lock, unlock} calls in sleepable progs. Marked RFC for a few reasons: * bpf_spin_{lock,unlock} haven't been usable in sleepable progs since before the introduction of bpf linked list and rbtree. As such this feels more like a new feature that may not belong in this fixes series. * Patch 7 adds tests [0]: https://lore.kernel.org/bpf/atfviesiidev4hu53hzravmtlau3wdodm2vqs7rd7tnwft34e3@xktodqeqevir/ [1]: https://lore.kernel.org/bpf/20230602022647.1571784-1-davemarchevsky@fb.com/ ==================== Link: https://lore.kernel.org/r/20230821193311.3290257-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-08-25 09:23:23 -07:00
Dave Marchevsky	312aa5bde8	selftests/bpf: Add tests for rbtree API interaction in sleepable progs Confirm that the following sleepable prog states fail verification: * bpf_rcu_read_unlock before bpf_spin_unlock * RCU CS will last at least as long as spin_lock CS Also confirm that correct usage passes verification, specifically: * Explicit use of bpf_rcu_read_{lock, unlock} in sleepable test prog * Implied RCU CS due to spin_lock CS None of the selftest progs actually attach to bpf_testmod's bpf_testmod_test_read. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230821193311.3290257-8-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-08-25 09:23:17 -07:00
Dave Marchevsky	5861d1e8db	bpf: Allow bpf_spin_{lock,unlock} in sleepable progs Commit `9e7a4d9831` ("bpf: Allow LSM programs to use bpf spin locks") disabled bpf_spin_lock usage in sleepable progs, stating: Sleepable LSM programs can be preempted which means that allowng spin locks will need more work (disabling preemption and the verifier ensuring that no sleepable helpers are called when a spin lock is held). This patch disables preemption before grabbing bpf_spin_lock. The second requirement above "no sleepable helpers are called when a spin lock is held" is implicitly enforced by current verifier logic due to helper calls in spin_lock CS being disabled except for a few exceptions, none of which sleep. Due to above preemption changes, bpf_spin_lock CS can also be considered a RCU CS, so verifier's in_rcu_cs check is modified to account for this. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230821193311.3290257-7-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-08-25 09:23:17 -07:00
Dave Marchevsky	0816b8c6bf	bpf: Consider non-owning refs to refcounted nodes RCU protected An earlier patch in the series ensures that the underlying memory of nodes with bpf_refcount - which can have multiple owners - is not reused until RCU grace period has elapsed. This prevents use-after-free with non-owning references that may point to recently-freed memory. While RCU read lock is held, it's safe to dereference such a non-owning ref, as by definition RCU GP couldn't have elapsed and therefore underlying memory couldn't have been reused. From the perspective of verifier "trustedness" non-owning refs to refcounted nodes are now trusted only in RCU CS and therefore should no longer pass is_trusted_reg, but rather is_rcu_reg. Let's mark them MEM_RCU in order to reflect this new state. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230821193311.3290257-6-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-08-25 09:23:16 -07:00
Dave Marchevsky	ba2464c86f	bpf: Reenable bpf_refcount_acquire Now that all reported issues are fixed, bpf_refcount_acquire can be turned back on. Also reenable all bpf_refcount-related tests which were disabled. This a revert of: * commit `f3514a5d67` ("selftests/bpf: Disable newly-added 'owner' field test until refcount re-enabled") * commit `7deca5eae8` ("bpf: Disable bpf_refcount_acquire kfunc calls until race conditions are fixed") Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230821193311.3290257-5-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-08-25 09:23:16 -07:00
Dave Marchevsky	7e26cd12ad	bpf: Use bpf_mem_free_rcu when bpf_obj_dropping refcounted nodes This is the final fix for the use-after-free scenario described in commit `7793fc3bab` ("bpf: Make bpf_refcount_acquire fallible for non-owning refs"). That commit, by virtue of changing bpf_refcount_acquire's refcount_inc to a refcount_inc_not_zero, fixed the "refcount incr on 0" splat. The not_zero check in refcount_inc_not_zero, though, still occurs on memory that could have been free'd and reused, so the commit didn't properly fix the root cause. This patch actually fixes the issue by free'ing using the recently-added bpf_mem_free_rcu, which ensures that the memory is not reused until RCU grace period has elapsed. If that has happened then there are no non-owning references alive that point to the recently-free'd memory, so it can be safely reused. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230821193311.3290257-4-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-08-25 09:23:16 -07:00
Dave Marchevsky	2a6d50b50d	bpf: Consider non-owning refs trusted Recent discussions around default kptr "trustedness" led to changes such as commit `6fcd486b3a` ("bpf: Refactor RCU enforcement in the verifier."). One of the conclusions of those discussions, as expressed in code and comments in that patch, is that we'd like to move away from 'raw' PTR_TO_BTF_ID without some type flag or other register state indicating trustedness. Although PTR_TRUSTED and PTR_UNTRUSTED flags mark this state explicitly, the verifier currently considers trustedness implied by other register state. For example, owning refs to graph collection nodes must have a nonzero ref_obj_id, so they pass the is_trusted_reg check despite having no explicit PTR_{UN}TRUSTED flag. This patch makes trustedness of non-owning refs to graph collection nodes explicit as well. By definition, non-owning refs are currently trusted. Although the ref has no control over pointee lifetime, due to non-owning ref clobbering rules (see invalidate_non_owning_refs) dereferencing a non-owning ref is safe in the critical section controlled by bpf_spin_lock associated with its owning collection. Note that the previous statement does not hold true for nodes with shared ownership due to the use-after-free issue that this series is addressing. True shared ownership was disabled by commit `7deca5eae8` ("bpf: Disable bpf_refcount_acquire kfunc calls until race conditions are fixed"), though, so the statement holds for now. Further patches in the series will change the trustedness state of non-owning refs before re-enabling bpf_refcount_acquire. Let's add NON_OWN_REF type flag to BPF_REG_TRUSTED_MODIFIERS such that a non-owning ref reg state would pass is_trusted_reg check. Somewhat surprisingly, this doesn't result in any change to user-visible functionality elsewhere in the verifier: graph collection nodes are all marked MEM_ALLOC, which tends to be handled in separate codepaths from "raw" PTR_TO_BTF_ID. Regardless, let's be explicit here and document the current state of things before changing it elsewhere in the series. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20230821193311.3290257-3-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2023-08-25 09:23:16 -07:00

1 2 3 4 5 ...

1203772 Commits