linux

Author	SHA1	Message	Date
Kurt Kanzenbach	bf08824a0f	flow_dissector: Add support for HSR Network drivers such as igb or igc call eth_get_headlen() to determine the header length for their to be constructed skbs in receive path. When running HSR on top of these drivers, it results in triggering BUG_ON() in skb_pull(). The reason is the skb headlen is not sufficient for HSR to work correctly. skb_pull() notices that. For instance, eth_get_headlen() returns 14 bytes for TCP traffic over HSR which is not correct. The problem is, the flow dissection code does not take HSR into account. Therefore, add support for it. Reported-by: Anthony Harivel <anthony.harivel@linutronix.de> Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de> Link: https://lore.kernel.org/r/20220228195856.88187-1-kurt@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-02 22:44:49 -08:00
Baruch Siach	0020288573	net: dsa: mv88e6xxx: support RMII cmode Add support for direct RMII MAC mode. This allows hardware with CPU port connected in direct 100M fixed link to work properly. Signed-off-by: Baruch Siach <baruch.siach@siklu.com> Link: https://lore.kernel.org/r/a962d1ccbeec42daa10dd8aff0e66e31f0faf1eb.1646050203.git.baruch@tkos.co.il Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-02 22:33:37 -08:00
Baruch Siach	13b0bd2e62	net: dsa: mv88e6xxx: don't error out cmode set on missing lane When the given cmode has no serdes, mv88e6xxx_serdes_get_lane() returns -ENODEV. Earlier in the same function the code skips serdes handling in this case. Do the same after cmode set. Signed-off-by: Baruch Siach <baruch.siach@siklu.com> Link: https://lore.kernel.org/r/cd95cf3422ae8daf297a01fa9ec3931b203cdf45.1646050203.git.baruch@tkos.co.il Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-02 22:33:37 -08:00
Yang Li	cb1d8fba91	net: openvswitch: remove unneeded semicolon Eliminate the following coccicheck warning: ./net/openvswitch/flow.c:379:2-3: Unneeded semicolon Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/20220227132208.24658-1-yang.lee@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-02 22:22:18 -08:00
Baowen Zheng	d922a99b96	flow_offload: improve extack msg for user when adding invalid filter Add an extack message so that the user gets an exact error when adding an invalid filter with conflicting flags for a TC action. The previous implementation just returned EINVAL, which is confusing for the user. Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Link: https://lore.kernel.org/r/1646191769-17761-1-git-send-email-baowen.zheng@corigine.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-02 22:16:10 -08:00
Jakub Kicinski	2102a27e49	Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== 40GbE Intel Wired LAN Driver Updates 2022-03-01 This series contains updates to iavf driver only. Mateusz adds support for interrupt moderation for 50G and 100G speeds as well as support for the driver to specify a request as its primary MAC address. He also refactors VLAN V2 capability exchange into more generic extended capabilities to ease the addition of future capabilities. Finally, he corrects the incorrect return of iavf_status values and removes non-inclusive language. Minghao Chi removes unneeded variables, instead returning values directly. * '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: iavf: Remove non-inclusive language iavf: Fix incorrect use of assigning iavf_status to int iavf: stop leaking iavf_status as "errno" values iavf: remove redundant ret variable iavf: Add usage of new virtchnl format to set default MAC iavf: refactor processing of VLAN V2 capability message iavf: Add support for 50G/100G in AIM algorithm ==================== Link: https://lore.kernel.org/r/20220301185939.3005116-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-02 22:13:06 -08:00
Christophe JAILLET	432509013f	nfp: flower: Remove usage of the deprecated ida_simple_xxx API Use ida_alloc_xxx()/ida_free() instead of ida_simple_get()/ida_simple_remove(). The latter is deprecated and more verbose. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20220301131212.26348-1-simon.horman@corigine.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-02 22:05:17 -08:00
Russell King (Oracle)	9ae1ef4b16	net: sfp: use %pe for printing errors Convert sfp to use %pe for printing error codes, which can print them as errno symbols rather than numbers. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/E1nOyEN-00BuuE-OB@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-02 22:03:34 -08:00
Russell King (Oracle)	ab1198e5a1	net: phylink: use %pe for printing errors Convert phylink to use %pe for printing error codes, which can print them as errno symbols rather than numbers. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/E1nOyEI-00Buu8-K9@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-02 22:02:21 -08:00
Harold Huang	74a335a07a	tuntap: add sanity checks about msg_controllen in sendmsg In patch [1], tun_msg_ctl was added to allow passing batched xdp buffers to tun_sendmsg. Although we do not use msg_controllen in this path, we should check msg_controllen to make sure the caller passes a valid msg_ctl. [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fe8dd45bb7556246c6b76277b1ba4296c91c2505 Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Suggested-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Harold Huang <baymaxhuang@gmail.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20220303022441.383865-1-baymaxhuang@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-02 22:00:59 -08:00
Jakub Kicinski	fa452e0a60	Merge tag 'batadv-next-pullrequest-20220302' of git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== This cleanup patchset includes the following patches: - bump version strings, by Simon Wunderlich - Remove redundant 'flush_workqueue()' calls, by Christophe JAILLET - Migrate to linux/container_of.h, by Sven Eckelmann - Demote batadv-on-batadv skip error message, by Sven Eckelmann * tag 'batadv-next-pullrequest-20220302' of git://git.open-mesh.org/linux-merge: batman-adv: Demote batadv-on-batadv skip error message batman-adv: Migrate to linux/container_of.h batman-adv: Remove redundant 'flush_workqueue()' calls batman-adv: Start new development cycle ==================== Link: https://lore.kernel.org/r/20220302163522.102842-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-02 21:58:03 -08:00
Wang Qing	a577223a97	net: hamradio: fix compliation error Add the missing ")" that the previous commit left out. Fixes: `61c4fb9c4d` ("net: hamradio: use time_is_after_jiffies() instead of open coding it") Link: https://lore.kernel.org/all/1646018012-61129-1-git-send-email-wangqing@vivo.com/ Signed-off-by: Wang Qing <wangqing@vivo.com> Link: https://lore.kernel.org/r/1646203277-83159-1-git-send-email-wangqing@vivo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-02 09:47:38 -08:00
Sven Eckelmann	6ee3c393ee	batman-adv: Demote batadv-on-batadv skip error message The error message "Cannot find parent device" was shown for users of macvtap (on batadv devices) whenever the macvtap was moved to a different netns. This happens because macvtap doesn't provide an implementation for rtnl_link_ops->get_link_net. The situation for which this message is printed is actually not an error but just a warning that the optional sanity check was skipped. So demote the message from error to warning and adjust the text to better explain what happened. Reported-by: Leonardo Mörlein <freifunk@irrelefant.net> Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2022-03-02 09:00:17 +01:00
Sven Eckelmann	eb7da4f17d	batman-adv: Migrate to linux/container_of.h The commit `d2a8ebbf81` ("kernel.h: split out container_of() and typeof_member() macros") introduced a new header for the container_of related macros from (previously) linux/kernel.h. Signed-off-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>	2022-03-02 09:00:13 +01:00
Jakub Kicinski	96946d892a	Merge branch 'if_ether-h-add-industrial-fieldbus-ethertypes' Daniel Braunwarth says: ==================== if_ether.h: add industrial fieldbus Ethertypes This set of patches adds the Ethertypes for PROFINET and EtherCAT. The defines should be used by iproute2 to extend the list of available link layer protocols. ==================== Link: https://lore.kernel.org/r/20220228133029.100913-1-daniel@braunwarth.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-01 18:29:35 -08:00
Daniel Braunwarth	cd73cda742	if_ether.h: add EtherCAT Ethertype Add the Ethertype for EtherCAT protocol. Signed-off-by: Daniel Braunwarth <daniel@braunwarth.dev> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-01 18:29:27 -08:00
Daniel Braunwarth	dd0ca255f3	if_ether.h: add PROFINET Ethertype Add the Ethertype for PROFINET protocol. Signed-off-by: Daniel Braunwarth <daniel@braunwarth.dev> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-01 18:29:27 -08:00
Sven Eckelmann	a02192151b	macvtap: advertise link netns via netlink Assign rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is added to rtnetlink messages. This fixes iproute2 which otherwise resolved the link interface to an interface in the wrong namespace. Test commands: ip netns add nst ip link add dummy0 type dummy ip link add link macvtap0 link dummy0 type macvtap ip link set macvtap0 netns nst ip -netns nst link show macvtap0 Before: 10: macvtap0@gre0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 500 link/ether 5e:8f:ae:1d:60:50 brd ff:ff:ff:ff:ff:ff After: 10: macvtap0@if2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 500 link/ether 5e:8f:ae:1d:60:50 brd ff:ff:ff:ff:ff:ff link-netnsid 0 Reported-by: Leonardo Mörlein <freifunk@irrelefant.net> Signed-off-by: Sven Eckelmann <sven@narfation.org> Link: https://lore.kernel.org/r/20220228003240.1337426-1-sven@narfation.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-01 17:59:28 -08:00
Wan Jiabing	323d51cac6	nfp: avoid newline at end of message in NL_SET_ERR_MSG_MOD Fix the following coccicheck warning: ./drivers/net/ethernet/netronome/nfp/flower/qos_conf.c:750:7-55: WARNING avoid newline at end of message in NL_SET_ERR_MSG_MOD Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20220301112356.1820985-1-wanjiabing@vivo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-01 17:42:26 -08:00
Harold Huang	fb3f903769	tun: support NAPI for packets received from batched XDP buffs In tun, NAPI is supported and we can also use NAPI in the path of batched XDP buffs to accelerate packet processing. What is more, after we use NAPI, GRO is also supported. The iperf test shows that the throughput of a single stream could be improved from 4.5Gbps to 9.2Gbps. Additionally, 9.2 Gbps nearly reaches the line speed of the phy nic and there is still about 15% idle cpu core remaining on the vhost thread. Test topology: [iperf server]<--->tap<--->dpdk testpmd<--->phy nic<--->[iperf client] Iperf stream: iperf3 -c 10.0.0.2 -i 1 -t 10 Before: ... [ 5] 5.00-6.00 sec 558 MBytes 4.68 Gbits/sec 0 1.50 MBytes [ 5] 6.00-7.00 sec 556 MBytes 4.67 Gbits/sec 1 1.35 MBytes [ 5] 7.00-8.00 sec 556 MBytes 4.67 Gbits/sec 2 1.18 MBytes [ 5] 8.00-9.00 sec 559 MBytes 4.69 Gbits/sec 0 1.48 MBytes [ 5] 9.00-10.00 sec 556 MBytes 4.67 Gbits/sec 1 1.33 MBytes - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-10.00 sec 5.39 GBytes 4.63 Gbits/sec 72 sender [ 5] 0.00-10.04 sec 5.39 GBytes 4.61 Gbits/sec receiver After: ... [ 5] 5.00-6.00 sec 1.07 GBytes 9.19 Gbits/sec 0 1.55 MBytes [ 5] 6.00-7.00 sec 1.08 GBytes 9.30 Gbits/sec 0 1.63 MBytes [ 5] 7.00-8.00 sec 1.08 GBytes 9.25 Gbits/sec 0 1.72 MBytes [ 5] 8.00-9.00 sec 1.08 GBytes 9.25 Gbits/sec 77 1.31 MBytes [ 5] 9.00-10.00 sec 1.08 GBytes 9.24 Gbits/sec 0 1.48 MBytes - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-10.00 sec 10.8 GBytes 9.28 Gbits/sec 166 sender [ 5] 0.00-10.04 sec 10.8 GBytes 9.24 Gbits/sec receiver Reported-at: https://lore.kernel.org/all/CACGkMEvTLG0Ayg+TtbN4q4pPW-ycgCCs3sC3-TF8cuRTf7Pp1A@mail.gmail.com Signed-off-by: Harold Huang <baymaxhuang@gmail.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20220228033805.1579435-1-baymaxhuang@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-01 17:27:53 -08:00
Jakub Kicinski	422ce83667	Merge branch 'sfc-optimize-rxqs-count-and-affinities' Íñigo Huguet says: ==================== sfc: optimize RXQs count and affinities In the sfc driver, one RX queue per physical core was allocated by default. Later on, IRQ affinities were set spreading the IRQs in all NUMA local CPUs. However, that default results in a suboptimal configuration on many modern systems. Specifically, in systems with hyper threading and 2 NUMA nodes, affinities are set in a way that IRQs are handled by all logical cores of one same NUMA node. Handling IRQs from both hyper threading siblings has no benefit, and setting affinities to one queue per physical core is not a very good idea either, because there is a performance penalty for moving data across nodes (I was able to check it with some XDP tests using pktgen). These patches reduce the default number of channels to one per physical core in the local NUMA node. Then, they set IRQ affinities to CPUs in the local NUMA node only. This way we save hardware resources since channels are limited resources. We also leave more room for XDP_TX channels without hitting the driver's limit of 32 channels per interface. Running performance tests using iperf with a SFC9140 device showed no performance penalty for reducing the number of channels. RX XDP tests showed that performance can go down to less than half if the IRQ is handled by a CPU in a different NUMA node, which doesn't happen with the new defaults from these patches. ==================== Link: https://lore.kernel.org/r/20220228132254.25787-1-ihuguet@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-01 17:12:47 -08:00
Íñigo Huguet	09a99ab16c	sfc: set affinity hints in local NUMA node only Affinity hints were being set to CPUs in local NUMA node first, and then in other CPUs. This was creating 2 unintended issues: 1. Channels created to be assigned each to a different physical core were assigned to hyperthreading siblings because of being in same NUMA node. Since the patch previous to this one, this did not longer happen with default rss_cpus modparam because less channels are created. 2. XDP channels could be assigned to CPUs in different NUMA nodes, decreasing performance too much (to less than half in some of my tests). This patch sets the affinity hints spreading the channels only in local NUMA node's CPUs. A fallback for the case that no CPU in local NUMA node is online has been added too. Example of CPUs being assigned in a non optimal way before this and the previous patch (note: in this system, xdp-8 to xdp-15 are created because num_possible_cpus == 64, but num_present_cpus == 32 so they're never used): $ lscpu \| grep -i numa NUMA node(s): 2 NUMA node0 CPU(s): 0-7,16-23 NUMA node1 CPU(s): 8-15,24-31 $ grep -H . 
/proc/irq//0000:07:00.0/../smp_affinity_list /proc/irq/141/0000:07:00.0-0/../smp_affinity_list:0 /proc/irq/142/0000:07:00.0-1/../smp_affinity_list:1 /proc/irq/143/0000:07:00.0-2/../smp_affinity_list:2 /proc/irq/144/0000:07:00.0-3/../smp_affinity_list:3 /proc/irq/145/0000:07:00.0-4/../smp_affinity_list:4 /proc/irq/146/0000:07:00.0-5/../smp_affinity_list:5 /proc/irq/147/0000:07:00.0-6/../smp_affinity_list:6 /proc/irq/148/0000:07:00.0-7/../smp_affinity_list:7 /proc/irq/149/0000:07:00.0-8/../smp_affinity_list:16 /proc/irq/150/0000:07:00.0-9/../smp_affinity_list:17 /proc/irq/151/0000:07:00.0-10/../smp_affinity_list:18 /proc/irq/152/0000:07:00.0-11/../smp_affinity_list:19 /proc/irq/153/0000:07:00.0-12/../smp_affinity_list:20 /proc/irq/154/0000:07:00.0-13/../smp_affinity_list:21 /proc/irq/155/0000:07:00.0-14/../smp_affinity_list:22 /proc/irq/156/0000:07:00.0-15/../smp_affinity_list:23 /proc/irq/157/0000:07:00.0-xdp-0/../smp_affinity_list:8 /proc/irq/158/0000:07:00.0-xdp-1/../smp_affinity_list:9 /proc/irq/159/0000:07:00.0-xdp-2/../smp_affinity_list:10 /proc/irq/160/0000:07:00.0-xdp-3/../smp_affinity_list:11 /proc/irq/161/0000:07:00.0-xdp-4/../smp_affinity_list:12 /proc/irq/162/0000:07:00.0-xdp-5/../smp_affinity_list:13 /proc/irq/163/0000:07:00.0-xdp-6/../smp_affinity_list:14 /proc/irq/164/0000:07:00.0-xdp-7/../smp_affinity_list:15 /proc/irq/165/0000:07:00.0-xdp-8/../smp_affinity_list:24 /proc/irq/166/0000:07:00.0-xdp-9/../smp_affinity_list:25 /proc/irq/167/0000:07:00.0-xdp-10/../smp_affinity_list:26 /proc/irq/168/0000:07:00.0-xdp-11/../smp_affinity_list:27 /proc/irq/169/0000:07:00.0-xdp-12/../smp_affinity_list:28 /proc/irq/170/0000:07:00.0-xdp-13/../smp_affinity_list:29 /proc/irq/171/0000:07:00.0-xdp-14/../smp_affinity_list:30 /proc/irq/172/0000:07:00.0-xdp-15/../smp_affinity_list:31 CPUs assignments after this and previous patch, so normal channels created only one per core in NUMA node and affinities set only to local NUMA node: $ grep -H . 
/proc/irq//0000:07:00.0/../smp_affinity_list /proc/irq/116/0000:07:00.0-0/../smp_affinity_list:0 /proc/irq/117/0000:07:00.0-1/../smp_affinity_list:1 /proc/irq/118/0000:07:00.0-2/../smp_affinity_list:2 /proc/irq/119/0000:07:00.0-3/../smp_affinity_list:3 /proc/irq/120/0000:07:00.0-4/../smp_affinity_list:4 /proc/irq/121/0000:07:00.0-5/../smp_affinity_list:5 /proc/irq/122/0000:07:00.0-6/../smp_affinity_list:6 /proc/irq/123/0000:07:00.0-7/../smp_affinity_list:7 /proc/irq/124/0000:07:00.0-xdp-0/../smp_affinity_list:16 /proc/irq/125/0000:07:00.0-xdp-1/../smp_affinity_list:17 /proc/irq/126/0000:07:00.0-xdp-2/../smp_affinity_list:18 /proc/irq/127/0000:07:00.0-xdp-3/../smp_affinity_list:19 /proc/irq/128/0000:07:00.0-xdp-4/../smp_affinity_list:20 /proc/irq/129/0000:07:00.0-xdp-5/../smp_affinity_list:21 /proc/irq/130/0000:07:00.0-xdp-6/../smp_affinity_list:22 /proc/irq/131/0000:07:00.0-xdp-7/../smp_affinity_list:23 /proc/irq/132/0000:07:00.0-xdp-8/../smp_affinity_list:0 /proc/irq/133/0000:07:00.0-xdp-9/../smp_affinity_list:1 /proc/irq/134/0000:07:00.0-xdp-10/../smp_affinity_list:2 /proc/irq/135/0000:07:00.0-xdp-11/../smp_affinity_list:3 /proc/irq/136/0000:07:00.0-xdp-12/../smp_affinity_list:4 /proc/irq/137/0000:07:00.0-xdp-13/../smp_affinity_list:5 /proc/irq/138/0000:07:00.0-xdp-14/../smp_affinity_list:6 /proc/irq/139/0000:07:00.0-xdp-15/../smp_affinity_list:7 Signed-off-by: Íñigo Huguet <ihuguet@redhat.com> Acked-by: Martin Habets <habetsm.xilinx@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-01 17:12:38 -08:00
Íñigo Huguet	c265b569a4	sfc: default config to 1 channel/core in local NUMA node only Handling channels from CPUs in a different NUMA node can penalize performance, so it is better to configure only one channel per core in the same NUMA node as the NIC, rather than one per core in the system. Fall back to all other online cores if there are no online CPUs in the local NUMA node. Signed-off-by: Íñigo Huguet <ihuguet@redhat.com> Acked-by: Martin Habets <habetsm.xilinx@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-01 17:12:38 -08:00
Jakub Kicinski	ef739f1dd3	net: smc: fix different types in min() Fix build: include/linux/minmax.h:45:25: note: in expansion of macro ‘__careful_cmp’ 45 \| #define min(x, y) __careful_cmp(x, y, <) \| ^~~~~~~~~~~~~ net/smc/smc_tx.c:150:24: note: in expansion of macro ‘min’ 150 \| corking_size = min(sock_net(&smc->sk)->smc.sysctl_autocorking_size, \| ^~~ Fixes: `12bbb0d163` ("net/smc: add sysctl for autocorking") Link: https://lore.kernel.org/r/20220301222446.1271127-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-03-01 16:43:27 -08:00
Mateusz Palczewski	0a62b20989	iavf: Remove non-inclusive language Remove non-inclusive language from the iavf driver. Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2022-03-01 08:50:11 -08:00
Mateusz Palczewski	8fc16be67d	iavf: Fix incorrect use of assigning iavf_status to int Currently there are functions in iavf_virtchnl.c for polling specific virtchnl receive events. These are all assigning iavf_status values to int values. Fix this and explicitly assign int values if iavf_status is not IAVF_SUCCESS. Also, refactor a small amount of duplicated code that can be reused by all of the previously mentioned functions. Finally, fix some spacing errors for variable assignment and get rid of all the goto statements in the refactored functions for clarity. Signed-off-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2022-03-01 08:50:11 -08:00
Mateusz Palczewski	bae569d01a	iavf: stop leaking iavf_status as "errno" values Several functions in the iAVF core files take status values of the enum iavf_status and convert them into integer values. This leads to confusion as functions return both Linux errno values and status codes intermixed. Reporting status codes as if they were "errno" values can lead to confusion when reviewing error logs. Additionally, it can lead to unexpected behavior if a return value is not interpreted properly. Fix this by introducing iavf_status_to_errno, a switch that explicitly converts from the status codes into an appropriate error value. Also introduce a virtchnl_status_to_errno function for the one case where we were returning both virtchnl status codes and iavf_status codes in the same function. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2022-03-01 08:50:11 -08:00
Minghao Chi	c3fec56e12	iavf: remove redundant ret variable Return value directly instead of taking this in another redundant variable. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn> Signed-off-by: CGEL ZTE <cgel.zte@gmail.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2022-03-01 08:50:11 -08:00
Mateusz Palczewski	a3e839d539	iavf: Add usage of new virtchnl format to set default MAC Use new type field of VIRTCHNL_OP_ADD_ETH_ADDR and VIRTCHNL_OP_DEL_ETH_ADDR requests to indicate that VF wants to change its default MAC address. Signed-off-by: Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2022-03-01 08:50:11 -08:00
Mateusz Palczewski	87dba256c7	iavf: refactor processing of VLAN V2 capability message In order to handle the capability exchange necessary for VIRTCHNL_VF_OFFLOAD_VLAN_V2, the driver must send a VIRTCHNL_OP_GET_OFFLOAD_VLAN_V2_CAPS message. This must occur prior to __IAVF_CONFIG_ADAPTER, and the driver must wait for the response from the PF. To handle this, the __IAVF_INIT_GET_OFFLOAD_VLAN_V2_CAPS state was introduced. This state is intended to process the response from the VLAN V2 caps message. This works ok, but is difficult to extend to more extended capability exchanges. Existing (and future) AVF features are relying more and more on these sorts of extended ops for processing additional capabilities. Just like VLAN V2, this exchange must happen prior to __IAVF_CONFIG_ADAPTER. Since we only send one outstanding AQ message at a time during init, it is not clear where to place this state. Adding more capability specific states becomes a mess. Instead of having the "previous" state send a message and then transition into a capability-specific state, introduce __IAVF_EXTENDED_CAPS state. This state will use a list of extended_caps that determines what messages to send and receive. As long as there are extended_caps bits still set, the driver will remain in this state performing one send or one receive per state machine loop. Refactor the VLAN V2 negotiation to use this new state, and remove the capability-specific state. This makes it significantly easier to add a new similar capability exchange going forward. Extended capabilities are processed by having an associated SEND and RECV extended capability bit. During __IAVF_EXTENDED_CAPS, the driver checks these bits in order by feature, first the send bit for a feature, then the recv bit for a feature. Each send flag will call a function that sends the necessary response, while each receive flag will wait for the response from the PF. If a given feature can't be negotiated with the PF, the associated flags will be cleared in order to skip processing of that feature. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2022-03-01 08:50:11 -08:00
Mateusz Palczewski	d73dd1275e	iavf: Add support for 50G/100G in AIM algorithm Advanced link speed support was added long back, but adding AIM support was missed. This patch adds AIM support for advanced link speed support, which allows the algorithm to take into account 50G/100G link speeds. Also, other previous speeds are taken into consideration when advanced link speeds are supported. Signed-off-by: Brett Creeley <brett.creeley@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>	2022-03-01 08:50:10 -08:00
David S. Miller	7282c126f7	Merge branch 'smc-datapath-opts' Dust Li says: ==================== net/smc: some datapath performance optimizations This series tries to improve the performance of SMC in datapath. - patch #1, add sysctl interface to support tuning the behaviour of SMC in container environment. - patch #2/#3, add autocorking support which is very efficient for small messages without trade-off for latency. - patch #4, send directly on setting TCP_NODELAY, without wake up the TX worker, this make it consistent with clearing TCP_CORK. - patch #5, this correct the setting of RMB window update limit, so we don't send CDC messages to update peer's RMB window too frequently in some cases. - patch #6, implemented something like NAPI in SMC, decrease the number of hardirq when busy. - patch #7, this moves TX work doing in the BH to the user context when sock_lock is hold by user. With this patchset applied, we can get a good performance gain: - qperf tcp_bw test has shown a great improvement. Other benchmarks like 'netperf TCP_STREAM' or 'sockperf throughput' has similar result. - In my testing environment, running qperf tcp_bw and tcp_lat, SMC behaves better then TCP in most all message size. 
Here are some test results with the following testing command: client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \ -t 30 -vu tcp_{bw\|lat} server: smc_run taskset -c 1 qperf ==== Bandwidth ==== MsgSize Origin SMC TCP SMC with patches 1 0.578 MB/s 2.392 MB/s(313.57%) 2.561 MB/s(342.83%) 2 1.159 MB/s 4.780 MB/s(312.53%) 5.162 MB/s(345.46%) 4 2.283 MB/s 10.266 MB/s(349.77%) 10.122 MB/s(343.46%) 8 4.668 MB/s 19.040 MB/s(307.86%) 20.521 MB/s(339.59%) 16 9.147 MB/s 38.904 MB/s(325.31%) 40.823 MB/s(346.29%) 32 18.369 MB/s 79.587 MB/s(333.25%) 80.535 MB/s(338.42%) 64 36.562 MB/s 148.668 MB/s(306.61%) 158.170 MB/s(332.60%) 128 72.961 MB/s 274.913 MB/s(276.80%) 316.217 MB/s(333.41%) 256 144.705 MB/s 512.059 MB/s(253.86%) 626.019 MB/s(332.62%) 512 288.873 MB/s 884.977 MB/s(206.35%) 1221.596 MB/s(322.88%) 1024 574.180 MB/s 1337.736 MB/s(132.98%) 2203.156 MB/s(283.70%) 2048 1095.192 MB/s 1865.952 MB/s( 70.38%) 3036.448 MB/s(177.25%) 4096 2066.157 MB/s 2380.337 MB/s( 15.21%) 3834.271 MB/s( 85.58%) 8192 3717.198 MB/s 2733.073 MB/s(-26.47%) 4904.910 MB/s( 31.95%) 16384 4742.221 MB/s 2958.693 MB/s(-37.61%) 5220.272 MB/s( 10.08%) 32768 5349.550 MB/s 3061.285 MB/s(-42.77%) 5321.865 MB/s( -0.52%) 65536 5162.919 MB/s 3731.408 MB/s(-27.73%) 5245.021 MB/s( 1.59%) ==== Latency ==== MsgSize Origin SMC TCP SMC with patches 1 10.540 us 11.938 us( 13.26%) 10.356 us( -1.75%) 2 10.996 us 11.992 us( 9.06%) 10.073 us( -8.39%) 4 10.229 us 11.687 us( 14.25%) 9.996 us( -2.28%) 8 10.203 us 11.653 us( 14.21%) 10.063 us( -1.37%) 16 10.530 us 11.313 us( 7.44%) 10.013 us( -4.91%) 32 10.241 us 11.586 us( 13.13%) 10.081 us( -1.56%) 64 10.693 us 11.652 us( 8.97%) 9.986 us( -6.61%) 128 10.597 us 11.579 us( 9.27%) 10.262 us( -3.16%) 256 10.409 us 11.957 us( 14.87%) 10.148 us( -2.51%) 512 11.088 us 12.505 us( 12.78%) 10.206 us( -7.95%) 1024 11.240 us 12.255 us( 9.03%) 10.631 us( -5.42%) 2048 11.485 us 16.970 us( 47.76%) 10.981 us( -4.39%) 4096 12.077 us 13.948 us( 15.49%) 11.847 us( -1.90%) 
8192 13.683 us 16.693 us( 22.00%) 13.336 us( -2.54%) 16384 16.470 us 23.615 us( 43.38%) 16.519 us( 0.30%) 32768 22.540 us 40.966 us( 81.75%) 22.452 us( -0.39%) 65536 34.192 us 73.003 us(113.51%) 33.916 us( -0.81%) ------------ Test environment notes: 1. Testing is run on 2 VMs within the same physical host 2. The NIC is ConnectX-4Lx, using SRIOV, and passing through 2 VFs to the 2 VMs respectively. 3. To decrease jitter, VM's vCPU are binded to each physical CPU, and those physical CPUs are all isolated using boot parameter `isolcpus=xxx` 4. The queue number are set to 1, and interrupt from the queue is binded to CPU0 in the guest ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2022-03-01 14:25:12 +00:00
Dust Li	6b88af839d	net/smc: don't send in the BH context if sock_owned_by_user Sending data all the way down to the RDMA device is a time-consuming operation (get a new slot, maybe do an RDMA Write and send a CDC, etc). Moving those operations from BH to user context is good for performance. If the sock_lock is held by the user, we don't try to send data out in the BH context, but just mark that we should send. Since the user will release the sock_lock soon, we can do the sending there. Add smc_release_cb(), which will be called in release_sock() and tries to send in the callback if needed. This patch moves the sending part out of BH if the sock lock is held by the user. In my testing environment, this saves about 20% softirq in the qperf 4K tcp_bw test on the sender side with no noticeable throughput drop. Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-03-01 14:25:12 +00:00
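The deferral described in the entry above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct sock_sketch`, `bh_tx_event()` and `release_sock_sketch()` are hypothetical stand-ins, not the kernel's `struct sock`, the SMC BH handler, or `release_sock()`, and the `sends` counter merely marks where the real code would push data to the RDMA device.

```c
#include <stdbool.h>

/* Hypothetical model of the socket state relevant to the deferral. */
struct sock_sketch {
    bool owned_by_user; /* a user-context caller holds the sock lock */
    bool tx_deferred;   /* BH context asked for a send to happen later */
    int sends;          /* counts where the real code would transmit */
};

/* Models the BH (softirq) side: send only if no user holds the lock. */
static void bh_tx_event(struct sock_sketch *sk)
{
    if (sk->owned_by_user) {
        sk->tx_deferred = true; /* just mark it, the lock owner sends */
        return;
    }
    sk->sends++;
}

/* Models the release path: run the deferred send, then drop the lock. */
static void release_sock_sketch(struct sock_sketch *sk)
{
    if (sk->tx_deferred) {
        sk->tx_deferred = false;
        sk->sends++; /* the send now happens in user context */
    }
    sk->owned_by_user = false;
}
```

In this model a BH event that arrives while the lock is held costs only a flag write; the transmit work is paid once, by the lock owner, when it releases the socket.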
Dust Li	a505cce6f7	net/smc: don't req_notify until all CQEs drained When we are handling softirq workload, enabling hardirq may interrupt the current softirq routine again and then try to raise softirq again. This only wastes CPU cycles and won't have any real gain. Since IB_CQ_REPORT_MISSED_EVENTS already makes sure that if ib_req_notify_cq() returns 0 it is safe to wait for the next event, there is no need to poll the CQ again in this case. This patch disables hardirq during the processing of softirq and re-arms the CQ after softirq is done, somewhat like NAPI. Co-developed-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-03-01 14:25:12 +00:00
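The drain-then-re-arm pattern from the entry above might look roughly like the following user-space model. It is a sketch, not the patch itself: `struct fake_cq`, `fake_poll()` and `fake_req_notify()` are hypothetical stand-ins for a completion queue, `ib_poll_cq()` and `ib_req_notify_cq()` with `IB_CQ_REPORT_MISSED_EVENTS`, and the `late` field is an assumption used to simulate a completion that arrives between the last poll and the re-arm.

```c
#include <stdbool.h>

/* Hypothetical completion queue model. */
struct fake_cq {
    int pending; /* completions waiting to be polled */
    int late;    /* completions that race with the re-arm */
    bool armed;  /* true once the next completion would raise an event */
};

/* Stand-in for polling: consume up to budget completions. */
static int fake_poll(struct fake_cq *cq, int budget)
{
    int n = cq->pending < budget ? cq->pending : budget;

    cq->pending -= n;
    return n;
}

/* Stand-in for re-arming: a non-zero return means "poll again". */
static int fake_req_notify(struct fake_cq *cq)
{
    if (cq->late > 0) {
        cq->pending += cq->late;
        cq->late = 0;
        return 1; /* completions may have been missed */
    }
    cq->armed = true;
    return 0; /* safe to wait for the next event */
}

/* Drain everything first, re-arm only once the queue is empty. */
static int drain_then_rearm(struct fake_cq *cq)
{
    int handled = 0;
    int n;

    for (;;) {
        while ((n = fake_poll(cq, 8)) > 0)
            handled += n;
        if (fake_req_notify(cq) == 0)
            break;
    }
    return handled;
}
```

The point of the ordering is that notifications stay off while completions are being processed, so no interrupt is raised for work the loop is about to pick up anyway.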
Dust Li	6bf536eb5c	net/smc: correct settings of RMB window update limit rmbe_update_limit is used to keep receive window updates from being announced too frequently. RFC 7609 requests a minimal increase in the window size of 10% of the receive buffer space. But the current implementation used: min_t(int, rmbe_size / 10, SOCK_MIN_SNDBUF / 2) and SOCK_MIN_SNDBUF / 2 == 2304 bytes, which is almost always less than 10% of the receive buffer space. This causes the receiver to always send a CDC message to update its consumer cursor when it consumes more than 2K of data. As a result, we may encounter something like "TCP silly window syndrome" when sending 2.5~8K messages. This patch fixes this by using max(rmbe_size / 10, SOCK_MIN_SNDBUF / 2). With this patch and SMC autocorking enabled, the qperf 2K/4K/8K tcp_bw tests show 45%/75%/40% increases in throughput respectively. Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-03-01 14:25:12 +00:00
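The arithmetic behind the min/max change in the entry above can be checked with a short sketch. The constant 2304 is the value the commit message gives for SOCK_MIN_SNDBUF / 2; the function names `update_limit_old()` and `update_limit_new()` are hypothetical and only mirror the two expressions quoted in the message.

```c
/* Value quoted in the commit message: SOCK_MIN_SNDBUF / 2 == 2304 bytes. */
#define HALF_SOCK_MIN_SNDBUF 2304

/* Old expression: min(rmbe_size / 10, SOCK_MIN_SNDBUF / 2). */
static int update_limit_old(int rmbe_size)
{
    int tenth = rmbe_size / 10;

    return tenth < HALF_SOCK_MIN_SNDBUF ? tenth : HALF_SOCK_MIN_SNDBUF;
}

/* New expression: max(rmbe_size / 10, SOCK_MIN_SNDBUF / 2). */
static int update_limit_new(int rmbe_size)
{
    int tenth = rmbe_size / 10;

    return tenth > HALF_SOCK_MIN_SNDBUF ? tenth : HALF_SOCK_MIN_SNDBUF;
}
```

For a 64 KiB buffer (an example size chosen here, not taken from the commit), the old expression caps the limit at 2304 bytes while the new one yields 6553 bytes, which is the 10% figure the message attributes to RFC 7609.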
Dust Li	b70a5cc045	net/smc: send directly on setting TCP_NODELAY In commit ea785a1a573b ("net/smc: Send directly when TCP_CORK is cleared"), we no longer use delayed work to implement cork. This patch uses the same algorithm: it removes the delayed work when setting TCP_NODELAY and sends directly in setsockopt(). This also makes TCP_NODELAY behave the same as in TCP. Cc: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-03-01 14:25:12 +00:00
Dust Li	12bbb0d163	net/smc: add sysctl for autocorking This adds a new sysctl: net.smc.autocorking_size. We can dynamically change the behaviour of autocorking by changing the value of autocorking_size. Setting it to 0 disables autocorking in SMC. Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-03-01 14:25:12 +00:00
Dust Li	dcd2cf5f2f	net/smc: add autocorking support This patch adds autocorking support for SMC, which can improve throughput for small messages by x3+. The main idea is borrowed from TCP autocorking with some RDMA-specific modifications: 1. The first message should never be corked, to make sure we won't add extra latency 2. If we have posted any Tx WRs to the NIC that have not completed, cork the new messages until: a) We receive the CQE for the last Tx WR b) We have corked enough messages on the connection 3. Try to push the corked data out when we receive the CQE of the last Tx WR, to prevent the corked messages from hanging in the send queue. Both SMC autocorking and TCP autocorking check the TX completion to decide whether we should cork or not. The difference is that when we get an SMC Tx WR completion, the data has been confirmed by the RNIC, while a TCP TX completion just tells us the data has been sent out by the local NIC. Add an atomic variable tx_pushing in smc_connection to make sure only one can send at a time, to let it cork more and save CDC slots. SMC autocorking should not add extra latency since the first message will always be sent out immediately. The qperf tcp_bw test shows more than x4 increase under small message sizes with Mellanox connectX4-Lx, with the same result in other throughput benchmarks like sockperf/netperf. The qperf tcp_lat test shows SMC autocorking has not increased any ping-pong latency.
Test command:

client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
        -t 30 -vu tcp_{bw|lat}
server: smc_run taskset -c 1 qperf

==== Bandwidth ====
MsgSize(Bytes)  SMC-NoCork     TCP                     SMC-AutoCorking
1                  0.578 MB/s     2.392 MB/s(313.57%)     2.647 MB/s(357.72%)
2                  1.159 MB/s     4.780 MB/s(312.53%)     5.153 MB/s(344.71%)
4                  2.283 MB/s    10.266 MB/s(349.77%)    10.363 MB/s(354.02%)
8                  4.668 MB/s    19.040 MB/s(307.86%)    21.215 MB/s(354.45%)
16                 9.147 MB/s    38.904 MB/s(325.31%)    41.740 MB/s(356.32%)
32                18.369 MB/s    79.587 MB/s(333.25%)    82.392 MB/s(348.52%)
64                36.562 MB/s   148.668 MB/s(306.61%)   161.564 MB/s(341.89%)
128               72.961 MB/s   274.913 MB/s(276.80%)   325.363 MB/s(345.94%)
256              144.705 MB/s   512.059 MB/s(253.86%)   633.743 MB/s(337.96%)
512              288.873 MB/s   884.977 MB/s(206.35%)  1250.681 MB/s(332.95%)
1024             574.180 MB/s  1337.736 MB/s(132.98%)  2246.121 MB/s(291.19%)
2048            1095.192 MB/s  1865.952 MB/s( 70.38%)  2057.767 MB/s( 87.89%)
4096            2066.157 MB/s  2380.337 MB/s( 15.21%)  2173.983 MB/s(  5.22%)
8192            3717.198 MB/s  2733.073 MB/s(-26.47%)  3491.223 MB/s( -6.08%)
16384           4742.221 MB/s  2958.693 MB/s(-37.61%)  4637.692 MB/s( -2.20%)
32768           5349.550 MB/s  3061.285 MB/s(-42.77%)  5385.796 MB/s(  0.68%)
65536           5162.919 MB/s  3731.408 MB/s(-27.73%)  5223.890 MB/s(  1.18%)

==== Latency ====
MsgSize(Bytes)  SMC-NoCork   TCP                  SMC-AutoCorking
1               10.540 us    11.938 us( 13.26%)   10.573 us(  0.31%)
2               10.996 us    11.992 us(  9.06%)   10.269 us( -6.61%)
4               10.229 us    11.687 us( 14.25%)   10.240 us(  0.11%)
8               10.203 us    11.653 us( 14.21%)   10.402 us(  1.95%)
16              10.530 us    11.313 us(  7.44%)   10.599 us(  0.66%)
32              10.241 us    11.586 us( 13.13%)   10.223 us( -0.18%)
64              10.693 us    11.652 us(  8.97%)   10.251 us( -4.13%)
128             10.597 us    11.579 us(  9.27%)   10.494 us( -0.97%)
256             10.409 us    11.957 us( 14.87%)   10.710 us(  2.89%)
512             11.088 us    12.505 us( 12.78%)   10.547 us( -4.88%)
1024            11.240 us    12.255 us(  9.03%)   10.787 us( -4.03%)
2048            11.485 us    16.970 us( 47.76%)   11.256 us( -1.99%)
4096            12.077 us    13.948 us( 15.49%)   12.230 us(  1.27%)
8192            13.683 us    16.693 us( 22.00%)   13.786 us(  0.75%)
16384           16.470 us    23.615 us( 43.38%)   16.459 us( -0.07%)
32768           22.540 us    40.966 us( 81.75%)   23.284 us(  3.30%)
65536           34.192 us    73.003 us(113.51%)   34.233 us(  0.12%)

With SMC autocorking support, we can achieve better throughput than TCP in most message sizes without any latency trade-off. Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-03-01 14:25:12 +00:00
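The corking rules listed in the autocorking entry above can be condensed into a single predicate. What follows is a minimal sketch, not the kernel code: `struct conn_sketch`, its fields and `should_autocork()` are hypothetical names, and treating an `autocorking_size` of 0 as "disabled" follows the description of the net.smc.autocorking_size sysctl in the neighbouring entry.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-connection state used by the decision. */
struct conn_sketch {
    size_t autocorking_size;  /* cork limit in bytes, 0 disables corking */
    unsigned int tx_inflight; /* posted Tx WRs that have no CQE yet */
    size_t corked_bytes;      /* data already held back on the connection */
};

/* Returns true when a new message should be held back (corked). */
static bool should_autocork(const struct conn_sketch *c)
{
    if (c->autocorking_size == 0)
        return false; /* autocorking switched off */
    if (c->tx_inflight == 0)
        return false; /* rules 1 and 2a: nothing in flight, send now */
    if (c->corked_bytes >= c->autocorking_size)
        return false; /* rule 2b: enough data corked, push it out */
    return true;
}
```

Because the predicate is false whenever nothing is in flight, the first message on an idle connection always goes out immediately, which matches the claim that autocorking adds no latency to ping-pong traffic.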
Dust Li	462791bbfa	net/smc: add sysctl interface for SMC This patch adds a sysctl interface to support container environments for SMC, as discussed on the mailing list. Link: https://lore.kernel.org/netdev/20220224020253.GF5443@linux.alibaba.com Co-developed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-03-01 14:25:12 +00:00
David S. Miller	1e385c0824	Merge branch 'vxlan-vnifiltering' Roopa Prabhu says: ==================== vxlan metadata device vnifiltering support This series adds vnifiltering support to vxlan collect metadata device. Motivation: You can only use a single vxlan collect metadata device for a given vxlan udp port in the system today. The vxlan collect metadata device terminates all received vxlan packets. As shown in the below diagram, there are use-cases where you need to support multiple such vxlan devices in independent bridge domains. Each vxlan device must terminate the vni's it is configured for. Example usecase: In a service provider network a service provider typically supports multiple bridge domains with overlapping vlans. One bridge domain per customer. Vlans in each bridge domain are mapped to globally unique vxlan ranges assigned to each customer. This series adds vnifiltering support to collect metadata devices to terminate only configured vnis. This is similar to vlan filtering in bridge driver. The vni filtering capability is provided by a new flag on collect metadata device. 
In the below pic:
- customer1 is mapped to br1 bridge domain
- customer2 is mapped to br2 bridge domain
- customer1 vlan 10-11 is mapped to vni 1001-1002
- customer2 vlan 10-11 is mapped to vni 2001-2002
- br1 and br2 are vlan filtering bridges
- vxlan1 and vxlan2 are collect metadata devices with vnifiltering enabled

┌──────────────────────────────────────────────────────────────────┐
│  switch                                                          │
│                                                                  │
│         ┌───────────┐                 ┌───────────┐              │
│         │           │                 │           │              │
│         │   br1     │                 │   br2     │              │
│         └┬─────────┬┘                 └──┬───────┬┘              │
│     vlans│         │                vlans│       │               │
│     10,11│         │                10,11│       │               │
│          │     vlanvnimap:               │    vlanvnimap:        │
│          │       10-1001,11-1002         │      10-2001,11-2002  │
│          │         │                     │       │               │
│   ┌──────┴┐     ┌──┴─────────┐       ┌───┴────┐  │               │
│   │ swp1  │     │vxlan1      │       │ swp2   │ ┌┴─────────────┐ │
│   │       │     │  vnifilter:│       │        │ │vxlan2        │ │
│   └───┬───┘     │   1001,1002│       └───┬────┘ │ vnifilter:   │ │
│       │         └────────────┘           │      │  2001,2002   │ │
│       │                                  │      └──────────────┘ │
│       │                                  │                       │
└───────┼──────────────────────────────────┼───────────────────────┘
        │                                  │
        │                                  │
  ┌─────┴───────┐                          │
  │  customer1  │                    ┌─────┴──────┐
  │ host/VM     │                    │customer2   │
  └─────────────┘                    │ host/VM    │
                                     └────────────┘

v2:
- remove stale xstats declarations pointed out by Nikolay Aleksandrov
- squash selinux patch with the tunnel api patch as pointed out by benjamin poirier
- Fix various build issues: Reported-by: kernel test robot <lkp@intel.com>

v3:
- incorporate review feedback from Jakub
- move rhashtable declarations to c file
- define and use netlink policy for top level vxlan filter api
- fix unused stats function warning
- pass vninode from vnifilter lookup into stats count function to avoid another lookup (only applicable to vxlan_rcv)
- fix missing vxlan vni delete notifications in vnifilter uninit function
- misc cleanups
- remote dev check for multicast groups added via vnifiltering api
====================
Signed-off-by: David S. Miller <davem@davemloft.net>	2022-03-01 08:38:02 +00:00
Nikolay Aleksandrov	445b2f36bb	drivers: vxlan: vnifilter: add support for stats dumping Add support for VXLAN vni filter entries' stats dumping Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-03-01 08:38:02 +00:00
Nikolay Aleksandrov	4095e0e132	drivers: vxlan: vnifilter: per vni stats Add per-vni statistics for vni filter mode. Counting Rx/Tx bytes/packets/drops/errors at the appropriate places. This patch changes vxlan_vs_find_vni to also return the vxlan_vni_node in cases where the vni belongs to a vni filtering vxlan device Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-03-01 08:38:02 +00:00
Roopa Prabhu	3edf5f66c1	selftests: add new tests for vxlan vnifiltering This patch adds a new test script, test_vxlan_vnifiltering.sh, with tests for the vni filtering api and various datapath tests. It also has a test with a mix of traditional, metadata and vni filtering devices in use at the same time. Signed-off-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-03-01 08:38:02 +00:00
Roopa Prabhu	f9c4bb0b24	vxlan: vni filtering support on collect metadata device This patch adds vnifiltering support to collect metadata device. Motivation: You can only use a single vxlan collect metadata device for a given vxlan udp port in the system today. The vxlan collect metadata device terminates all received vxlan packets. As shown in the below diagram, there are use-cases where you need to support multiple such vxlan devices in independent bridge domains. Each vxlan device must terminate the vni's it is configured for. Example usecase: In a service provider network a service provider typically supports multiple bridge domains with overlapping vlans. One bridge domain per customer. Vlans in each bridge domain are mapped to globally unique vxlan ranges assigned to each customer. vnifiltering support in collect metadata devices terminates only configured vnis. This is similar to vlan filtering in bridge driver. The vni filtering capability is provided by a new flag on collect metadata device. 
In the below pic:
- customer1 is mapped to br1 bridge domain
- customer2 is mapped to br2 bridge domain
- customer1 vlan 10-11 is mapped to vni 1001-1002
- customer2 vlan 10-11 is mapped to vni 2001-2002
- br1 and br2 are vlan filtering bridges
- vxlan1 and vxlan2 are collect metadata devices with vnifiltering enabled

┌──────────────────────────────────────────────────────────────────┐
│  switch                                                          │
│                                                                  │
│         ┌───────────┐                 ┌───────────┐              │
│         │           │                 │           │              │
│         │   br1     │                 │   br2     │              │
│         └┬─────────┬┘                 └──┬───────┬┘              │
│     vlans│         │                vlans│       │               │
│     10,11│         │                10,11│       │               │
│          │     vlanvnimap:               │    vlanvnimap:        │
│          │       10-1001,11-1002         │      10-2001,11-2002  │
│          │         │                     │       │               │
│   ┌──────┴┐     ┌──┴─────────┐       ┌───┴────┐  │               │
│   │ swp1  │     │vxlan1      │       │ swp2   │ ┌┴─────────────┐ │
│   │       │     │  vnifilter:│       │        │ │vxlan2        │ │
│   └───┬───┘     │   1001,1002│       └───┬────┘ │ vnifilter:   │ │
│       │         └────────────┘           │      │  2001,2002   │ │
│       │                                  │      └──────────────┘ │
│       │                                  │                       │
└───────┼──────────────────────────────────┼───────────────────────┘
        │                                  │
        │                                  │
  ┌─────┴───────┐                          │
  │  customer1  │                    ┌─────┴──────┐
  │ host/VM     │                    │customer2   │
  └─────────────┘                    │ host/VM    │
                                     └────────────┘

With this implementation, a vxlan dst metadata device can be associated with a range of vnis. struct vxlan_vni_node is introduced to represent a configured vni. We start with the vni and its associated remote_ip in this structure. This structure can be extended to bring in other per-vni attributes if there are use cases for it. A vni inherits an attribute from the base vxlan device if no per-vni attribute is defined. struct vxlan_dev gets a new rhashtable for vnis called vxlan_vni_group. vxlan_vnifilter.c implements the necessary netlink api, notifications and helper functions to process and manage the lifecycle of vxlan_vni_node. This patch also adds new helper functions in vxlan_multicast.c to handle per-vni remote_ip multicast groups which are part of vxlan_vni_group. Fix build problems: Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-03-01 08:38:02 +00:00
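The filtering behaviour described in the entry above, where each collect metadata device terminates only the vnis configured on it, can be modelled as a simple lookup. This is an illustrative sketch only: `struct vni_dev_sketch` and both functions are hypothetical, and a linear array scan is swapped in for the rhashtable of vxlan_vni_node entries that the commit message mentions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a vxlan device with a vni filter list. */
struct vni_dev_sketch {
    const char *name;
    const uint32_t *vnis; /* configured vnis (the "vnifilter") */
    size_t n_vnis;
};

/* True if the given vni is configured on this device. */
static bool dev_has_vni(const struct vni_dev_sketch *dev, uint32_t vni)
{
    for (size_t i = 0; i < dev->n_vnis; i++) {
        if (dev->vnis[i] == vni)
            return true;
    }
    return false;
}

/* Returns the device that terminates the vni, or NULL if none does. */
static const struct vni_dev_sketch *
find_terminating_dev(const struct vni_dev_sketch *devs, size_t n_devs,
                     uint32_t vni)
{
    for (size_t i = 0; i < n_devs; i++) {
        if (dev_has_vni(&devs[i], vni))
            return &devs[i];
    }
    return NULL;
}
```

With the values from the diagram, a packet carrying vni 1002 would resolve to vxlan1, vni 2001 to vxlan2, and an unconfigured vni would not be terminated by either device.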
Roopa Prabhu	a498c5953a	vxlan_multicast: Move multicast helpers to a separate file subsequent patches will add more helpers. Signed-off-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-03-01 08:38:02 +00:00
Roopa Prabhu	7b8135f4df	rtnetlink: add new rtm tunnel api for tunnel id filtering This patch adds new rtm tunnel msg and api for tunnel id filtering in dst_metadata devices. First dst_metadata device to use the api is vxlan driver with AF_BRIDGE family. This and later changes add ability in vxlan driver to do tunnel id filtering (or vni filtering) on dst_metadata devices. This is similar to vlan api in the vlan filtering bridge. this patch includes selinux nlmsg_route_perms support for RTM_*TUNNEL api from Benjamin Poirier. Signed-off-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-03-01 08:38:02 +00:00
Roopa Prabhu	efe0f94b33	vxlan_core: add helper vxlan_vni_in_use more users in follow up patches Signed-off-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-03-01 08:38:02 +00:00
Roopa Prabhu	a9508d121a	vxlan_core: make multicast helper take rip and ifindex explicitly This patch changes multicast helpers to take rip and ifindex as input. This is needed in future patches where rip can come from a pervni structure while the ifindex can come from the vxlan device. Signed-off-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-03-01 08:38:02 +00:00
Roopa Prabhu	c63053e0cb	vxlan_core: move some fdb helpers to non-static This patch moves some fdb helpers to non-static for use in later patches. Ideally, all fdb code could move into its own file vxlan_fdb.c. This can be done as a subsequent patch and is out of scope of this series. Signed-off-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-03-01 08:38:02 +00:00
Roopa Prabhu	76fc217d7f	vxlan_core: move common declarations to private header file This patch moves common structures and global declarations to a shared private header file, vxlan_private.h. Subsequent patches use this header file as a common header file for additional shared declarations. Signed-off-by: Roopa Prabhu <roopa@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-03-01 08:38:02 +00:00

1075229 Commits