linux

Author	SHA1	Message	Date
Toke Høiland-Jørgensen	83f8fd69af	sch_cake: Add DiffServ handling This adds support for DiffServ-based priority queueing to CAKE. If the shaper is in use, each priority tier gets its own virtual clock, which limits that tier's rate to a fraction of the overall shaped rate, to discourage trying to game the priority mechanism. CAKE defaults to a simple, three-tier mode that interprets most code points as "best effort", but places CS1 traffic into a low-priority "bulk" tier which is assigned 1/16 of the total rate, and a few code points indicating latency-sensitive or control traffic (specifically TOS4, VA, EF, CS6, CS7) into a "latency sensitive" high-priority tier, which is assigned 1/4 rate. The other supported DiffServ modes are a 4-tier mode matching the 802.11e precedence rules, as well as two 8-tier modes, one of which implements strict precedence of the eight priority levels. This commit also adds an optional DiffServ 'wash' mode, which will zero out the DSCP fields of any packet passing through CAKE. While this can technically be done with other mechanisms in the kernel, having the feature available in CAKE significantly decreases configuration complexity; and the implementation cost is low on top of the other DiffServ-handling code. Filters and applications can set the skb->priority field to override the DSCP-based classification into tiers. If TC_H_MAJ(skb->priority) matches CAKE's qdisc handle, the minor number will be interpreted as a priority tier if it is less than or equal to the number of configured priority tiers. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:34 -07:00
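As a rough illustration of the default three-tier mode described in the commit above, the following Python sketch computes the per-tier rate limits and applies the skb->priority override rule. The fractions 1/16 and 1/4 come from the commit message. The function names, the full-rate limit for best effort, and the 1-based mapping of minor numbers to tiers are assumptions made for this sketch, not kernel code.

```python
# Illustrative model of CAKE's default three-tier DiffServ mode.
# Tier fractions for bulk and latency-sensitive are from the commit message;
# the best-effort fraction and all names here are assumptions.
TIER_FRACTIONS = {"bulk": 1 / 16, "best_effort": 1.0, "latency_sensitive": 1 / 4}
# Assumed order in which minor numbers 1..3 select a tier.
TIER_ORDER = ["bulk", "best_effort", "latency_sensitive"]


def tier_thresholds(shaped_rate_bps):
    """Return the per-tier rate limit for a given overall shaped rate."""
    return {name: int(shaped_rate_bps * frac) for name, frac in TIER_FRACTIONS.items()}


def tc_h_maj(handle):
    """Major part of a 32-bit tc handle (upper 16 bits)."""
    return handle & 0xFFFF0000


def tc_h_min(handle):
    """Minor part of a 32-bit tc handle (lower 16 bits)."""
    return handle & 0x0000FFFF


def tier_from_priority(skb_priority, qdisc_handle, dscp_tier):
    """Use the minor number as the tier if the major matches the qdisc handle."""
    minor = tc_h_min(skb_priority)
    if tc_h_maj(skb_priority) == qdisc_handle and 1 <= minor <= len(TIER_ORDER):
        return TIER_ORDER[minor - 1]
    # Otherwise keep the tier chosen from the DSCP field.
    return dscp_tier
```

With a 1 Gbit shaped rate this yields 62.5 Mbit for the bulk tier and 250 Mbit for the latency-sensitive tier; a priority whose major number does not match the qdisc handle leaves the DSCP-based tier unchanged.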
Toke Høiland-Jørgensen	ea82511518	sch_cake: Add NAT awareness to packet classifier When CAKE is deployed on a gateway that also performs NAT (which is a common deployment mode), the host fairness mechanism cannot distinguish internal hosts from each other, and so fails to work correctly. To fix this, we add an optional NAT awareness mode, which will query the kernel conntrack mechanism to obtain the pre-NAT addresses for each packet and use that in the flow and host hashing. When the shaper is enabled and the host is already performing NAT, the cost of this lookup is negligible. However, in unlimited mode with no NAT being performed, there is a significant CPU cost at higher bandwidths. For this reason, the feature is turned off by default. Cc: netfilter-devel@vger.kernel.org Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:34 -07:00
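To make the host fairness problem in the commit above concrete, here is a toy Python model. A plain dict stands in for the kernel conntrack lookup, and the addresses, ports and function name are invented for illustration only.

```python
# Toy model: why host fairness needs pre-NAT addresses on a NAT gateway.
# The conntrack table is a plain dict, a stand-in for the kernel lookup.

def host_key(src_addr, src_port, conntrack, nat_aware):
    """Return the address used for per-host fairness hashing."""
    if nat_aware:
        # Fall back to the on-wire address when no conntrack entry exists.
        return conntrack.get((src_addr, src_port), src_addr)
    return src_addr


# Two internal hosts share one public address after NAT (example values).
conntrack = {
    ("203.0.113.1", 40001): "192.168.1.10",
    ("203.0.113.1", 40002): "192.168.1.11",
}
packets = [("203.0.113.1", 40001), ("203.0.113.1", 40002)]

hosts_without_nat = {host_key(a, p, conntrack, nat_aware=False) for a, p in packets}
hosts_with_nat = {host_key(a, p, conntrack, nat_aware=True) for a, p in packets}
```

Without NAT awareness both flows collapse onto one host key, so they are treated as a single host; with it, the two internal hosts are distinguished again.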
Toke Høiland-Jørgensen	b60a60405f	netfilter: Add nf_ct_get_tuple_skb global lookup function This adds a global netfilter function to extract a conntrack tuple from an skb. The function uses a new function added to nf_ct_hook, which will try to get the tuple from skb->_nfct, and do a full lookup if that fails. This makes it possible to use the lookup function before the skb has passed through the conntrack init hooks (e.g., in an ingress qdisc). The tuple is copied to the caller to avoid issues with reference counting. The function returns false if conntrack is not loaded, allowing it to be used without incurring a module dependency on conntrack. This is used by the NAT mode in sch_cake. Cc: netfilter-devel@vger.kernel.org Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:34 -07:00
Toke Høiland-Jørgensen	8b7138814f	sch_cake: Add optional ACK filter The ACK filter is an optional feature of CAKE which is designed to improve performance on links with very asymmetrical rate limits. On such links (which are unfortunately quite prevalent, especially for DSL and cable subscribers), the downstream throughput can be limited by the number of ACKs capable of being transmitted in the upstream direction. Filtering ACKs can, in general, have adverse effects on TCP performance because it interferes with ACK clocking (especially in slow start), and it reduces the flow's resiliency to ACKs being dropped further along the path. To alleviate these drawbacks, the ACK filter in CAKE tries its best to always keep enough ACKs queued to ensure forward progress in the TCP flow being filtered. It does this by only filtering redundant ACKs. In its default 'conservative' mode, the filter will always keep at least two redundant ACKs in the queue, while in 'aggressive' mode, it will filter down to a single ACK. The ACK filter works by inspecting the per-flow queue on every packet enqueue. Starting at the head of the queue, the filter looks for another eligible packet to drop (so the ACK being dropped is always closer to the head of the queue than the packet being enqueued). An ACK is eligible only if it ACKs fewer bytes than the new packet being enqueued, including any SACK options. This prevents duplicate ACKs from being filtered, to avoid interfering with retransmission logic. In addition, we check TCP header options and only drop those that are known to not interfere with sender state. In particular, packets with unknown option codes are never dropped. In aggressive mode, an eligible packet is always dropped, while in conservative mode, at least two ACKs are kept in the queue. Only pure ACKs (with no data segments) are considered eligible for dropping, but when an ACK with data segments is enqueued, this can cause another pure ACK to become eligible for dropping. 
The approach described above ensures that this ACK filter avoids most of the drawbacks of a naive filtering mechanism that only keeps flow state but does not inspect the queue. This is the rationale for including the ACK filter in CAKE itself rather than as separate module (as the TC filter, for instance). Our performance evaluation has shown that on a 30/1 Mbps link with a bidirectional traffic test (RRUL), turning on the ACK filter on the upstream link improves downstream throughput by ~20% (both modes) and upstream throughput by ~12% in conservative mode and ~40% in aggressive mode, at the cost of ~5ms of inter-flow latency due to the increased congestion. In really pathological cases, the effect can be a lot more; for instance, the ACK filter increases the achievable downstream throughput on a link with 100 Kbps in the upstream direction by an order of magnitude (from ~2.5 Mbps to ~25 Mbps). Finally, even though we consider the ACK filter to be safer than most, we do not recommend turning it on everywhere: on more symmetrical link bandwidths the effect is negligible at best. Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:34 -07:00
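The queue inspection described in the commit above can be sketched in Python as follows. This is a simplified model, not the kernel implementation: it ignores SACK and other TCP options, and it assumes that conservative mode needs two eligible ACKs before dropping one while aggressive mode needs only one. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    ack_seq: int           # cumulative ACK number carried by the packet
    pure_ack: bool = True  # False if the packet also carries data


def enqueue_with_ack_filter(queue, new_pkt, aggressive=False):
    """Enqueue new_pkt on a per-flow queue; return the dropped ACK or None."""
    # Scan from the head for pure ACKs that acknowledge fewer bytes than the
    # new packet. Duplicate ACKs (same ack_seq) are never eligible.
    eligible = [p for p in queue if p.pure_ack and p.ack_seq < new_pkt.ack_seq]
    # Assumed thresholds: aggressive drops as soon as one ACK is eligible,
    # conservative only when two are, so an older ACK always stays queued.
    needed = 1 if aggressive else 2
    dropped = None
    if len(eligible) >= needed:
        dropped = eligible[0]  # the eligible ACK closest to the head
        queue.remove(dropped)
    queue.append(new_pkt)
    return dropped
```

In this model a queue holding ACKs 100 and 200 drops ACK 100 when ACK 300 arrives in conservative mode, while a queue holding only ACK 100 is left alone unless aggressive mode is selected.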
Toke Høiland-Jørgensen	7298de9cd7	sch_cake: Add ingress mode The ingress mode is meant to be enabled when CAKE runs downlink of the actual bottleneck (such as on an IFB device). The mode changes the shaper to also account dropped packets to the shaped rate, as these have already traversed the bottleneck. Enabling ingress mode will also tune the AQM to always keep at least two packets queued for each flow. This is done by scaling the minimum queue occupancy level that will disable the AQM by the number of active bulk flows. The rationale for this is that retransmits are more expensive in ingress mode, since dropped packets have to traverse the bottleneck again when they are retransmitted; thus, being more lenient and keeping a minimum number of packets queued will improve throughput in cases where the number of active flows are so large that they saturate the bottleneck even at their minimum window size. This commit also adds a separate switch to enable ingress mode rate autoscaling. If enabled, the autoscaling code will observe the actual traffic rate and adjust the shaper rate to match it. This can help avoid latency increases in the case where the actual bottleneck rate decreases below the shaped rate. The scaling filters out spikes by an EWMA filter. Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:34 -07:00
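The spike filtering mentioned for ingress autoscaling can be illustrated with a generic exponentially weighted moving average. The commit message only says that an EWMA filter is used; the weight of 1/8 and the function names below are assumptions for this sketch and are not taken from the kernel code.

```python
def ewma_update(estimate, sample, weight=0.125):
    """Move the estimate a fraction `weight` of the way towards the sample."""
    return estimate + weight * (sample - estimate)


def autoscale(initial_rate, observed_rates, weight=0.125):
    """Fold a series of observed traffic rates into a smoothed rate estimate."""
    estimate = initial_rate
    for sample in observed_rates:
        estimate = ewma_update(estimate, sample, weight)
    return estimate
```

A single outlier moves the estimate only slightly, whereas a sustained change in the observed rate pulls the estimate towards the new level over many samples.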
Toke Høiland-Jørgensen	046f6fd5da	sched: Add Common Applications Kept Enhanced (cake) qdisc sch_cake targets the home router use case and is intended to squeeze the most bandwidth and latency out of even the slowest ISP links and routers, while presenting an API simple enough that even an ISP can configure it. Example of use on a cable ISP uplink: tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter To shape a cable download link (ifb and tc-mirred setup elided) tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash CAKE is filled with: * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel derived Flow Queuing system, which autoconfigures based on the bandwidth. * A novel "triple-isolate" mode (the default) which balances per-host and per-flow FQ even through NAT. * An deficit based shaper, that can also be used in an unlimited mode. * 8 way set associative hashing to reduce flow collisions to a minimum. * A reasonable interpretation of various diffserv latency/loss tradeoffs. * Support for zeroing diffserv markings for entering and exiting traffic. * Support for interacting well with Docsis 3.0 shaper framing. * Extensive support for DSL framing types. * Support for ack filtering. * Extensive statistics for measuring, loss, ecn markings, latency variation. A paper describing the design of CAKE is available at https://arxiv.org/abs/1804.07617, and will be published at the 2018 IEEE International Symposium on Local and Metropolitan Area Networks (LANMAN). This patch adds the base shaper and packet scheduler, while subsequent commits add the optional (configurable) features. The full userspace API and most data structures are included in this commit, but options not understood in the base version will be ignored. Various versions baking have been available as an out of tree build for kernel versions going back to 3.10, as the embedded router world has been running a few years behind mainline Linux. 
A stable version has been generally available on lede-17.01 and later.
sch_cake replaces a combination of iptables, tc filter, htb and fq_codel in the sqm-scripts, with sane defaults and vastly simpler configuration.
CAKE's principal author is Jonathan Morton, with contributions from Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller, Ryan Mounce, Tony Ambardar, Dean Scarff, Nils Andreas Svee, Dave Täht, and Loganaden Velvindron.
Testing from Pete Heist, Georgios Amanakis, and the many other members of the cake@lists.bufferbloat.net mailing list.
tc -s qdisc show dev eth2
 qdisc cake 8017: root refcnt 2 bandwidth 1Gbit diffserv3 triple-isolate split-gso rtt 100.0ms noatm overhead 38 mpu 84
 Sent 51504294511 bytes 37724591 pkt (dropped 6, overlimits 64958695 requeues 12)
 backlog 0b 0p requeues 12
 memory used: 1053008b of 15140Kb
 capacity estimate: 970Mbit
 min/max network layer size: 28 / 1500
 min/max overhead-adjusted size: 84 / 1538
 average network hdr offset: 14
                     Bulk  Best Effort        Voice
  thresh        62500Kbit        1Gbit      250Mbit
  target            5.0ms        5.0ms        5.0ms
  interval        100.0ms      100.0ms      100.0ms
  pk_delay            5us          5us          6us
  av_delay            3us          2us          2us
  sp_delay            2us          1us          1us
  backlog              0b           0b           0b
  pkts            3164050     25030267      9530280
  bytes        3227519915  35396974782  12879808898
  way_inds              0            8            0
  way_miss             21          366           25
  way_cols              0            0            0
  drops                 5            0            1
  marks                 0            0            0
  ack_drop              0            0            0
  sp_flows              1            3            0
  bk_flows              0            1            1
  un_flows              0            0            0
  max_len           68130        68130        68130
Tested-by: Pete Heist <peteheist@gmail.com> Tested-by: Georgios Amanakis <gamanakis@gmail.com> Signed-off-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-10 20:06:34 -07:00
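The relationship between the network layer sizes and the overhead-adjusted sizes in the statistics above can be reproduced with a short Python sketch. It assumes the simplest possible model, in which the configured overhead is added to the packet size and the result is raised to the minimum packet unit (mpu); ATM and PTM framing compensation is not modelled, which matches the noatm setting shown in the output. The function name is hypothetical.

```python
def adjusted_size(network_len, overhead=38, mpu=84):
    """Size charged to the shaper: add per-packet overhead, then apply the mpu floor."""
    return max(network_len + overhead, mpu)
```

For the values in the output, a 28-byte packet becomes 66 bytes with overhead and is raised to the mpu of 84, while a 1500-byte packet becomes 1538 bytes.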
Jesus Sanchez-Palencia	52b509218f	net: Use __u32 in uapi net_stamp.h We are not supposed to use u32 in uapi, so change the flags member of struct sock_txtime from u32 to __u32 instead. Fixes: `80b14dee2b` ("net: Add a new socket option for a future transmit time") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-09 16:31:28 -07:00
David S. Miller	1497d2fd1b	Merge branch 'mlxsw-More-Spectrum-2-preparations' Ido Schimmel says: ==================== mlxsw: More Spectrum-2 preparations This is the second and last set of preparations towards initial Spectrum-2 support in mlxsw. It mainly re-arranges parts of the code that need to work with both ASICs, but somewhat differ. The first three patches allow different ASICs to register different set of operations for KVD linear (KVDL) management. In Spectrum-2 there is no linear memory and instead entries that reside there in Spectrum (e.g., nexthops) are hashed and inserted to the hash-based KVD memory. The fourth patch does a similar restructuring in the low-level multicast router code. This is necessary because multicast routing is implemented using regular circuit TCAM (C-TCAM) in Spectrum, whereas Spectrum-2 uses an algorithmic TCAM (A-TCAM). Next six patches prepare the ACL code for the introduction of A-TCAM in follow-up patch sets. Last two patches allow different ASICs to require different firmware versions and add two resources that need to be queried from firmware by Spectrum-2 specific code. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-09 16:24:18 -07:00
Jiri Pirko	a8b9f232ec	mlxsw: resources: Add couple of Spectrum-2 KVD resources These resources are needed for Spectrum-2 KVD linear management implementation. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-09 16:24:18 -07:00
Jiri Pirko	abfd61825b	mlxsw: spectrum: Prepare for multiple FW versions for Spectrum and Spectrum-2 Prepare for Spectrum-2 FW version checking and make mlxsw_sp_fw_rev_validate() per-ASIC as well as required FW revision and FW filename. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-09 16:24:17 -07:00
Jiri Pirko	ea8b2e28aa	mlxsw: spectrum_acl: Implement priority setting for rules inserted to TCAM For Spectrum-2, we need to insert priority to C-TCAM because HW needs that info in order to correctly process scenarios where rules are in both C-TCAM and A-TCAM. So extend the mlxsw_sp_acl_ctcam_entry_add() args to accept indication if priority needs to be filled up and implement the priority computation and fill-up. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-09 16:24:17 -07:00
Jiri Pirko	42df8358c3	mlxsw: reg: Add priority field for PTCEV2 register This is going to be needed for Spectrum-2 C-TCAM implementation. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-09 16:24:17 -07:00
Jiri Pirko	a5995cc801	mlxsw: spectrum_acl: Move block items encoding into Spectrum op Since Spectrum-2 encodes blocks into different HW layout, push this code into Spectrum-specific op. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-09 16:24:17 -07:00
Jiri Pirko	c17d20838e	mlxsw: spectrum_acl: Convert mlxsw_afk_create args to ops Since the flex keys for Spectrum-2 differ not only in blocks definitions but also in encoding layout, prepare for the implementation and pass Spectrum/Spectrum-2 specific ops down to mlxsw_afk_create. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-09 16:24:17 -07:00
Jiri Pirko	bab5c1cfb7	mlxsw: spectrum_acl: Add tcam init/fini ops Add ops to be called on driver instance init and fini. This makes it possible to do Spectrum-2 specific init and fini work. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-09 16:24:17 -07:00
Jiri Pirko	64eccd0066	mlxsw: spectrum_acl: Split TCAM handling 3 ways To allow easy and clean Spectrum-2 implementation for things that differ from Spectrum, split the existing ACL TCAM code 3 ways: 1) common code that calls Spectrum/Spectrum-2 specific ops 2) Spectrum ops implementations 3) common C-TCAM code that is going to be shared between Spectrum and Spectrum-2 implementations Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-09 16:24:17 -07:00
Jiri Pirko	8fae4392d4	mlxsw: spectrum_mr_tcam: Push Spectrum-specific operations into a separate file Since Spectrum-2 has different handling of TCAM, push Spectrum MR TCAM bits to a separate file accessible by ops, which allows implementing Spectrum-2 specific ops. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-09 16:24:17 -07:00
Jiri Pirko	0304c00546	mlxsw: spectrum_kvdl: Pass entry_count to free function For the Spectrum-2 KVD linear manager implementation, entry_count will be needed even for the free function. So pass it down. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-09 16:24:16 -07:00
Jiri Pirko	4b6b18692a	mlxsw: spectrum_kvdl: Pass entry type to alloc/free Future Spectrum-2 KVD linear manager implementation needs to know type of the entry to alloc and free. So define the types in an enum and pass it down to alloc and free functions. Once the entry type is passed down, KVDL common part knows sizes of each entry types, so replace size function arg with entry count. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-09 16:24:16 -07:00
Jiri Pirko	ebcff74386	mlxsw: spectrum_kvdl: Push out KVD linear management into ops In Spectrum-2 there is a different implementation of KVD linear management. Unlike in Spectrum where there is a single index space, in Spectrum-2 the indexes are per-resource. Also there is need to explicitly tell HW that an entry is no longer used. So push out the existing implementation into spectrum1_kvdl.c and prepare ops infrastructure to allow new implementation in a follow-up. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-09 16:24:16 -07:00
Kees Cook	eec4edc9ee	net/mlx5: Use 2-factor allocator calls This restores the use of 2-factor allocation helpers that were already fixed treewide. Please do not use open-coded multiplication; prefer, instead, using 2-factor allocation helpers. Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-09 16:00:07 -07:00
Julian Wiedmann	95765a6ca1	tcp: remove SG-related comment in tcp_sendmsg() Since commit `74d4a8f8d3` ("tcp: remove sk_can_gso() use"), the code doesn't care whether the interface supports SG. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-09 15:57:11 -07:00
David S. Miller	863f4fdb71	Merge branch 'fix-use-after-free-bugs-in-skb-list-processing' Edward Cree says: ==================== fix use-after-free bugs in skb list processing A couple of bugs in skb list handling were spotted by Dan Carpenter, with the help of Smatch; following up on them I found a couple more similar cases. This series fixes them by changing the relevant loops to use the dequeue-enqueue model (rather than in-place list modification). v3: fixed another similar bug in __netif_receive_skb_list_core(). v2: dropped patch #3 (new list.h helper), per DaveM's request. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-09 14:55:54 -07:00
Edward Cree	9af86f9338	net: core: fix use-after-free in __netif_receive_skb_list_core __netif_receive_skb_core can free the skb, so we have to use the dequeue-enqueue model when calling it from __netif_receive_skb_list_core. Fixes: `88eb1944e1` ("net: core: propagate SKB lists through packet_type lookup") Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-09 14:55:53 -07:00
Edward Cree	9f17dbf04d	netfilter: fix use-after-free in NF_HOOK_LIST nf_hook() can free the skb, so we need to remove it from the list before calling, and add passed skbs to a sublist afterwards. Fixes: `17266ee939` ("net: ipv4: listified version of ip_rcv") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-09 14:55:53 -07:00
Edward Cree	8c057efaeb	net: core: fix uses-after-free in list processing In netif_receive_skb_list_internal(), all of skb_defer_rx_timestamp(), do_xdp_generic() and enqueue_to_backlog() can lead to kfree(skb). Thus, we cannot wait until after they return to remove the skb from the list; instead, we remove it first and, in the pass case, add it to a sublist afterwards. In the case of enqueue_to_backlog() we have already decided not to pass when we call the function, so we do not need a sublist. Fixes: `7da517a3bc` ("net: core: Another step of skb receive list processing") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-09 14:55:53 -07:00
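The dequeue-enqueue model used by the three fixes above can be sketched in Python. This is only an analogy: Python has no use-after-free, so the sketch shows the control flow (detach the item first, then re-add it to a sublist only if it was passed on) rather than the memory-safety aspect. The function name and the handler convention are assumptions for illustration.

```python
from collections import deque


def process_list(pending, handler):
    """Process items with the dequeue-enqueue pattern.

    `handler` returns True if the item should be passed on, or False if it
    consumed the item (in the kernel, the skb may have been freed).
    """
    passed = deque()
    while pending:
        item = pending.popleft()  # detach before the handler can consume it
        if handler(item):
            passed.append(item)   # only surviving items are re-queued
    return passed
```

Because every item is removed from the input list before the handler runs, nothing still links to an item that the handler has consumed, which is the property the in-place list walk lacked.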
Eric Dumazet	c47078d6a3	tcp: remove redundant SOCK_DONE checks In both tcp_splice_read() and tcp_recvmsg(), we already test sock_flag(sk, SOCK_DONE) right before evaluating sk->sk_state, so "!sock_flag(sk, SOCK_DONE)" is always true. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 17:14:58 +09:00
David S. Miller	3d907eafa3	Merge branch 'mlxsw-Spectrum2-acl-prep' Ido Schimmel says: ==================== mlxsw: Spectrum-2 small ACL preparations This is the first set of changes towards Spectrum-2 support in the mlxsw driver. It contains small changes that prepare the code for the later introduction of Spectrum-2 support. The Spectrum-2 ASIC uses an algorithmic TCAM (A-TCAM) instead of a circuit TCAM (C-TCAM) as Spectrum, and thus most of the changes are around the ACL code. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 17:05:20 +09:00
Jiri Pirko	0317a6f4eb	mlxsw: core_acl_flex_actions: Fix helper to get the first KVD linear index The helper should always return the KVD linear index of the second set. It is unused now, but will be used soon. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 17:05:19 +09:00
Jiri Pirko	5b9488fd5f	mlxsw: core_acl_flex_actions: Allow the first set to be dummy In Spectrum-2, the real action sets are always in KVD linear. The first set is always empty and contains only pointer to the first real set in KVD linear. So provide possibility to specify the first set is the dummy one. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 17:05:19 +09:00
Jiri Pirko	9dbab6f588	mlxsw: spectrum: Put pointer to flex action ops to mlxsw_sp Spectrum-2 needs a slightly different handling of flexible actions. So put an ops pointer in mlxsw_sp struct and rename it. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 17:05:19 +09:00
Jiri Pirko	82b63bcf8c	mlxsw: core_acl_flex_keys: Change SRC_SYS_PORT flex key element size The SRC_SYS_PORT is passed as 8 bit value down to hw anyway, so cap it in the driver as well. Also, in Spectrum-2 the FW iface for SRC_SYS_PORT is only 8 bits, so prepare for it. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 17:05:19 +09:00
Jiri Pirko	c43ea06dbd	mlxsw: core_acl_flex_keys: Split MAC and IP address flex key elements Since in Spectrum-2, MACs are split and IP addresses are split as well, in order to use the same elements for Spectrum and Spectrum-2 split them now. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 17:05:19 +09:00
Jiri Pirko	2139469b04	mlxsw: spectrum_acl: Ignore always-zeroed bits in tp->prio The lowest 16 bits of tp->prio are always zero, so ignore them with a shift. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 17:05:19 +09:00
Jiri Pirko	45e0620d5e	mlxsw: reg: Introduce Flex2 key type for PTAR register Introduce Flex2 key type for PTAR register which is used in Spectrum-2. Also, extend mlxsw_reg_ptar_pack() to set the value according to the caller. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 17:05:19 +09:00
Jiri Pirko	d4b0d20fec	mlxsw: spectrum: Change name of mlxsw_sp_afk_blocks to mlxsw_sp1_afk_blocks This is specific for Spectrum as Spectrum-2 has completely different key blocks. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 17:05:19 +09:00
David S. Miller	0dbc81eab4	net: sched: Fix warnings from xchg() on RCU'd cookie pointer. The kbuild test robot reports: >> net/sched/act_api.c:71:15: sparse: incorrect type in initializer (different address spaces) @@ expected struct tc_cookie [noderef] <asn:4>__ret @@ got [noderef] <asn:4>__ret @@ net/sched/act_api.c:71:15: expected struct tc_cookie [noderef] <asn:4>__ret net/sched/act_api.c:71:15: got struct tc_cookie new_cookie >> net/sched/act_api.c:71:13: sparse: incorrect type in assignment (different address spaces) @@ expected struct tc_cookie old @@ got struct tc_cookie [noderef] <struct tc_cookie old @@ net/sched/act_api.c:71:13: expected struct tc_cookie old net/sched/act_api.c:71:13: got struct tc_cookie [noderef] <asn:4>[assigned] __ret >> net/sched/act_api.c:132:48: sparse: dereference of noderef expression Handle this in the usual way by force casting away the __rcu annotation when we are using xchg() on it. Fixes: `eec94fdb04` ("net: sched: use rcu for action cookie update") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 17:02:59 +09:00
David S. Miller	e9ec804564	Merge branch 'Modify-action-API-for-implementing-lockless-actions' Vlad Buslov says: ==================== Modify action API for implementing lockless actions Currently, all netlink protocol handlers for updating rules, actions and qdiscs are protected with single global rtnl lock which removes any possibility for parallelism. This patch set is a first step to remove rtnl lock dependency from TC rules update path. Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added. Handlers registered with this flag are called without RTNL taken. End goal is to have rule update handlers(RTM_NEWTFILTER, RTM_DELTFILTER, etc.) to be registered with UNLOCKED flag to allow parallel execution. However, there is no intention to completely remove or split rtnl lock itself. This patch set addresses specific problems in action API that prevents it from being executed concurrently. This patch set does not completely unlock rules or actions update path. Additional patch sets are required to refactor individual actions and filters update for parallel execution. As a preparation for executing TC rules update handlers without rtnl lock, action API code was audited to determine areas that assume external synchronization with rtnl lock and must be changed to allow safe concurrent access with following results: 1. Action idr is already protected with spinlock. However, some code paths assume that idr state is not changes between several consecutive tcf_idr_* function calls. 2. tc_action reference and bind counters are implemented as plain integers. They purpose was to allow single actions to be shared between multiple filters, not to provide means for concurrent modification. 3. tc_action 'cookie' pointer field is not protected against modification. 4. Action API functions, that work with set of actions, use intrusive linked list, which cannot be used concurrently without additional synchronization. 5. 
Action API functions don't take reference to actions while using them, assuming external synchronization with rtnl lock. Following solutions to these problems are implemented: 1. To remove assumption that idr state doesn't change between tcf_idr_* calls, implement new functions that atomically perform several operations on idr without releasing idr spinlock. (function to atomically lookup and delete action by index, function to atomically check if action exists and allocate new one if necessary, etc.) 2. Use atomic operations on counters to make them suitable for concurrent get/put operations. 3. Data that 'cookie' points to is never modified, so it enough to refactor it to rcu pointer to prevent concurrent de-allocation. 4. Action API doesn't actually use any linked list specific operations on actions intrusive linked list, so it can be refactored to array in straightforward manner. 5. Always take reference to action while accessing it in action API. tcf_idr_search function modified to take reference to action before returning it, so there is no way to lookup an action without incrementing its reference counter. All users of this function are modified to release the reference, after they done using action. With all users using reference counting, it is now safe to concurrently delete actions. Additionally, actions init function signature was expanded with 'rtnl_held' argument, that allows actions that have internal dependency on rtnl lock to take/release it when necessary. Since only shared state in action API module are actions themselves and action idr, these changes are sufficient to not to rely on global rtnl lock for protection of internal action API data structures. Changes from V5 to V6: - Rebase on current net-next - When action is deleted, set pointer in actions array to NULL to prevent double freeing. Changes from V4 to V5: - Change action delete API to track actions that were deleted, to prevent releasing them on error. 
Changes from V3 to V4: - Expand cover letter. - Reduce actions array size in tcf_action_init_1. - Rebase on latest net-next. Changes from V2 to V3: - Re-send with changelog copied to individual patches. Changes from V1 to V2: - Removed redundant actions ops lookup during delete. - Merge action ops delete definition and implementation. - Assume all actions have delete implemented and don't check for it explicitly. - Resplit action lookup/release code to prevent memory leaks in individual patches. - Make __tcf_idr_check function static - Remove unique idr insertion function. Change original idr insert to do the same thing. - Merge changes that take reference to action when performing lookup and changes that account for this additional reference when dumping action to user space into single patch. - Change convoluted commit message. - Rename "unlocked" to "rtnl_held" for clarity. - Remove estimator lock add patch. - Refactor action check-alloc code into standalone function. - Rename tcf_idr_find_delete to tcf_idr_delete_index. - Rearrange variable definitions in tc_action_delete. - Add patch that refactors action API code to use array of pointers to actions instead of intrusive linked list. - Expand cover letter. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:29 +09:00
Vlad Buslov	90b73b77d0	net: sched: change action API to use array of pointers to actions Act API used linked list to pass set of actions to functions. It is intrusive data structure that stores list nodes inside action structure itself, which means it is not safe to modify such list concurrently. However, action API doesn't use any linked list specific operations on this set of actions, so it can be safely refactored into plain pointer array. Refactor action API to use array of pointers to tc_actions instead of linked list. Change argument 'actions' type of exported action init, destroy and dump functions. Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:29 +09:00
Vlad Buslov	0190c1d452	net: sched: atomically check-allocate action Implement function that atomically checks if action exists and either takes reference to it, or allocates idr slot for action index to prevent concurrent allocations of actions with same index. Use EBUSY error pointer to indicate that idr slot is reserved. Implement cleanup helper function that removes temporary error pointer from idr. (in case of error between idr allocation and insertion of newly created action to specified index) Refactor all action init functions to insert new action to idr using this API. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:29 +09:00
Vlad Buslov	cae422f379	net: sched: use reference counting action init Change action API to assume that action init function always takes reference to action, even when overwriting existing action. This is necessary because action API continues to use action pointer after init function is done. At this point action becomes accessible for concurrent modifications, so user must always hold reference to it. Implement helper put list function to atomically release list of actions after action API init code is done using them. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:29 +09:00
Vlad Buslov	4e8ddd7f17	net: sched: don't release reference on action overwrite Return from action init function with reference to action taken, even when overwriting existing action. Action init API initializes its fourth argument (pointer to pointer to tc action) to either existing action with same index or newly created action. In case of an existing index (and bind argument is zero), init function returns without incrementing action reference counter. Caller of action init then proceeds working with action, without actually holding reference to it. This means that action could be deleted concurrently. Change action init behavior to always take reference to action before returning successfully, in order to protect from concurrent deletion. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:29 +09:00
Vlad Buslov	16af606739	net: sched: implement reference counted action release Implement helper delete function that uses new action ops 'delete', instead of destroying action directly. This is required so act API could delete actions by index, without holding any references to action that is being deleted. Implement function __tcf_action_put() that releases reference to action and frees it, if necessary. Refactor action deletion code to use new put function and not to rely on rtnl lock. Remove rtnl lock assertions that are no longer needed. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:29 +09:00
Vlad Buslov	b409074e66	net: sched: add 'delete' function to action ops Extend action ops with 'delete' function. Each action type implements its own delete function that doesn't depend on rtnl lock. Implement delete function that is required to delete actions without holding rtnl lock. Use action API function that atomically deletes action only if it is still in action idr. This implementation prevents concurrent threads from deleting same action twice. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:29 +09:00
Vlad Buslov	2a2ea34970	net: sched: implement action API that deletes action by index Implement new action API function that atomically finds and deletes action from idr by index. Intended to be used by lockless actions that do not rely on rtnl lock. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:28 +09:00
Vlad Buslov	3f7c72bc42	net: sched: always take reference to action Without rtnl lock protection it is no longer safe to use pointer to tc action without holding reference to it (it can be destroyed concurrently). Remove unsafe action idr lookup function. Instead of it, implement safe tcf idr check function that atomically looks up action in idr and increments its reference and bind counters. Implement both action search and check using new safe function. Reference taken by idr check is temporary and should not be accounted by userspace clients (both logically and to preserve current API behavior). Subtract temporary reference when dumping action to userspace using existing tca_get_fill function arguments. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:28 +09:00
Vlad Buslov	789871bb2a	net: sched: implement unlocked action init API Add additional 'rtnl_held' argument to act API init functions. It is required to implement actions that need to release rtnl lock before loading kernel module and reacquire it afterwards. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:28 +09:00
Vlad Buslov	036bb44327	net: sched: change type of reference and bind counters Change type of action reference counter to refcount_t. Change type of action bind counter to atomic_t. This type is used to allow decrementing bind counter without testing for 0 result. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:28 +09:00
Vlad Buslov	eec94fdb04	net: sched: use rcu for action cookie update Implement functions to atomically update and free action cookie using rcu mechanism. Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 12:42:28 +09:00
Yifeng Sun	b233504033	openvswitch: kernel datapath clone action Add 'clone' action to kernel datapath by using existing functions. When actions within clone don't modify the current flow, the flow key is not cloned before executing clone actions. This is a follow up patch for this incomplete work: https://patchwork.ozlabs.org/patch/722096/ v1 -> v2: Refactor as advised by reviewer. Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com> Signed-off-by: Andy Zhou <azhou@ovn.org> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2018-07-08 11:13:25 +09:00
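
The check-allocate pattern described in commits 0190c1d452 and 3f7c72bc42 above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: a fixed-size array stands in for the action idr, a pthread mutex stands in for the idr spinlock, and every name here (struct action, action_check_alloc, action_insert, action_cleanup, action_put, SLOT_BUSY) is hypothetical. The sketch simply reports -EBUSY when a slot is reserved by someone else; how the kernel reacts in that case is not modeled.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TABLE_SIZE 16

/* Hypothetical stand-in for an action: an index plus a reference count. */
struct action {
    unsigned int index;
    int refcnt;
};

/* Placeholder stored in a reserved slot, modeled on the EBUSY error pointer. */
#define SLOT_BUSY ((struct action *)(intptr_t)(-EBUSY))

static struct action *table[TABLE_SIZE];                 /* stands in for the idr */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Atomically check whether an action exists at 'index'.
 * Returns 1 and stores the action in *out with an extra reference if it exists,
 * 0 if the slot was empty and is now reserved for the caller,
 * -EBUSY if another caller holds the reservation, -EINVAL if out of range.
 */
int action_check_alloc(unsigned int index, struct action **out)
{
    int ret;

    if (index >= TABLE_SIZE)
        return -EINVAL;

    pthread_mutex_lock(&table_lock);
    if (table[index] == SLOT_BUSY) {
        ret = -EBUSY;
    } else if (table[index] != NULL) {
        table[index]->refcnt++;      /* lookup never returns without a reference */
        *out = table[index];
        ret = 1;
    } else {
        table[index] = SLOT_BUSY;    /* reserve the index for the caller */
        ret = 0;
    }
    pthread_mutex_unlock(&table_lock);
    return ret;
}

/* Replace the reservation at 'index' with a new action holding one reference. */
struct action *action_insert(unsigned int index)
{
    struct action *a = malloc(sizeof(*a));

    if (a == NULL)
        return NULL;
    a->index = index;
    a->refcnt = 1;

    pthread_mutex_lock(&table_lock);
    assert(table[index] == SLOT_BUSY);   /* caller must own the reservation */
    table[index] = a;
    pthread_mutex_unlock(&table_lock);
    return a;
}

/* Drop a reservation after a failure between check-alloc and insert. */
void action_cleanup(unsigned int index)
{
    pthread_mutex_lock(&table_lock);
    if (table[index] == SLOT_BUSY)
        table[index] = NULL;
    pthread_mutex_unlock(&table_lock);
}

/* Release one reference; on the last one, unpublish and free. Returns 1 if freed. */
int action_put(struct action *a)
{
    int freed = 0;

    pthread_mutex_lock(&table_lock);
    assert(a->refcnt > 0);
    if (--a->refcnt == 0) {
        table[a->index] = NULL;
        freed = 1;
    }
    pthread_mutex_unlock(&table_lock);

    if (freed)
        free(a);
    return freed;
}
```

In this model a caller first invokes action_check_alloc(); on 0 it creates the action and publishes it with action_insert(), or calls action_cleanup() if creation fails, and on 1 it uses the returned action and later drops the reference with action_put().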

767788 Commits