linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-01 08:31:37 +00:00

Author	SHA1	Message	Date
Jörn-Thorben Hinz	41c95dd6a6	bpf: Allow a TCP CC to write sk_pacing_rate and sk_pacing_status A CC that implements tcp_congestion_ops.cong_control() should be able to control sk_pacing_rate and set sk_pacing_status, since tcp_update_pacing_rate() is never called in this case. A built-in CC or one from a kernel module is already able to write to both struct sock members. For a BPF program, write access has not been allowed, yet. Signed-off-by: Jörn-Thorben Hinz <jthinz@mailbox.tu-berlin.de> Reviewed-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/r/20220622191227.898118-2-jthinz@mailbox.tu-berlin.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-06-23 09:49:57 -07:00
Jian Shen	9676feccac	test_bpf: fix incorrect netdev features The prototype of .features is netdev_features_t, it should use NETIF_F_LLTX and NETIF_F_HW_VLAN_STAG_TX, not NETIF_F_LLTX_BIT and NETIF_F_HW_VLAN_STAG_TX_BIT. Fixes: `cf204a7183` ("bpf, testing: Introduce 'gso_linear_no_head_frag' skb_segment test") Signed-off-by: Jian Shen <shenjian15@huawei.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20220622135002.8263-1-shenjian15@huawei.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-06-22 19:20:20 -07:00
Dave Marchevsky	7308748925	selftests/bpf: Add benchmark for local_storage get Add a benchmarks to demonstrate the performance cliff for local_storage get as the number of local_storage maps increases beyond current local_storage implementation's cache size. "sequential get" and "interleaved get" benchmarks are added, both of which do many bpf_task_storage_get calls on sets of task local_storage maps of various counts, while considering a single specific map to be 'important' and counting task_storage_gets to the important map separately in addition to normal 'hits' count of all gets. Goal here is to mimic scenario where a particular program using one map - the important one - is running on a system where many other local_storage maps exist and are accessed often. While "sequential get" benchmark does bpf_task_storage_get for map 0, 1, ..., {9, 99, 999} in order, "interleaved" benchmark interleaves 4 bpf_task_storage_gets for the important map for every 10 map gets. This is meant to highlight performance differences when important map is accessed far more frequently than non-important maps. A "hashmap control" benchmark is also included for easy comparison of standard bpf hashmap lookup vs local_storage get. The benchmark is similar to "sequential get", but creates and uses BPF_MAP_TYPE_HASH instead of local storage. Only one inner map is created - a hashmap meant to hold tid -> data mapping for all tasks. Size of the hashmap is hardcoded to my system's PID_MAX_LIMIT (4,194,304). The number of these keys which are actually fetched as part of the benchmark is configurable. Addition of this benchmark is inspired by conversation with Alexei in a previous patchset's thread [0], which highlighted the need for such a benchmark to motivate and validate improvements to local_storage implementation. My approach in that series focused on improving performance for explicitly-marked 'important' maps and was rejected with feedback to make more generally-applicable improvements while avoiding explicitly marking maps as important. Thus the benchmark reports both general and important-map-focused metrics, so effect of future work on both is clear. Regarding the benchmark results. On a powerful system (Skylake, 20 cores, 256gb ram): Hashmap Control =============== num keys: 10 hashmap (control) sequential get: hits throughput: 20.900 ± 0.334 M ops/s, hits latency: 47.847 ns/op, important_hits throughput: 20.900 ± 0.334 M ops/s num keys: 1000 hashmap (control) sequential get: hits throughput: 13.758 ± 0.219 M ops/s, hits latency: 72.683 ns/op, important_hits throughput: 13.758 ± 0.219 M ops/s num keys: 10000 hashmap (control) sequential get: hits throughput: 6.995 ± 0.034 M ops/s, hits latency: 142.959 ns/op, important_hits throughput: 6.995 ± 0.034 M ops/s num keys: 100000 hashmap (control) sequential get: hits throughput: 4.452 ± 0.371 M ops/s, hits latency: 224.635 ns/op, important_hits throughput: 4.452 ± 0.371 M ops/s num keys: 4194304 hashmap (control) sequential get: hits throughput: 3.043 ± 0.033 M ops/s, hits latency: 328.587 ns/op, important_hits throughput: 3.043 ± 0.033 M ops/s Local Storage ============= num_maps: 1 local_storage cache sequential get: hits throughput: 47.298 ± 0.180 M ops/s, hits latency: 21.142 ns/op, important_hits throughput: 47.298 ± 0.180 M ops/s local_storage cache interleaved get: hits throughput: 55.277 ± 0.888 M ops/s, hits latency: 18.091 ns/op, important_hits throughput: 55.277 ± 0.888 M ops/s num_maps: 10 local_storage cache sequential get: hits throughput: 40.240 ± 0.802 M ops/s, hits latency: 24.851 ns/op, important_hits throughput: 4.024 ± 0.080 M ops/s local_storage cache interleaved get: hits throughput: 48.701 ± 0.722 M ops/s, hits latency: 20.533 ns/op, important_hits throughput: 17.393 ± 0.258 M ops/s num_maps: 16 local_storage cache sequential get: hits throughput: 44.515 ± 0.708 M ops/s, hits latency: 22.464 ns/op, important_hits throughput: 2.782 ± 0.044 M ops/s local_storage cache interleaved get: hits throughput: 49.553 ± 2.260 M ops/s, hits latency: 20.181 ns/op, important_hits throughput: 15.767 ± 0.719 M ops/s num_maps: 17 local_storage cache sequential get: hits throughput: 38.778 ± 0.302 M ops/s, hits latency: 25.788 ns/op, important_hits throughput: 2.284 ± 0.018 M ops/s local_storage cache interleaved get: hits throughput: 43.848 ± 1.023 M ops/s, hits latency: 22.806 ns/op, important_hits throughput: 13.349 ± 0.311 M ops/s num_maps: 24 local_storage cache sequential get: hits throughput: 19.317 ± 0.568 M ops/s, hits latency: 51.769 ns/op, important_hits throughput: 0.806 ± 0.024 M ops/s local_storage cache interleaved get: hits throughput: 24.397 ± 0.272 M ops/s, hits latency: 40.989 ns/op, important_hits throughput: 6.863 ± 0.077 M ops/s num_maps: 32 local_storage cache sequential get: hits throughput: 13.333 ± 0.135 M ops/s, hits latency: 75.000 ns/op, important_hits throughput: 0.417 ± 0.004 M ops/s local_storage cache interleaved get: hits throughput: 16.898 ± 0.383 M ops/s, hits latency: 59.178 ns/op, important_hits throughput: 4.717 ± 0.107 M ops/s num_maps: 100 local_storage cache sequential get: hits throughput: 6.360 ± 0.107 M ops/s, hits latency: 157.233 ns/op, important_hits throughput: 0.064 ± 0.001 M ops/s local_storage cache interleaved get: hits throughput: 7.303 ± 0.362 M ops/s, hits latency: 136.930 ns/op, important_hits throughput: 1.907 ± 0.094 M ops/s num_maps: 1000 local_storage cache sequential get: hits throughput: 0.452 ± 0.010 M ops/s, hits latency: 2214.022 ns/op, important_hits throughput: 0.000 ± 0.000 M ops/s local_storage cache interleaved get: hits throughput: 0.542 ± 0.007 M ops/s, hits latency: 1843.341 ns/op, important_hits throughput: 0.136 ± 0.002 M ops/s Looking at the "sequential get" results, it's clear that as the number of task local_storage maps grows beyond the current cache size (16), there's a significant reduction in hits throughput. Note that current local_storage implementation assigns a cache_idx to maps as they are created. Since "sequential get" is creating maps 0..n in order and then doing bpf_task_storage_get calls in the same order, the benchmark is effectively ensuring that a map will not be in cache when the program tries to access it. For "interleaved get" results, important-map hits throughput is greatly increased as the important map is more likely to be in cache by virtue of being accessed far more frequently. Throughput still reduces as # maps increases, though. To get a sense of the overhead of the benchmark program, I commented out bpf_task_storage_get/bpf_map_lookup_elem in local_storage_bench.c and ran the benchmark on the same host as the 'real' run. Results: Hashmap Control =============== num keys: 10 hashmap (control) sequential get: hits throughput: 54.288 ± 0.655 M ops/s, hits latency: 18.420 ns/op, important_hits throughput: 54.288 ± 0.655 M ops/s num keys: 1000 hashmap (control) sequential get: hits throughput: 52.913 ± 0.519 M ops/s, hits latency: 18.899 ns/op, important_hits throughput: 52.913 ± 0.519 M ops/s num keys: 10000 hashmap (control) sequential get: hits throughput: 53.480 ± 1.235 M ops/s, hits latency: 18.699 ns/op, important_hits throughput: 53.480 ± 1.235 M ops/s num keys: 100000 hashmap (control) sequential get: hits throughput: 54.982 ± 1.902 M ops/s, hits latency: 18.188 ns/op, important_hits throughput: 54.982 ± 1.902 M ops/s num keys: 4194304 hashmap (control) sequential get: hits throughput: 50.858 ± 0.707 M ops/s, hits latency: 19.662 ns/op, important_hits throughput: 50.858 ± 0.707 M ops/s Local Storage ============= num_maps: 1 local_storage cache sequential get: hits throughput: 110.990 ± 4.828 M ops/s, hits latency: 9.010 ns/op, important_hits throughput: 110.990 ± 4.828 M ops/s local_storage cache interleaved get: hits throughput: 161.057 ± 4.090 M ops/s, hits latency: 6.209 ns/op, important_hits throughput: 161.057 ± 4.090 M ops/s num_maps: 10 local_storage cache sequential get: hits throughput: 112.930 ± 1.079 M ops/s, hits latency: 8.855 ns/op, important_hits throughput: 11.293 ± 0.108 M ops/s local_storage cache interleaved get: hits throughput: 115.841 ± 2.088 M ops/s, hits latency: 8.633 ns/op, important_hits throughput: 41.372 ± 0.746 M ops/s num_maps: 16 local_storage cache sequential get: hits throughput: 115.653 ± 0.416 M ops/s, hits latency: 8.647 ns/op, important_hits throughput: 7.228 ± 0.026 M ops/s local_storage cache interleaved get: hits throughput: 138.717 ± 1.649 M ops/s, hits latency: 7.209 ns/op, important_hits throughput: 44.137 ± 0.525 M ops/s num_maps: 17 local_storage cache sequential get: hits throughput: 112.020 ± 1.649 M ops/s, hits latency: 8.927 ns/op, important_hits throughput: 6.598 ± 0.097 M ops/s local_storage cache interleaved get: hits throughput: 128.089 ± 1.960 M ops/s, hits latency: 7.807 ns/op, important_hits throughput: 38.995 ± 0.597 M ops/s num_maps: 24 local_storage cache sequential get: hits throughput: 92.447 ± 5.170 M ops/s, hits latency: 10.817 ns/op, important_hits throughput: 3.855 ± 0.216 M ops/s local_storage cache interleaved get: hits throughput: 128.844 ± 2.808 M ops/s, hits latency: 7.761 ns/op, important_hits throughput: 36.245 ± 0.790 M ops/s num_maps: 32 local_storage cache sequential get: hits throughput: 102.042 ± 1.462 M ops/s, hits latency: 9.800 ns/op, important_hits throughput: 3.194 ± 0.046 M ops/s local_storage cache interleaved get: hits throughput: 126.577 ± 1.818 M ops/s, hits latency: 7.900 ns/op, important_hits throughput: 35.332 ± 0.507 M ops/s num_maps: 100 local_storage cache sequential get: hits throughput: 111.327 ± 1.401 M ops/s, hits latency: 8.983 ns/op, important_hits throughput: 1.113 ± 0.014 M ops/s local_storage cache interleaved get: hits throughput: 131.327 ± 1.339 M ops/s, hits latency: 7.615 ns/op, important_hits throughput: 34.302 ± 0.350 M ops/s num_maps: 1000 local_storage cache sequential get: hits throughput: 101.978 ± 0.563 M ops/s, hits latency: 9.806 ns/op, important_hits throughput: 0.102 ± 0.001 M ops/s local_storage cache interleaved get: hits throughput: 141.084 ± 1.098 M ops/s, hits latency: 7.088 ns/op, important_hits throughput: 35.430 ± 0.276 M ops/s Adjusting for overhead, latency numbers for "hashmap control" and "sequential get" are: hashmap_control_1k: ~53.8ns hashmap_control_10k: ~124.2ns hashmap_control_100k: ~206.5ns sequential_get_1: ~12.1ns sequential_get_10: ~16.0ns sequential_get_16: ~13.8ns sequential_get_17: ~16.8ns sequential_get_24: ~40.9ns sequential_get_32: ~65.2ns sequential_get_100: ~148.2ns sequential_get_1000: ~2204ns Clearly demonstrating a cliff. In the discussion for v1 of this patch, Alexei noted that local_storage was 2.5x faster than a large hashmap when initially implemented [1]. The benchmark results show that local_storage is 5-10x faster: a long-running BPF application putting some pid-specific info into a hashmap for each pid it sees will probably see on the order of 10-100k pids. Bench numbers for hashmaps of this size are ~10x slower than sequential_get_16, but as the number of local_storage maps grows far past local_storage cache size the performance advantage shrinks and eventually reverses. When running the benchmarks it may be necessary to bump 'open files' ulimit for a successful run. [0]: https://lore.kernel.org/all/20220420002143.1096548-1-davemarchevsky@fb.com [1]: https://lore.kernel.org/bpf/20220511173305.ftldpn23m4ski3d3@MBP-98dd607d3435.dhcp.thefacebook.com/ Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20220620222554.270578-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-06-22 19:14:33 -07:00
Andy Gospodarek	7722517422	samples/bpf: fixup some tools to be able to support xdp multibuffer This changes the section name for the bpf program embedded in these files to "xdp.frags" to allow the programs to be loaded on drivers that are using an MTU greater than PAGE_SIZE. Rather than directly accessing the buffers, the packet data is now accessed via xdp helper functions to provide an example for those who may need to write more complex programs. v2: remove new unnecessary variable Signed-off-by: Andy Gospodarek <gospo@broadcom.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/20220621175402.35327-1-gospo@broadcom.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-06-21 18:55:53 -07:00
Jakub Sitnicki	d4609a5d8c	bpf, arm64: Keep tail call count across bpf2bpf calls Today doing a BPF tail call after a BPF to BPF call, that is from a subprogram, is allowed only by the x86-64 BPF JIT. Mixing these features requires support from JIT. Tail call count has to be tracked through BPF to BPF calls, as well as through BPF tail calls to prevent unbounded chains of tail calls. arm64 BPF JIT stores the tail call count (TCC) in a dedicated register (X26). This makes it easier to support bpf2bpf calls mixed with tail calls than on x86 platform. In order to keep the tail call count in tact throughout bpf2bpf calls, all we need to do is tweak the program prologue generator. When emitting prologue for a subprogram, we skip the block that initializes the tail call count and emits a jump pad for the tail call. With this change, a sample execution flow where a bpf2bpf call is followed by a tail call would look like so: int entry(struct __sk_buff skb): 0xffffffc0090151d4: paciasp 0xffffffc0090151d8: stp x29, x30, [sp, #-16]! 0xffffffc0090151dc: mov x29, sp 0xffffffc0090151e0: stp x19, x20, [sp, #-16]! 0xffffffc0090151e4: stp x21, x22, [sp, #-16]! 0xffffffc0090151e8: stp x25, x26, [sp, #-16]! 0xffffffc0090151ec: stp x27, x28, [sp, #-16]! 0xffffffc0090151f0: mov x25, sp 0xffffffc0090151f4: mov x26, #0x0 // <- init TCC only 0xffffffc0090151f8: bti j // in main prog 0xffffffc0090151fc: sub x27, x25, #0x0 0xffffffc009015200: sub sp, sp, #0x10 0xffffffc009015204: mov w1, #0x0 0xffffffc009015208: mov x10, #0xffffffffffffffff 0xffffffc00901520c: strb w1, [x25, x10] 0xffffffc009015210: mov x10, #0xffffffffffffd25c 0xffffffc009015214: movk x10, #0x902, lsl #16 0xffffffc009015218: movk x10, #0xffc0, lsl #32 0xffffffc00901521c: blr x10 -------------------. // bpf2bpf call 0xffffffc009015220: add x7, x0, #0x0 <-------------. 0xffffffc009015224: add sp, sp, #0x10 \| \| 0xffffffc009015228: ldp x27, x28, [sp], #16 \| \| 0xffffffc00901522c: ldp x25, x26, [sp], #16 \| \| 0xffffffc009015230: ldp x21, x22, [sp], #16 \| \| 0xffffffc009015234: ldp x19, x20, [sp], #16 \| \| 0xffffffc009015238: ldp x29, x30, [sp], #16 \| \| 0xffffffc00901523c: add x0, x7, #0x0 \| \| 0xffffffc009015240: autiasp \| \| 0xffffffc009015244: ret \| \| \| \| int subprog_tail(struct __sk_buff skb): \| \| 0xffffffc00902d25c: paciasp <----------------------' \| 0xffffffc00902d260: stp x29, x30, [sp, #-16]! \| 0xffffffc00902d264: mov x29, sp \| 0xffffffc00902d268: stp x19, x20, [sp, #-16]! \| 0xffffffc00902d26c: stp x21, x22, [sp, #-16]! \| 0xffffffc00902d270: stp x25, x26, [sp, #-16]! \| 0xffffffc00902d274: stp x27, x28, [sp, #-16]! \| 0xffffffc00902d278: mov x25, sp \| 0xffffffc00902d27c: sub x27, x25, #0x0 \| 0xffffffc00902d280: sub sp, sp, #0x10 \| // <- end of prologue, notice: 0xffffffc00902d284: add x19, x0, #0x0 \| // 1) TCC not touched, and 0xffffffc00902d288: mov w0, #0x1 \| // 2) no tail call jump pad 0xffffffc00902d28c: mov x10, #0xfffffffffffffffc \| 0xffffffc00902d290: str w0, [x25, x10] \| 0xffffffc00902d294: mov x20, #0xffffff80ffffffff \| 0xffffffc00902d298: movk x20, #0xc033, lsl #16 \| 0xffffffc00902d29c: movk x20, #0x4e00 \| 0xffffffc00902d2a0: add x0, x19, #0x0 \| 0xffffffc00902d2a4: add x1, x20, #0x0 \| 0xffffffc00902d2a8: mov x2, #0x0 \| 0xffffffc00902d2ac: mov w10, #0x24 \| 0xffffffc00902d2b0: ldr w10, [x1, x10] \| 0xffffffc00902d2b4: add w2, w2, #0x0 \| 0xffffffc00902d2b8: cmp w2, w10 \| 0xffffffc00902d2bc: b.cs 0xffffffc00902d2f8 \| 0xffffffc00902d2c0: mov w10, #0x21 \| 0xffffffc00902d2c4: cmp x26, x10 \| // TCC >= MAX_TAIL_CALL_CNT? 0xffffffc00902d2c8: b.cs 0xffffffc00902d2f8 \| 0xffffffc00902d2cc: add x26, x26, #0x1 \| // TCC++ 0xffffffc00902d2d0: mov w10, #0x110 \| 0xffffffc00902d2d4: add x10, x1, x10 \| 0xffffffc00902d2d8: lsl x11, x2, #3 \| 0xffffffc00902d2dc: ldr x11, [x10, x11] \| 0xffffffc00902d2e0: cbz x11, 0xffffffc00902d2f8 \| 0xffffffc00902d2e4: mov w10, #0x30 \| 0xffffffc00902d2e8: ldr x10, [x11, x10] \| 0xffffffc00902d2ec: add x10, x10, #0x24 \| 0xffffffc00902d2f0: add sp, sp, #0x10 \| // <- destroy just current 0xffffffc00902d2f4: br x10 ---------------------. \| // BPF stack frame 0xffffffc00902d2f8: mov x10, #0xfffffffffffffffc \| \| // before the tail call 0xffffffc00902d2fc: ldr w7, [x25, x10] \| \| 0xffffffc00902d300: add sp, sp, #0x10 \| \| 0xffffffc00902d304: ldp x27, x28, [sp], #16 \| \| 0xffffffc00902d308: ldp x25, x26, [sp], #16 \| \| 0xffffffc00902d30c: ldp x21, x22, [sp], #16 \| \| 0xffffffc00902d310: ldp x19, x20, [sp], #16 \| \| 0xffffffc00902d314: ldp x29, x30, [sp], #16 \| \| 0xffffffc00902d318: add x0, x7, #0x0 \| \| 0xffffffc00902d31c: autiasp \| \| 0xffffffc00902d320: ret \| \| \| \| int classifier_0(struct __sk_buff *skb): \| \| 0xffffffc008ff5874: paciasp \| \| 0xffffffc008ff5878: stp x29, x30, [sp, #-16]! \| \| 0xffffffc008ff587c: mov x29, sp \| \| 0xffffffc008ff5880: stp x19, x20, [sp, #-16]! \| \| 0xffffffc008ff5884: stp x21, x22, [sp, #-16]! \| \| 0xffffffc008ff5888: stp x25, x26, [sp, #-16]! \| \| 0xffffffc008ff588c: stp x27, x28, [sp, #-16]! \| \| 0xffffffc008ff5890: mov x25, sp \| \| 0xffffffc008ff5894: mov x26, #0x0 \| \| 0xffffffc008ff5898: bti j <----------------------' \| 0xffffffc008ff589c: sub x27, x25, #0x0 \| 0xffffffc008ff58a0: sub sp, sp, #0x0 \| 0xffffffc008ff58a4: mov x0, #0xffffffc0ffffffff \| 0xffffffc008ff58a8: movk x0, #0x8fc, lsl #16 \| 0xffffffc008ff58ac: movk x0, #0x6000 \| 0xffffffc008ff58b0: mov w1, #0x1 \| 0xffffffc008ff58b4: str w1, [x0] \| 0xffffffc008ff58b8: mov w7, #0x0 \| 0xffffffc008ff58bc: mov sp, sp \| 0xffffffc008ff58c0: ldp x27, x28, [sp], #16 \| 0xffffffc008ff58c4: ldp x25, x26, [sp], #16 \| 0xffffffc008ff58c8: ldp x21, x22, [sp], #16 \| 0xffffffc008ff58cc: ldp x19, x20, [sp], #16 \| 0xffffffc008ff58d0: ldp x29, x30, [sp], #16 \| 0xffffffc008ff58d4: add x0, x7, #0x0 \| 0xffffffc008ff58d8: autiasp \| 0xffffffc008ff58dc: ret -------------------------------' Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220617105735.733938-3-jakub@cloudflare.com	2022-06-21 18:52:14 +02:00
Tony Ambardar	95acd8817e	bpf, x64: Add predicate for bpf2bpf with tailcalls support in JIT The BPF core/verifier is hard-coded to permit mixing bpf2bpf and tail calls for only x86-64. Change the logic to instead rely on a new weak function 'bool bpf_jit_supports_subprog_tailcalls(void)', which a capable JIT backend can override. Update the x86-64 eBPF JIT to reflect this. Signed-off-by: Tony Ambardar <Tony.Ambardar@gmail.com> [jakub: drop MIPS bits and tweak patch subject] Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220617105735.733938-2-jakub@cloudflare.com	2022-06-21 18:52:04 +02:00
Alexei Starovoitov	b40b414ec8	Merge branch 'bpf_loop inlining' Eduard Zingerman says: ==================== Hi Everyone, This is the next iteration of the patch. It includes changes suggested by Song, Joanne and Alexei. Please find updated intro message and change log below. This patch implements inlining of calls to bpf_loop helper function when bpf_loop's callback is statically known. E.g. the rewrite does the following transformation during BPF program processing: bpf_loop(10, foo, NULL, 0); -> for (int i = 0; i < 10; ++i) foo(i, NULL); The transformation leads to measurable latency change for simple loops. Measurements using `benchs/run_bench_bpf_loop.sh` inside QEMU / KVM on i7-4710HQ CPU show a drop in latency from 14 ns/op to 2 ns/op. The change is split in five parts: * Update to test_verifier.c to specify expected and unexpected instruction sequences. This allows to check BPF program rewrites applied by e.g. do_mix_fixups function. * Update to test_verifier.c to specify BTF function infos and types per test case. This is necessary for tests that load sub-program addresses to a variable because of the checks applied by check_ld_imm function. * The update to verifier.c that tracks state of the parameters for each bpf_loop call in a program and decides whether it could be replaced by a loop. * A set of test cases for `test_verifier` that use capabilities added by the first two patches to verify instructions produced by inlining logic. * Two test cases for `test_prog` to check that possible corner cases behave as expected. Additional details are available in commit messages for each patch. Changes since v7: - Call to `mark_chain_precision` is added in `loop_flag_is_zero` to avoid potential issues with state pruning and precision tracking. - `flags non-zero` test_verifier test case is updated to have two execution paths reaching `bpf_loop` call, one with flags = 0, another with flags = 1. Potentially this test case should be able to show that call to `mark_chain_precision` is necessary in `loop_flag_is_zero` but not at the moment. Please refer to discussion for [PATCH bpf-next v7 3/5] for additional details. - `stack_depth_extra` computation is updated to guarantee that R6, R7 and R8 offsets are always aligned on 8 byte boundary. - `stack locations for loop vars` test_verifier test case updated to show that R6, R7, R8 offsets are indeed aligned when function stack depth is not a multiple of 8. - I removed Song Liu's ACK from commit message for [PATCH bpf-next v8 4/5] because I updated the patch. (Please let me know if I had to keep the ACK tag). Changes since v6: - Return value of the `optimize_bpf_loop` function is no longer ignored. This is necessary to properly propagate -ENOMEM error. Changes since v5: - Added function `loop_flag_is_zero` to skip a few checks in `update_loop_inline_state` when loop instruction is not fit for inline. Changes since v4: - Added missing `static` modifier for `update_loop_inline_state` and `inline_bpf_loop` functions. - `update_loop_inline_state` updated for better readability. - Fields `initialized` and `fit_for_inline` of `struct bpf_loop_inline_state` are changed back from `bool` to bitfields. - Acks from Song Liu added to comments for patches 1/5, 2/5, 4/5, 5/5. Changes since v3: - Function `adjust_stack_depth_for_loop_inlining` is replaced by function `optimize_bpf_loop`. Function `optimize_bpf_loop` is responsible for both stack depth adjustment and call instruction replacement. - Changes in `do_misc_fixups` are reverted. - Changes in `adjust_subprog_starts_after_remove` are reverted and function `adjust_loop_inline_subprogno` is removed. This is possible because call to `optimize_bpf_loop` is placed before the dead code removal in `opt_remove_dead_code` (in contrast to the position of `do_misc_fixups` where inlining was done in v3). - Field `bpf_insn_aux_data.loop_inline_state` is now a part of anonymous union at the start of the `bpf_insn_aux_data`. - Data structure `bpf_loop_inline_state` is simplified to use single flag field `fit_for_inline` instead of separate fields `flags_is_zero` & `callback_is_constant`. - Macro definition `BPF_MAX_LOOPS` is moved from `include/linux/bpf_verifier.h` to `include/linux/bpf.h` to avoid include of `include/linux/bpf_verifier.h` in `bpf_iter.c`. - `inline_bpf_loop` changed back to use array initialization and hard coded offsets as in v2. - Style / formatting updates. Changes since v2: - fix for `stack_check` test case in `test_progs-no_alu32`, all tests are passing now; - v2 3/3 patch is split in three parts: - kernel changes - test_verifier changes - test_prog changes - updated `inline_bpf_loop` in `verifier.c` to calculate each offset used in instructions to avoid "magic" numbers; - removed newline handling logic in `fail_log` branch of `do_single_test` in `test_verifier.c` to simplify the patch set; - styling fixes suggested in review for v2 of this patch set. Changes since v1: - allow to use SKIP_INSNS in instruction pattern specification in test_verifier tests; - fix for a bug in spill offset assignement for loop vars when bpf_loop is located in a non-main function. ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-06-20 17:40:52 -07:00
Eduard Zingerman	0e1bf9ed20	selftests/bpf: BPF test_prog selftests for bpf_loop inlining Two new test BPF programs for test_prog selftests checking bpf_loop behavior. Both are corner cases for bpf_loop inlinig transformation: - check that bpf_loop behaves correctly when callback function is not a compile time constant - check that local function variables are not affected by allocating additional stack storage for registers spilled by loop inlining Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/r/20220620235344.569325-6-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-06-20 17:40:52 -07:00
Eduard Zingerman	f8acfdd044	selftests/bpf: BPF test_verifier selftests for bpf_loop inlining A number of test cases for BPF selftests test_verifier to check how bpf_loop inline transformation rewrites the BPF program. The following cases are covered: - happy path - no-rewrite when flags is non-zero - no-rewrite when callback is non-constant - subprogno in insn_aux is updated correctly when dead sub-programs are removed - check that correct stack offsets are assigned for spilling of R6-R8 registers Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/r/20220620235344.569325-5-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-06-20 17:40:51 -07:00
Eduard Zingerman	1ade237119	bpf: Inline calls to bpf_loop when callback is known Calls to `bpf_loop` are replaced with direct loops to avoid indirection. E.g. the following: bpf_loop(10, foo, NULL, 0); Is replaced by equivalent of the following: for (int i = 0; i < 10; ++i) foo(i, NULL); This transformation could be applied when: - callback is known and does not change during program execution; - flags passed to `bpf_loop` are always zero. Inlining logic works as follows: - During execution simulation function `update_loop_inline_state` tracks the following information for each `bpf_loop` call instruction: - is callback known and constant? - are flags constant and zero? - Function `optimize_bpf_loop` increases stack depth for functions where `bpf_loop` calls can be inlined and invokes `inline_bpf_loop` to apply the inlining. The additional stack space is used to spill registers R6, R7 and R8. These registers are used as loop counter, loop maximal bound and callback context parameter; Measurements using `benchs/run_bench_bpf_loop.sh` inside QEMU / KVM on i7-4710HQ CPU show a drop in latency from 14 ns/op to 2 ns/op. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/r/20220620235344.569325-4-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-06-20 17:40:51 -07:00
Eduard Zingerman	7a42008ca5	selftests/bpf: allow BTF specs and func infos in test_verifier tests The BTF and func_info specification for test_verifier tests follows the same notation as in prog_tests/btf.c tests. E.g.: ... .func_info = { { 0, 6 }, { 8, 7 } }, .func_info_cnt = 2, .btf_strings = "\0int\0", .btf_types = { BTF_TYPE_INT_ENC(1, BTF_INT_SIGNED, 0, 32, 4), BTF_PTR_ENC(1), }, ... The BTF specification is loaded only when specified. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/r/20220620235344.569325-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-06-20 17:40:51 -07:00
Eduard Zingerman	933ff53191	selftests/bpf: specify expected instructions in test_verifier tests Allows to specify expected and unexpected instruction sequences in test_verifier test cases. The instructions are requested from kernel after BPF program loading, thus allowing to check some of the transformations applied by BPF verifier. - `expected_insn` field specifies a sequence of instructions expected to be found in the program; - `unexpected_insn` field specifies a sequence of instructions that are not expected to be found in the program; - `INSN_OFF_MASK` and `INSN_IMM_MASK` values could be used to mask `off` and `imm` fields. - `SKIP_INSNS` could be used to specify that some instructions in the (un)expected pattern are not important (behavior similar to usage of `\t` in `errstr` field). The intended usage is as follows: { "inline simple bpf_loop call", .insns = { /* main / BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1), BPF_RAW_INSN(BPF_LD \| BPF_IMM \| BPF_DW, BPF_REG_2, BPF_PSEUDO_FUNC, 0, 6), ... BPF_EXIT_INSN(), / callback */ BPF_ALU64_IMM(BPF_MOV, BPF_REG_0, 1), BPF_EXIT_INSN(), }, .expected_insns = { BPF_ALU64_IMM(BPF_MOV, BPF_REG_1, 1), SKIP_INSNS(), BPF_RAW_INSN(BPF_JMP \| BPF_CALL, 0, BPF_PSEUDO_CALL, 8, 1) }, .unexpected_insns = { BPF_RAW_INSN(BPF_JMP \| BPF_CALL, 0, 0, INSN_OFF_MASK, INSN_IMM_MASK), }, .prog_type = BPF_PROG_TYPE_TRACEPOINT, .result = ACCEPT, .runs = 0, }, Here it is expected that move of 1 to register 1 would remain in place and helper function call instruction would be replaced by a relative call instruction. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/r/20220620235344.569325-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-06-20 17:40:51 -07:00
Delyan Kratunov	aca80dd95e	uprobe: gate bpf call behind BPF_EVENTS The call into bpf from uprobes needs to be gated now that it doesn't use the trace_events.h helpers. Randy found this as a randconfig build failure on linux-next [1]. [1]: https://lore.kernel.org/linux-next/2de99180-7d55-2fdf-134d-33198c27cc58@infradead.org/ Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Delyan Kratunov <delyank@fb.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/cb8bfbbcde87ed5d811227a393ef4925f2aadb7b.camel@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>	2022-06-20 17:04:03 -07:00
Maxim Mikityanskiy	e068c0776b	selftests/bpf: Enable config options needed for xdp_synproxy test This commit adds the kernel config options needed to run the recently added xdp_synproxy test. Users without these options will hit errors like this: test_synproxy:FAIL:iptables -t raw -I PREROUTING -i tmp1 -p tcp -m tcp --syn --dport 8080 -j CT --notrack unexpected error: 256 (errno 22) Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220620104939.4094104-1-maximmi@nvidia.com	2022-06-20 14:10:54 +02:00
Cong Wang	43312915b5	skmsg: Get rid of unncessary memset() We always allocate skmsg with kzalloc(), so there is no need to call memset(0) on it, the only thing we need from sk_msg_init() is sg_init_marker(). So introduce a new helper which is just kzalloc()+sg_init_marker(), this saves an unncessary memset(0) for skmsg on fast path. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220615162014.89193-5-xiyou.wangcong@gmail.com	2022-06-20 14:05:52 +02:00
Cong Wang	57452d767f	skmsg: Get rid of skb_clone() With ->read_skb() now we have an entire skb dequeued from receive queue, now we just need to grab an addtional refcnt before passing its ownership to recv actors. And we should not touch them any more, particularly for skb->sk. Fortunately, skb->sk is already set for most of the protocols except UDP where skb->sk has been stolen, so we have to fix it up for UDP case. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220615162014.89193-4-xiyou.wangcong@gmail.com	2022-06-20 14:05:52 +02:00
Cong Wang	965b57b469	net: Introduce a new proto_ops ->read_skb() Currently both splice() and sockmap use ->read_sock() to read skb from receive queue, but for sockmap we only read one entire skb at a time, so ->read_sock() is too conservative to use. Introduce a new proto_ops ->read_skb() which supports this sematic, with this we can finally pass the ownership of skb to recv actors. For non-TCP protocols, all ->read_sock() can be simply converted to ->read_skb(). Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220615162014.89193-3-xiyou.wangcong@gmail.com	2022-06-20 14:05:52 +02:00
Cong Wang	04919bed94	tcp: Introduce tcp_read_skb() This patch inroduces tcp_read_skb() based on tcp_read_sock(), a preparation for the next patch which actually introduces a new sock ops. TCP is special here, because it has tcp_read_sock() which is mainly used by splice(). tcp_read_sock() supports partial read and arbitrary offset, neither of them is needed for sockmap. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220615162014.89193-2-xiyou.wangcong@gmail.com	2022-06-20 14:05:52 +02:00
David S. Miller	4336487e30	Merge branch 'mlxsw-unified-bridge-conversion-part-1' Ido Schimmel says: ==================== mlxsw: Unified bridge conversion - part 1/6 This set starts converting mlxsw to the unified bridge model and mainly adds new device registers and extends existing ones that will be used in follow-up patchsets. High-level summary ================== The unified bridge model is a new way of managing low-level device objects such as filtering identifiers (FIDs). The conversion moves a lot of logic out of the device's firmware towards the driver, but its main selling point is that it allows to overcome various scalability issues related to the amount of entries that need to be programmed to the device. The only (intended) user visible changes of the conversion are improvement in resource utilization and ability to support more router interfaces (RIFs) in Spectrum-{2,3}. Details ======= Commit `50853808ff` ("Merge branch 'mlxsw-Prepare-for-VLAN-aware-bridge-w-VxLAN'") converted mlxsw to emulate 802.1Q FIDs (represent VLANs in a VLAN-aware bridge) using 802.1D FIDs (represent VLAN-unaware bridges). This was necessary because at that time VNI could not be assigned to 802.1Q FIDs, which effectively meant that mlxsw could not support VXLAN with VLAN-aware bridges. The downside of this approach is that multiple {Port,VID}->FID entries are required in order to classify incoming traffic to a FID, as opposed to a single VID->FID entry that can be used with actual 802.1Q FIDs. For example, if 10 ports are members in the same VLAN-aware bridge and the same 100 VLANs are configured on each port, then only 100 VID->FID entries are required with 802.1Q FIDs, whereas 1000 {Port,VID}->FID entries are required with emulated 802.1Q FIDs. The above limitation is the result of various assumptions that were made in the design of the API that was exposed to software. In the unified bridge model the API is much more "raw" and therefore avoids these assumptions, allowing software to configure the device in a more efficient manner. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-20 10:03:34 +01:00
Amit Cohen	b382092265	mlxsw: reg: Add support for VLAN RIF as part of RITR register Router interfaces (RIFs) constructed on top of VLAN-aware bridges are of "VLAN" type, whereas RIFs constructed on top of VLAN-unaware bridges of "FID" type. In other words, the RIF type is derived from the underlying FID type. VLAN RIFs are used on top of 802.1Q FIDs, whereas FID RIFs are used on top of 802.1D FIDs. Currently 802.1Q FIDs are emulated using 802.1D FIDs, and therefore VLAN RIFs are emulated using FID RIFs. As part of converting the driver to use unified bridge, 802.1Q FIDs and VLAN RIFs will be used. Add the relevant fields to RITR register, add pack() function for VLAN RIF and rename one field to fit the internal name. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-20 10:03:34 +01:00
Amit Cohen	1b1c198c30	mlxsw: Add support for egress FID classification after decapsulation As preparation for unified bridge model, add support for VNI->FID mapping via SVFA register. When performing VXLAN encapsulation, the VXLAN header needs to contain a VNI. This VNI is derived from the FID classification performed on ingress, through which the ingress RIF is also determined. Similarly, when performing VXLAN decapsulation, the FID of the packet needs to be determined. This FID is derived from VNI classification performed during decapsulation. In the old model, both entries (i.e., FID->VNI and VNI->FID) were configured via SFMR.vni. In the new model, where ingress is separated from egress, ingress configuration (VNI->FID) is performed via SVFA, while SFMR only configures egress (FID->VNI). Add 'vni' field to SVFA, add new mapping table - VNI to FID, add new pack() function for VNI mapping and edit the comment in SFMR. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-20 10:03:33 +01:00
Amit Cohen	ad9592c061	mlxsw: reg: Add egress FID field to RITR register RITR configures the router interface table. As preparation for unified bridge model, add egress FID field to RITR. After routing, a packet has to perform a layer-2 lookup using the destination MAC it got from the routing and a FID. In the new model, the egress FID is configured by RITR for both sub-port and FID RIFs. Add 'efid' field to sub-port router interface and update FID router interface related comment. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-20 10:03:33 +01:00
Amit Cohen	27f0b6ce06	mlxsw: reg: Add Router Egress Interface to VID Register The REIV maps {egress router interface (eRIF), egress_port} -> {vlan ID}. As preparation for unified bridge model, add REIV register for future use. In the past, firmware would take care of the above mentioned mapping, but in the new model this should be done by software using REIV register. REIV register supports a simultaneous update of 256 ports using 'port_page' field. When 'port_page'=0 the records represent ports 0-255, when 'port_page'=1 the records represent ports 256-511 and so on. The register is reserved while using the legacy model. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-20 10:03:33 +01:00
Amit Cohen	48bca94fff	mlxsw: reg: Replace MID related fields in SFGC register SFGC register maps {packet type, bridge type} -> {MID base, table type}. As preparation for unified bridge model, remove 'mid' field and add 'mid_base' field. The MID index (index to PGT table which maps MID to local port list and SMPE index) is a result of 'mid_base' + 'fid_offset'. Using the legacy bridge model, firmware configures 'mid_base'. However, using the new model, software is responsible to configure it via SFGC register. The 'mid_base' is configured per {packet type, bridge type}, for example, for {Unicast, .1Q}, {Broadcast, .1D}. Add the field 'mid_base' to SFGC register and increase the length of the register accordingly. Remove the field 'mid' as currently it is ignored by the device, its use is an old leftover. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-20 10:03:33 +01:00
Amit Cohen	94536249b8	mlxsw: reg: Add flood related field to SFMR register SFMR register creates and configures FIDs. As preparation for unified bridge model, add a required field for future use. The PGT (Port Group) table maps multicast ID (MID) to {local port list, SMPE index} on Spectrum-1 and to {local port list} on the other ASICs. In the legacy model, software did not interact with this table directly. Instead, it was accessed by firmware in response to registers such as SFTR and SMID. In the new model, the SFTR register is deprecated and software has full control over the PGT table using the SMID register. The configuration of MDB entries (using SFD) is unchanged, but flooding configuration is completely different. SFGC register maps {packet type, bridge type} -> {MID base, table type}, then with FID and FID-offset which are configured via SFMR, the MID index is obtained. Add the field 'flood_bridge_type' to SFMR, software can separate between 802.1q FIDs and vFIDs using two types which are supported. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-20 10:03:33 +01:00
Amit Cohen	485c281cad	mlxsw: reg: Add VID related fields to SFD register SFD register configures FDB table. As preparation for unified bridge model, add some required fields for future use. In the new model, firmware no longer configures the egress VID, this responsibility is moved to software. For layer 2 this means that software needs to determine the egress VID for both unicast and multicast. For unicast FDB records and unicast LAG FDB records, the VID needs to be set via new fields in SFD - 'set_vid' and 'vid'. Add the two mentioned fields for future use. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-20 10:03:33 +01:00
Amit Cohen	92e4e543b1	mlxsw: reg: Add SMPE related fields to SFMR register SFMR register creates and configures FIDs. As preparation unified bridge model, add some required fields for future use. The device includes two main tables to support layer 2 multicast (i.e., MDB and flooding). These are the PGT (Port Group Table) and the MPE (Multicast Port Egress) table. - PGT is {MID -> (bitmap of local_port, SPME index)} - MPE is {(Local port, SMPE index) -> eVID} In Spectrum-2 and later ASICs, the SMPE index is an attribute of the FID and programmed via new fields in SFMR register - 'smpe_valid' and 'smpe'. Add the two mentioned fields for future use and increase the length of the register accordingly. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-20 10:03:33 +01:00
Amit Cohen	894b98d50b	mlxsw: Add SMPE related fields to SMID2 register SMID register maps multicast ID (MID) into a list of local ports. As preparation for unified bridge model, add some required fields for future use. The device includes two main tables to support layer 2 multicast (i.e., MDB and flooding). These are the PGT (Port Group Table) and the MPE (Multicast Port Egress) table. - PGT is {MID -> (bitmap of local_port, SPME index)} - MPE is {(Local port, SMPE index) -> eVID} In Spectrum-1, both indexes into the MPE table (local port and SMPE) are derived from the PGT table. Therefore, the SMPE index needs to be programmed as part of the PGT entry via new fields in SMID - 'smpe_valid' and 'smpe'. Add the two mentioned fields for future use and align the callers of mlxsw_reg_smid2_pack() to pass zeros for SMPE fields. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-20 10:03:33 +01:00
Amit Cohen	e0f071c5b8	mlxsw: reg: Add Switch Multicast Port to Egress VID Register The SMPE register maps {egress_port, SMPE index} -> VID. The device includes two main tables to support layer 2 multicast (i.e., MDB and flooding). These are the PGT (Port Group Table) and the MPE (Multicast Port Egress) table. - PGT is {MID -> (bitmap of local_port, SPME index)} - MPE is {(Local port, SMPE index) -> eVID} In Spectrum-1, the index into the MPE table - called switch multicast to port egress VID (SMPE) - is derived from the PGT entry, whereas in Spectrum-2 and later ASICs it is derived from the FID. In the legacy model, software did not interact with this table as it was completely hidden in firmware. In the new model, software needs to populate the table itself in order to map from {Local port, SMPE index} to an egress VID. This is done using the SMPE register. Add the register for future use. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-20 10:03:33 +01:00
Amit Cohen	dd326565c5	mlxsw: reg: Add ingress RIF related fields to SVFA register SVFA register controls the VID to FID mapping and {Port, VID} to FID mapping for virtualized ports. As preparation for unified bridge model, add some required fields for future use. On ingress, after ingress ACL, a packet needs to be classified to a FID. The key for this lookup can be one of: 1. VID. When port is not in virtual mode. 2. {RQ, VID}. When port is in virtual mode. 3. FID. When FID was set by ingress ACL. Since RITR no longer performs ingress configuration, the ingress RIF for the first two entry types needs to be set via new fields in SVFA - 'irif_v' and 'irif'. Add the two mentioned fields for future use and increase the length of the register accordingly. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-20 10:03:33 +01:00
Amit Cohen	e459466a26	mlxsw: reg: Add ingress RIF related fields to SFMR register SFMR register creates and configures FIDs. As preparation for unified bridge model, add some required fields for future use. On ingress, after ingress ACL, a packet needs to be classified to a FID. The key for this lookup can be one of: 1. VID. When port is not in virtual mode. 2. {RQ, VID}. When port is in virtual mode. 3. FID. When FID was set by ingress ACL. For example, via VR_AND_FID_ACTION. Since RITR no longer performs ingress configuration, the ingress RIF for the last entry type needs to be set via new fields in SFMR - 'irif_v' and 'irif'. Add the two mentioned fields for future use. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-20 10:03:33 +01:00
Amit Cohen	02d23c9544	mlxsw: reg: Add 'flood_rsp' field to SFMR register SFMR register creates and configures FIDs. As preparation for unified bridge model, add a field for future use. In the new model, RITR no longer configures the rFID used for sub-port RIFs and it has to be created by software via SFMR. Such FIDs need to be created with special flood indication using 'flood_rsp' field. When set, this bit instructs the device to manage the flooding entries for this FID in a reserved part of the port group table (PGT). Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-20 10:03:33 +01:00
Ronak Doshi	a56b158a50	vmxnet3: disable overlay offloads if UPT device does not support 'Commit `6f91f4ba04` ("vmxnet3: add support for capability registers")' added support for capability registers. These registers are used to advertize capabilities of the device. The patch updated the dev_caps to disable outer checksum offload if PTCR register does not support it. However, it missed to update other overlay offloads. This patch fixes this issue. Fixes: `6f91f4ba04` ("vmxnet3: add support for capability registers") Signed-off-by: Ronak Doshi <doshir@vmware.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-20 09:12:46 +01:00
David S. Miller	6f9d70466c	Merge branch 'raw-rcu-fixes' Kuniyuki Iwashima says: ==================== raw: Fix nits of RCU conversion series. The first patch fixes a build error by commit `ba44f8182e` ("raw: use more conventional iterators"), but it does not land in the net tree, so this series is targeted to net-next. The second patch replaces some hlist functions with sk's helper macros. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-20 09:10:13 +01:00
Kuniyuki Iwashima	f289c02bf4	raw: Use helpers for the hlist_nulls variant. hlist_nulls_add_head_rcu() and hlist_nulls_for_each_entry() have dedicated macros for sk. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-20 09:10:13 +01:00
Kuniyuki Iwashima	5da39e31b1	raw: Fix mixed declarations error in raw_icmp_error(). The trailing semicolon causes a compiler error, so let's remove it. net/ipv4/raw.c: In function ‘raw_icmp_error’: net/ipv4/raw.c:266:2: error: ISO C90 forbids mixed declarations and code [-Werror=declaration-after-statement] 266 \| struct hlist_nulls_head *hlist; \| ^~~~~~ Fixes: `ba44f8182e` ("raw: use more conventional iterators") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-20 09:10:13 +01:00
Xiang wangx	9776fe0f42	sfc/siena: Fix typo in comment Delete the redundant word 'and'. Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-19 11:29:28 +01:00
Xiang wangx	dd33c5932e	sfc: Fix typo in comment Delete the redundant word 'and'. Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-19 11:29:28 +01:00
Xiang wangx	a278bfb242	net: emac: Fix typo in a comment Delete the redundant word 'and'. Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-19 11:29:28 +01:00
Simon Horman	41a36d4e5a	Revert "nfp: update nfp_X logging definitions" This reverts commit `9386ebccfc` ("nfp: update nfp_X logging definitions") The reverted patch was intended to improve logging for the NFP driver by including information such as the source code file and number in log messages. Unfortunately our experience is that this has not improved things as we had hoped. The resulting logs are inconsistent with (most) other kernel log messages. And rely on knowledge of the source code version in order for the extra information to be useful. Thus, revert the change. We acknowledge that Jakub Kicinski <kuba@kernel.org> foresaw this problem. Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-19 11:28:01 +01:00
David S. Miller	5fc217a3c9	Merge branch 'mii_bmcr_encode_fixed' Russell King says: ==================== net: introduce mii_bmcr_encode_fixed() While converting the mv88e6xxx driver to phylink pcs, it has been noticed that we've started to have repeated cases where we convert a speed and duplex to a BMCR value. Rather than open coding this in multiple locations, let's provide a helper for this - in linux/mii.h. This helper not only takes care of the standard 10, 100 and 1000Mbps encodings, but also includes 2500Mbps (which is the same as 1000Mbps) for those users who require that encoding as well. Unknown speeds will be encoded to 10Mbps, and non-full duplexes will be encoded as half duplex. This series converts the existing users to the new helper, and the mv88e6xxx conversion will add further users in the 6352 and 639x PCS code. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-19 10:38:27 +01:00
Russell King (Oracle)	449b7a1520	net: pcs: pcs-xpcs: use mii_bmcr_encode_fixed() Use the newly introduced mii_bmcr_encode_fixed() for the xpcs driver. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-19 10:38:26 +01:00
Russell King (Oracle)	e62dbaff4b	net: phy: marvell: use mii_bmcr_encode_fixed() Make use of the newly introduced mii_bmcr_encode_fixed() to get the BMCR value when setting loopback mode for the 88e1510. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-19 10:38:26 +01:00
Russell King (Oracle)	f28a602b28	net: phy: use mii_bmcr_encode_fixed() phylib can make use of the newly introduced mii_bmcr_encode_fixed() macro, so let's convert it over. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-19 10:38:26 +01:00
Russell King (Oracle)	bdb6cfe751	net: mii: add mii_bmcr_encode_fixed() Add a function to encode a fixed speed/duplex to a BMCR value. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-19 10:38:26 +01:00
David S. Miller	5d1d527cd9	Merge branch 'raw-RCU-conversion' Eric Dumazet says: ==================== raw: RCU conversion Using rwlock in networking code is extremely risky. writers can starve if enough readers are constantly grabing the rwlock. I thought rwlock were at fault and sent this patch: https://lkml.org/lkml/2022/6/17/272 But Peter and Linus essentially told me rwlock had to be unfair. We need to get rid of rwlock in networking stacks. Without this conversion, following script triggers soft lockups: for i in {1..48} do ping -f -n -q 127.0.0.1 & sleep 0.1 done Next step will be to convert ping sockets to RCU as well. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-19 10:00:02 +01:00
Eric Dumazet	0daf07e527	raw: convert raw sockets to RCU Using rwlock in networking code is extremely risky. writers can starve if enough readers are constantly grabing the rwlock. I thought rwlock were at fault and sent this patch: https://lkml.org/lkml/2022/6/17/272 But Peter and Linus essentially told me rwlock had to be unfair. We need to get rid of rwlock in networking code. Without this fix, following script triggers soft lockups: for i in {1..48} do ping -f -n -q 127.0.0.1 & sleep 0.1 done Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-19 10:00:02 +01:00
Eric Dumazet	ba44f8182e	raw: use more conventional iterators In order to prepare the following patch, I change raw v4 & v6 code to use more conventional iterators. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-19 10:00:02 +01:00
Xiaoliang Yang	8670dc33f4	net: dsa: felix: update base time of time-aware shaper when adjusting PTP time When adjusting the PTP clock, the base time of the TAS configuration will become unreliable. We need reset the TAS configuration by using a new base time. For example, if the driver gets a base time 0 of Qbv configuration from user, and current time is 20000. The driver will set the TAS base time to be 20000. After the PTP clock adjustment, the current time becomes 10000. If the TAS base time is still 20000, it will be a future time, and TAS entry list will stop running. Another example, if the current time becomes to be 10000000 after PTP clock adjust, a large time offset can cause the hardware to hang. This patch introduces a tas_clock_adjust() function to reset the TAS module by using a new base time after the PTP clock adjustment. This can avoid issues above. Due to PTP clock adjustment can occur at any time, it may conflict with the TAS configuration. We introduce a new TAS lock to serialize the access to the TAS registers. Signed-off-by: Xiaoliang Yang <xiaoliang.yang_1@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-19 09:53:59 +01:00
Christian Marangi	c205035e3a	net: ethernet: stmmac: remove select QCOM_SOCINFO and make it optional QCOM_SOCINFO depends on QCOM_SMEM but is not selected, this cause some problems with QCOM_SOCINFO getting selected with the dependency of QCOM_SMEM not met. To fix this remove the select in Kconfig and add additional info in the DWMAC_IPQ806X config description. Reported-by: kernel test robot <lkp@intel.com> Fixes: `9ec092d2fe` ("net: ethernet: stmmac: add missing sgmii configure for ipq806x") Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-06-19 09:52:47 +01:00

1 2 3 4 5 ...

1105458 Commits