linux

Author	SHA1	Message	Date
Andrii Nakryiko	03d5b99138	libbpf: Cleanup struct bpf_core_cand. Remove two redundant fields from struct bpf_core_cand. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211201181040.23337-8-alexei.starovoitov@gmail.com	2021-12-02 11:18:35 -08:00
Alexei Starovoitov	fbd94c7afc	bpf: Pass a set of bpf_core_relo-s to prog_load command. struct bpf_core_relo is generated by llvm and processed by libbpf. It's a de-facto uapi. With CO-RE in the kernel the struct bpf_core_relo becomes uapi de-jure. Add an ability to pass a set of 'struct bpf_core_relo' to prog_load command and let the kernel perform CO-RE relocations. Note the struct bpf_line_info and struct bpf_func_info have the same layout when passed from LLVM to libbpf and from libbpf to the kernel except "insn_off" fields means "byte offset" when LLVM generates it. Then libbpf converts it to "insn index" to pass to the kernel. The struct bpf_core_relo's "insn_off" field is always "byte offset". Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211201181040.23337-6-alexei.starovoitov@gmail.com	2021-12-02 11:18:35 -08:00
Alexei Starovoitov	46334a0cd2	bpf: Define enum bpf_core_relo_kind as uapi. enum bpf_core_relo_kind is generated by llvm and processed by libbpf. It's a de-facto uapi. With CO-RE in the kernel the bpf_core_relo_kind values become uapi de-jure. Also rename them with BPF_CORE_ prefix to distinguish from conflicting names in bpf_core_read.h. The enums bpf_field_info_kind, bpf_type_id_kind, bpf_type_info_kind, bpf_enum_value_kind are passing different values from bpf program into llvm. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211201181040.23337-5-alexei.starovoitov@gmail.com	2021-12-02 11:18:35 -08:00
Alexei Starovoitov	29db4bea1d	bpf: Prepare relo_core.c for kernel duty. Make relo_core.c to be compiled for the kernel and for user space libbpf. Note the patch is reducing BPF_CORE_SPEC_MAX_LEN from 64 to 32. This is the maximum number of nested structs and arrays. For example: struct sample { int a; struct { int b[10]; }; }; struct sample s = ...; int y = &s->b[5]; This field access is encoded as "0:1:0:5" and spec len is 4. The follow up patch might bump it back to 64. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211201181040.23337-4-alexei.starovoitov@gmail.com	2021-12-02 11:18:34 -08:00
Alexei Starovoitov	74753e1462	libbpf: Replace btf__type_by_id() with btf_type_by_id(). To prepare relo_core.c to be compiled in the kernel and the user space replace btf__type_by_id with btf_type_by_id. In libbpf btf__type_by_id and btf_type_by_id have different behavior. bpf_core_apply_relo_insn() needs behavior of uapi btf__type_by_id vs internal btf_type_by_id, but type_id range check is already done in bpf_core_apply_relo(), so it's safe to replace it everywhere. The kernel btf_type_by_id() does the check anyway. It doesn't hurt. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211201181040.23337-2-alexei.starovoitov@gmail.com	2021-12-02 11:18:34 -08:00
Li Zhijian	36d7d36fcf	selftests: net: remove meaningless help option $ ./fcnal-test.sh -t help Test names: help Looks it intent to list the available tests but it didn't do the right thing. I will add another option the do that in the later patch. Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-12-02 13:12:27 +00:00
Li Zhijian	a05431b22b	selftests: net: Correct case name ipv6_addr_bind/ipv4_addr_bind are function names. Previously, bind test would not be run by default due to the wrong case names Fixes: `34d0302ab8` ("selftests: Add ipv6 address bind tests to fcnal-test") Fixes: `75b2b2b3db` ("selftests: Add ipv4 address bind tests to fcnal-test") Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-12-02 12:19:08 +00:00
Boqun Feng	c438b7d860	tools/memory-model: litmus: Add two tests for unlock(A)+lock(B) ordering The memory model has been updated to provide a stronger ordering guarantee for unlock(A)+lock(B) on the same CPU/thread. Therefore add two litmus tests describing this new guarantee, these tests are simple yet can clearly show the usage of the new guarantee, also they can serve as the self tests for the modification in the model. Co-developed-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-11-30 17:47:08 -08:00
Boqun Feng	b47c05ecf6	tools/memory-model: doc: Describe the requirement of the litmus-tests directory It's better that we have some "standard" about which test should be put in the litmus-tests directory because it helps future contributors understand whether they should work on litmus-tests in kernel or Paul's GitHub repo. Therefore explain a little bit on what a "representative" litmus test is. Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-11-30 17:47:08 -08:00
Boqun Feng	ddfe12944e	tools/memory-model: Provide extra ordering for unlock+lock pair on the same CPU A recent discussion[1] shows that we are in favor of strengthening the ordering of unlock + lock on the same CPU: a unlock and a po-after lock should provide the so-called RCtso ordering, that is a memory access S po-before the unlock should be ordered against a memory access R po-after the lock, unless S is a store and R is a load. The strengthening meets programmers' expection that "sequence of two locked regions to be ordered wrt each other" (from Linus), and can reduce the mental burden when using locks. Therefore add it in LKMM. [1]: https://lore.kernel.org/lkml/20210909185937.GA12379@rowland.harvard.edu/ Co-developed-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Acked-by: Palmer Dabbelt <palmerdabbelt@google.com> (RISC-V) Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-11-30 17:47:08 -08:00
Paul E. McKenney	90b21bcfb2	torture: Properly redirect kvm-remote.sh "echo" commands The echo commands following initialization of the "oldrun" variable need to be "tee"d to $oldrun/remote-log. This commit fixes several stragglers. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-11-30 17:30:29 -08:00
Paul E. McKenney	b6c9dbf04f	torture: Fix incorrectly redirected "exit" in kvm-remote.sh The "exit 4" in kvm-remote.sh is pointlessly redirected, so this commit removes the redirection. Fixes: `0092eae4cb` ("torture: Add kvm-remote.sh script for distributed rcutorture test runs") Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-11-30 17:30:29 -08:00
Paul E. McKenney	a959ed627a	rcutorture: Test RCU Tasks lock-contention detection This commit adjusts the TRACE02 scenario to use a pair of callback-flood kthreads. This in turn forces lock contention on the single RCU Tasks Trace callback queue, which forces use of all CPUs' queues, thus testing this transition. (No, there is not yet any way to transition back. Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-11-30 17:30:29 -08:00
Paul E. McKenney	4ead4e3319	rcutorture: Cause TREE02 and TREE10 scenarios to do more callback flooding This commit enables two callback-flood kthreads for the TREE02 scenario and 28 for the TREE10 scenario. Cc: Neeraj Upadhyay <neeraj.iitr10@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-11-30 17:30:29 -08:00
Paul E. McKenney	f61537009e	torture: Retry download once before giving up Currently, a transient network error can kill a run if it happens while downloading the tarball to one of the target systems. This commit therefore does a 60-second wait and then a retry. If further experience indicates, a more elaborate mechanism might be used later. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-11-30 17:30:29 -08:00
Paul E. McKenney	c06354a121	torture: Make kvm-find-errors.sh report link-time undefined symbols This commit makes kvm-find-errors.sh check for and report undefined symbols that are detected at link time. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-11-30 17:30:28 -08:00
Paul E. McKenney	b6a4fd35d2	torture: Catch kvm.sh help text up with actual options This commit brings the kvm.sh script's help text up to date with recently (and some not-so-recently) added parameters. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-11-30 17:30:28 -08:00
Mark Brown	b0fe9dec66	tools/nolibc: Implement gettid() Allow test programs to determine their thread ID. Signed-off-by: Mark Brown <broonie@kernel.org> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-11-30 17:26:42 -08:00
Ammar Faizi	7bdc0e7a39	tools/nolibc: x86-64: Use `mov $60,%eax` instead of `mov $60,%rax` Note that mov to 32-bit register will zero extend to 64-bit register. Thus `mov $60,%eax` has the same effect with `mov $60,%rax`. Use the shorter opcode to achieve the same thing. ``` b8 3c 00 00 00 mov $60,%eax (5 bytes) [1] 48 c7 c0 3c 00 00 00 mov $60,%rax (7 bytes) [2] ``` Currently, we use [2]. Change it to [1] for shorter code. Signed-off-by: Ammar Faizi <ammar.faizi@students.amikom.ac.id> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-11-30 17:26:42 -08:00
Ammar Faizi	bf91666959	tools/nolibc: x86: Remove `r8`, `r9` and `r10` from the clobber list Linux x86-64 syscall only clobbers rax, rcx and r11 (and "memory"). - rax for the return value. - rcx to save the return address. - r11 to save the rflags. Other registers are preserved. Having r8, r9 and r10 in the syscall clobber list is harmless, but this results in a missed-optimization. As the syscall doesn't clobber r8-r10, GCC should be allowed to reuse their value after the syscall returns to userspace. But since they are in the clobber list, GCC will always miss this opportunity. Remove them from the x86-64 syscall clobber list to help GCC generate better code and fix the comment. See also the x86-64 ABI, section A.2 AMD64 Linux Kernel Conventions, A.2.1 Calling Conventions [1]. Extra note: Some people may think it does not really give a benefit to remove r8, r9 and r10 from the syscall clobber list because the impression of syscall is a C function call, and function call always clobbers those 3. However, that is not the case for nolibc.h, because we have a potential to inline the "syscall" instruction (which its opcode is "0f 05") to the user functions. All syscalls in the nolibc.h are written as a static function with inline ASM and are likely always inline if we use optimization flag, so this is a profit not to have r8, r9 and r10 in the clobber list. Here is the example where this matters. Consider the following C code: ``` #include "tools/include/nolibc/nolibc.h" #define read_abc(a, b, c) __asm__ volatile("nop"::"r"(a),"r"(b),"r"(c)) int main(void) { int a = 0xaa; int b = 0xbb; int c = 0xcc; read_abc(a, b, c); write(1, "test\n", 5); read_abc(a, b, c); return 0; } ``` Compile with: gcc -Os test.c -o test -nostdlib With r8, r9, r10 in the clobber list, GCC generates this: 0000000000001000 <main>: 1000: f3 0f 1e fa endbr64 1004: 41 54 push %r12 1006: 41 bc cc 00 00 00 mov $0xcc,%r12d 100c: 55 push %rbp 100d: bd bb 00 00 00 mov $0xbb,%ebp 1012: 53 push %rbx 1013: bb aa 00 00 00 mov $0xaa,%ebx 1018: 90 nop 1019: b8 01 00 00 00 mov $0x1,%eax 101e: bf 01 00 00 00 mov $0x1,%edi 1023: ba 05 00 00 00 mov $0x5,%edx 1028: 48 8d 35 d1 0f 00 00 lea 0xfd1(%rip),%rsi 102f: 0f 05 syscall 1031: 90 nop 1032: 31 c0 xor %eax,%eax 1034: 5b pop %rbx 1035: 5d pop %rbp 1036: 41 5c pop %r12 1038: c3 ret GCC thinks that syscall will clobber r8, r9, r10. So it spills 0xaa, 0xbb and 0xcc to callee saved registers (r12, rbp and rbx). This is clearly extra memory access and extra stack size for preserving them. But syscall does not actually clobber them, so this is a missed optimization. Now without r8, r9, r10 in the clobber list, GCC generates better code: 0000000000001000 <main>: 1000: f3 0f 1e fa endbr64 1004: 41 b8 aa 00 00 00 mov $0xaa,%r8d 100a: 41 b9 bb 00 00 00 mov $0xbb,%r9d 1010: 41 ba cc 00 00 00 mov $0xcc,%r10d 1016: 90 nop 1017: b8 01 00 00 00 mov $0x1,%eax 101c: bf 01 00 00 00 mov $0x1,%edi 1021: ba 05 00 00 00 mov $0x5,%edx 1026: 48 8d 35 d3 0f 00 00 lea 0xfd3(%rip),%rsi 102d: 0f 05 syscall 102f: 90 nop 1030: 31 c0 xor %eax,%eax 1032: c3 ret Cc: Andy Lutomirski <luto@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: David Laight <David.Laight@ACULAB.COM> Acked-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Ammar Faizi <ammar.faizi@students.amikom.ac.id> Link: https://gitlab.com/x86-psABIs/x86-64-ABI/-/wikis/x86-64-psABI [1] Link: https://lore.kernel.org/lkml/20211011040344.437264-1-ammar.faizi@students.amikom.ac.id/ Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-11-30 17:26:42 -08:00
Willy Tarreau	de0244ae40	tools/nolibc: fix incorrect truncation of exit code Ammar Faizi reported that our exit code handling is wrong. We truncate it to the lowest 8 bits but the syscall itself is expected to take a regular 32-bit signed integer, not an unsigned char. It's the kernel that later truncates it to the lowest 8 bits. The difference is visible in strace, where the program below used to show exit(255) instead of exit(-1): int main(void) { return -1; } This patch applies the fix to all archs. x86_64, i386, arm64, armv7 and mips were all tested and confirmed to work fine now. Risc-v was not tested but the change is trivial and exactly the same as for other archs. Reported-by: Ammar Faizi <ammar.faizi@students.amikom.ac.id> Cc: stable@vger.kernel.org Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-11-30 17:26:42 -08:00
Willy Tarreau	ebbe0d8a44	tools/nolibc: i386: fix initial stack alignment After re-checking in the spec and comparing stack offsets with glibc, The last pushed argument must be 16-byte aligned (i.e. aligned before the call) so that in the callee esp+4 is multiple of 16, so the principle is the 32-bit equivalent to what Ammar fixed for x86_64. It's possible that 32-bit code using SSE2 or MMX could have been affected. In addition the frame pointer ought to be zero at the deepest level. Link: https://gitlab.com/x86-psABIs/i386-ABI/-/wikis/Intel386-psABI Cc: Ammar Faizi <ammar.faizi@students.amikom.ac.id> Cc: stable@vger.kernel.org Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-11-30 17:26:42 -08:00
Ammar Faizi	937ed91c71	tools/nolibc: x86-64: Fix startup code bug Before this patch, the `_start` function looks like this: ``` 0000000000001170 <_start>: 1170: pop %rdi 1171: mov %rsp,%rsi 1174: lea 0x8(%rsi,%rdi,8),%rdx 1179: and $0xfffffffffffffff0,%rsp 117d: sub $0x8,%rsp 1181: call 1000 <main> 1186: movzbq %al,%rdi 118a: mov $0x3c,%rax 1191: syscall 1193: hlt 1194: data16 cs nopw 0x0(%rax,%rax,1) 119f: nop ``` Note the "and" to %rsp with $-16, it makes the %rsp be 16-byte aligned, but then there is a "sub" with $0x8 which makes the %rsp no longer 16-byte aligned, then it calls main. That's the bug! What actually the x86-64 System V ABI mandates is that right before the "call", the %rsp must be 16-byte aligned, not after the "call". So the "sub" with $0x8 here breaks the alignment. Remove it. An example where this rule matters is when the callee needs to align its stack at 16-byte for aligned move instruction, like `movdqa` and `movaps`. If the callee can't align its stack properly, it will result in segmentation fault. x86-64 System V ABI also mandates the deepest stack frame should be zero. Just to be safe, let's zero the %rbp on startup as the content of %rbp may be unspecified when the program starts. Now it looks like this: ``` 0000000000001170 <_start>: 1170: pop %rdi 1171: mov %rsp,%rsi 1174: lea 0x8(%rsi,%rdi,8),%rdx 1179: xor %ebp,%ebp # zero the %rbp 117b: and $0xfffffffffffffff0,%rsp # align the %rsp 117f: call 1000 <main> 1184: movzbq %al,%rdi 1188: mov $0x3c,%rax 118f: syscall 1191: hlt 1192: data16 cs nopw 0x0(%rax,%rax,1) 119d: nopl (%rax) ``` Cc: Bedirhan KURT <windowz414@gnuweeb.org> Cc: Louvian Lyndal <louvianlyndal@gmail.com> Reported-by: Peter Cordes <peter@cordes.ca> Signed-off-by: Ammar Faizi <ammar.faizi@students.amikom.ac.id> [wt: I did this on purpose due to a misunderstanding of the spec, other archs will thus have to be rechecked, particularly i386] Cc: stable@vger.kernel.org Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-11-30 17:26:42 -08:00
Paul E. McKenney	e2c73a6860	rcu: Remove the RCU_FAST_NO_HZ Kconfig option All of the uses of CONFIG_RCU_FAST_NO_HZ=y that I have seen involve systems with RCU callbacks offloaded. In this situation, all that this Kconfig option does is slow down idle entry/exit with an additional allways-taken early exit. If this is the only use case, then this Kconfig option nothing but an attractive nuisance that needs to go away. This commit therefore removes the RCU_FAST_NO_HZ Kconfig option. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-11-30 17:24:47 -08:00
Paul E. McKenney	24eab6e1ff	torture: Remove RCU_FAST_NO_HZ from rcu scenarios All of the rcu scenarios that mentioning CONFIG_RCU_FAST_NO_HZ disable it. But this Kconfig option is disabled by default, so this commit removes the pointless "CONFIG_RCU_FAST_NO_HZ=n" lines from these scenarios. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-11-30 17:24:47 -08:00
Paul E. McKenney	f04cbe651b	torture: Remove RCU_FAST_NO_HZ from rcuscale and refscale scenarios All of the rcuscale and refscale scenarios that mention the Kconfig option CONFIG_RCU_FAST_NO_HZ disable it. But this Kconfig option is disabled by default, so this commit removes the pointless "CONFIG_RCU_FAST_NO_HZ=n" lines from these scenarios. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-11-30 17:24:46 -08:00
Paul E. McKenney	8c0abfd6d2	rcutorture: Add CONFIG_PREEMPT_DYNAMIC=n to tiny scenarios With CONFIG_PREEMPT_DYNAMIC=y, the kernel builds with CONFIG_PREEMPTION=y because preemption can be enabled at runtime. This prevents any tests of Tiny RCU or Tiny SRCU from running correctly. This commit therefore explicitly sets CONFIG_PREEMPT_DYNAMIC=n for those scenarios. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>	2021-11-30 17:20:58 -08:00
Kumar Kartikeya Dwivedi	d995816b77	libbpf: Avoid reload of imm for weak, unresolved, repeating ksym Alexei pointed out that we can use BPF_REG_0 which already contains imm from move_blob2blob computation. Note that we now compare the second insn's imm, but this should not matter, since both will be zeroed out for the error case for the insn populated earlier. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20211122235733.634914-4-memxor@gmail.com	2021-11-30 15:48:15 -08:00
Kumar Kartikeya Dwivedi	0270090d39	libbpf: Avoid double stores for success/failure case of ksym relocations Instead, jump directly to success case stores in case ret >= 0, else do the default 0 value store and jump over the success case. This is better in terms of readability. Readjust the code for kfunc relocation as well to follow a similar pattern, also leads to easier to follow code now. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20211122235733.634914-3-memxor@gmail.com	2021-11-30 15:48:14 -08:00
Joanne Koong	ec151037af	selftest/bpf/benchs: Add bpf_loop benchmark Add benchmark to measure the throughput and latency of the bpf_loop call. Testing this on my dev machine on 1 thread, the data is as follows: nr_loops: 10 bpf_loop - throughput: 198.519 ± 0.155 M ops/s, latency: 5.037 ns/op nr_loops: 100 bpf_loop - throughput: 247.448 ± 0.305 M ops/s, latency: 4.041 ns/op nr_loops: 500 bpf_loop - throughput: 260.839 ± 0.380 M ops/s, latency: 3.834 ns/op nr_loops: 1000 bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op nr_loops: 5000 bpf_loop - throughput: 264.211 ± 1.508 M ops/s, latency: 3.785 ns/op nr_loops: 10000 bpf_loop - throughput: 265.366 ± 3.054 M ops/s, latency: 3.768 ns/op nr_loops: 50000 bpf_loop - throughput: 235.986 ± 20.205 M ops/s, latency: 4.238 ns/op nr_loops: 100000 bpf_loop - throughput: 264.482 ± 0.279 M ops/s, latency: 3.781 ns/op nr_loops: 500000 bpf_loop - throughput: 309.773 ± 87.713 M ops/s, latency: 3.228 ns/op nr_loops: 1000000 bpf_loop - throughput: 262.818 ± 4.143 M ops/s, latency: 3.805 ns/op >From this data, we can see that the latency per loop decreases as the number of loops increases. On this particular machine, each loop had an overhead of about ~4 ns, and we were able to run ~250 million loops per second. Signed-off-by: Joanne Koong <joannekoong@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211130030622.4131246-5-joannekoong@fb.com	2021-11-30 10:56:28 -08:00
Joanne Koong	f6e659b7f9	selftests/bpf: Measure bpf_loop verifier performance This patch tests bpf_loop in pyperf and strobemeta, and measures the verifier performance of replacing the traditional for loop with bpf_loop. The results are as follows: ~strobemeta~ Baseline verification time 6808200 usec stack depth 496 processed 554252 insns (limit 1000000) max_states_per_insn 16 total_states 15878 peak_states 13489 mark_read 3110 #192 verif_scale_strobemeta:OK (unrolled loop) Using bpf_loop verification time 31589 usec stack depth 96+400 processed 1513 insns (limit 1000000) max_states_per_insn 2 total_states 106 peak_states 106 mark_read 60 #193 verif_scale_strobemeta_bpf_loop:OK ~pyperf600~ Baseline verification time 29702486 usec stack depth 368 processed 626838 insns (limit 1000000) max_states_per_insn 7 total_states 30368 peak_states 30279 mark_read 748 #182 verif_scale_pyperf600:OK (unrolled loop) Using bpf_loop verification time 148488 usec stack depth 320+40 processed 10518 insns (limit 1000000) max_states_per_insn 10 total_states 705 peak_states 517 mark_read 38 #183 verif_scale_pyperf600_bpf_loop:OK Using the bpf_loop helper led to approximately a 99% decrease in the verification time and in the number of instructions. Signed-off-by: Joanne Koong <joannekoong@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211130030622.4131246-4-joannekoong@fb.com	2021-11-30 10:56:28 -08:00
Joanne Koong	4e5070b64b	selftests/bpf: Add bpf_loop test Add test for bpf_loop testing a variety of cases: various nr_loops, null callback ctx, invalid flags, nested callbacks. Signed-off-by: Joanne Koong <joannekoong@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211130030622.4131246-3-joannekoong@fb.com	2021-11-30 10:56:28 -08:00
Joanne Koong	e6f2dd0f80	bpf: Add bpf_loop helper This patch adds the kernel-side and API changes for a new helper function, bpf_loop: long bpf_loop(u32 nr_loops, void callback_fn, void callback_ctx, u64 flags); where long (callback_fn)(u32 index, void ctx); bpf_loop invokes the "callback_fn" nr_loops times or until the callback_fn returns 1. The callback_fn can only return 0 or 1, and this is enforced by the verifier. The callback_fn index is zero-indexed. A few things to please note: ~ The "u64 flags" parameter is currently unused but is included in case a future use case for it arises. ~ In the kernel-side implementation of bpf_loop (kernel/bpf/bpf_iter.c), bpf_callback_t is used as the callback function cast. ~ A program can have nested bpf_loop calls but the program must still adhere to the verifier constraint of its stack depth (the stack depth cannot exceed MAX_BPF_STACK)) ~ Recursive callback_fns do not pass the verifier, due to the call stack for these being too deep. ~ The next patch will include the tests and benchmark Signed-off-by: Joanne Koong <joannekoong@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211130030622.4131246-2-joannekoong@fb.com	2021-11-30 10:56:28 -08:00
Linus Torvalds	f080815fdb	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "ARM64: - Fix constant sign extension affecting TCR_EL2 and preventing running on ARMv8.7 models due to spurious bits being set - Fix use of helpers using PSTATE early on exit by always sampling it as soon as the exit takes place - Move pkvm's 32bit handling into a common helper RISC-V: - Fix incorrect KVM_MAX_VCPUS value - Unmap stage2 mapping when deleting/moving a memslot x86: - Fix and downgrade BUG_ON due to uninitialized cache - Many APICv and MOVE_ENC_CONTEXT_FROM fixes - Correctly emulate TLB flushes around nested vmentry/vmexit and when the nested hypervisor uses VPID - Prevent modifications to CPUID after the VM has run - Other smaller bugfixes Generic: - Memslot handling bugfixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (44 commits) KVM: fix avic_set_running for preemptable kernels KVM: VMX: clear vmx_x86_ops.sync_pir_to_irr if APICv is disabled KVM: SEV: accept signals in sev_lock_two_vms KVM: SEV: do not take kvm->lock when destroying KVM: SEV: Prohibit migration of a VM that has mirrors KVM: SEV: Do COPY_ENC_CONTEXT_FROM with both VMs locked selftests: sev_migrate_tests: add tests for KVM_CAP_VM_COPY_ENC_CONTEXT_FROM KVM: SEV: move mirror status to destination of KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM KVM: SEV: initialize regions_list of a mirror VM KVM: SEV: cleanup locking for KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM KVM: SEV: do not use list_replace_init on an empty list KVM: x86: Use a stable condition around all VT-d PI paths KVM: x86: check PIR even for vCPUs with disabled APICv KVM: VMX: prepare sync_pir_to_irr for running with APICv disabled KVM: selftests: page_table_test: fix calculation of guest_test_phys_mem KVM: x86/mmu: Handle "default" period when selectively waking kthread KVM: MMU: shadow nested paging does not have PKU KVM: x86/mmu: Remove spurious TLB flushes in TDP MMU zap collapsible path KVM: x86/mmu: Use yield-safe TDP MMU root iter in MMU notifier unmapping KVM: X86: Use vcpu->arch.walk_mmu for kvm_mmu_invlpg() ...	2021-11-30 09:22:15 -08:00
Matthew Wilcox (Oracle)	d6e6a27d96	tools: Fix math.h breakage Commit `98e1385ef2` ("include/linux/radix-tree.h: replace kernel.h with the necessary inclusions") broke the radix tree test suite in two different ways; first by including math.h which didn't exist in the tools directory, and second by removing an implicit include of spinlock.h before lockdep.h. Fix both issues. Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2021-11-30 09:14:42 -08:00
Mehrdad Arshad Rad	c291d0a4d1	libbpf: Remove duplicate assignments There is a same action when load_attr.attach_btf_id is initialized. Signed-off-by: Mehrdad Arshad Rad <arshad.rad@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20211128193337.10628-1-arshad.rad@gmail.com	2021-11-30 15:33:57 +01:00
Hangbin Liu	5944b5abd8	Bonding: add arp_missed_max option Currently, we use hard code number to verify if we are in the arp_interval timeslice. But some user may want to reduce/extend the verify timeslice. With the similar team option 'missed_max' the uers could change that number based on their own environment. Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-30 12:15:58 +00:00
Paolo Bonzini	17d44a96f0	KVM: SEV: Prohibit migration of a VM that has mirrors VMs that mirror an encryption context rely on the owner to keep the ASID allocated. Performing a KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM would cause a dangling ASID: 1. copy context from A to B (gets ref to A) 2. move context from A to L (moves ASID from A to L) 3. close L (releases ASID from L, B still references it) The right way to do the handoff instead is to create a fresh mirror VM on the destination first: 1. copy context from A to B (gets ref to A) [later] 2. close B (releases ref to A) 3. move context from A to L (moves ASID from A to L) 4. copy context from L to M So, catch the situation by adding a count of how many VMs are mirroring this one's encryption context. Fixes: `0b020f5af0` ("KVM: SEV: Add support for SEV-ES intra host migration") Message-Id: <20211123005036.2954379-11-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 03:54:14 -05:00
Paolo Bonzini	dc79c9f4eb	selftests: sev_migrate_tests: add tests for KVM_CAP_VM_COPY_ENC_CONTEXT_FROM I am putting the tests in sev_migrate_tests because the failure conditions are very similar and some of the setup code can be reused, too. The tests cover both successful creation of a mirror VM, and error conditions. Cc: Peter Gonda <pgonda@google.com> Cc: Sean Christopherson <seanjc@google.com> Message-Id: <20211123005036.2954379-9-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 03:54:13 -05:00
Maciej S. Szmigiero	81835ee113	KVM: selftests: page_table_test: fix calculation of guest_test_phys_mem A kvm_page_table_test run with its default settings fails on VMX due to memory region add failure: > ==== Test Assertion Failure ==== > lib/kvm_util.c:952: ret == 0 > pid=10538 tid=10538 errno=17 - File exists > 1 0x00000000004057d1: vm_userspace_mem_region_add at kvm_util.c:947 > 2 0x0000000000401ee9: pre_init_before_test at kvm_page_table_test.c:302 > 3 (inlined by) run_test at kvm_page_table_test.c:374 > 4 0x0000000000409754: for_each_guest_mode at guest_modes.c:53 > 5 0x0000000000401860: main at kvm_page_table_test.c:500 > 6 0x00007f82ae2d8554: ?? ??:0 > 7 0x0000000000401894: _start at ??:? > KVM_SET_USER_MEMORY_REGION IOCTL failed, > rc: -1 errno: 17 > slot: 1 flags: 0x0 > guest_phys_addr: 0xc0000000 size: 0x40000000 This is because the memory range that this test is trying to add (0x0c0000000 - 0x100000000) conflicts with LAPIC mapping at 0x0fee00000. Looking at the code it seems that guest_test_phys_mem variable gets mistakenly overwritten with guest_test_virt_mem while trying to adjust the former for alignment. With the correct variable adjusted this test runs successfully. Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <52e487458c3172923549bbcf9dfccfbe6faea60b.1637940473.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-11-30 03:12:13 -05:00
Jason A. Donenfeld	20ae1d6aa1	wireguard: device: reset peer src endpoint when netns exits Each peer's endpoint contains a dst_cache entry that takes a reference to another netdev. When the containing namespace exits, we take down the socket and prevent future sockets from being created (by setting creating_net to NULL), which removes that potential reference on the netns. However, it doesn't release references to the netns that a netdev cached in dst_cache might be taking, so the netns still might fail to exit. Since the socket is gimped anyway, we can simply clear all the dst_caches (by way of clearing the endpoint src), which will release all references. However, the current dst_cache_reset function only releases those references lazily. But it turns out that all of our usages of wg_socket_clear_peer_endpoint_src are called from contexts that are not exactly high-speed or bottle-necked. For example, when there's connection difficulty, or when userspace is reconfiguring the interface. And in particular for this patch, when the netns is exiting. So for those cases, it makes more sense to call dst_release immediately. For that, we add a small helper function to dst_cache. This patch also adds a test to netns.sh from Hangbin Liu to ensure this doesn't regress. Tested-by: Hangbin Liu <liuhangbin@gmail.com> Reported-by: Xiumei Mu <xmu@redhat.com> Cc: Toke Høiland-Jørgensen <toke@redhat.com> Cc: Paolo Abeni <pabeni@redhat.com> Fixes: `900575aa33` ("wireguard: device: avoid circular netns references") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-29 19:50:45 -08:00
Li Zhijian	7e938beb83	wireguard: selftests: rename DEBUG_PI_LIST to DEBUG_PLIST DEBUG_PI_LIST was renamed to DEBUG_PLIST since `8e18faeac3` ("lib/plist: rename DEBUG_PI_LIST to DEBUG_PLIST"). Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Fixes: `8e18faeac3` ("lib/plist: rename DEBUG_PI_LIST to DEBUG_PLIST") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-29 19:50:40 -08:00
Jason A. Donenfeld	782c72af56	wireguard: selftests: actually test for routing loops We previously removed the restriction on looping to self, and then added a test to make sure the kernel didn't blow up during a routing loop. The kernel didn't blow up, thankfully, but on certain architectures where skb fragmentation is easier, such as ppc64, the skbs weren't actually being discarded after a few rounds through. But the test wasn't catching this. So actually test explicitly for massive increases in tx to see if we have a routing loop. Note that the actual loop problem will need to be addressed in a different commit. Fixes: `b673e24aad` ("wireguard: socket: remove errant restriction on looping to self") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-29 19:50:29 -08:00
Jason A. Donenfeld	03ff1b1def	wireguard: selftests: increase default dmesg log size The selftests currently parse the kernel log at the end to track potential memory leaks. With these tests now reading off the end of the buffer, due to recent optimizations, some creation messages were lost, making the tests think that there was a free without an alloc. Fix this by increasing the kernel log size. Fixes: `24b70eeeb4` ("wireguard: use synchronize_net rather than synchronize_rcu") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-29 19:50:29 -08:00
Alan Maguire	43174f0d45	libbpf: Silence uninitialized warning/error in btf_dump_dump_type_data When compiling libbpf with gcc 4.8.5, we see: CC staticobjs/btf_dump.o btf_dump.c: In function ‘btf_dump_dump_type_data.isra.24’: btf_dump.c:2296:5: error: ‘err’ may be used uninitialized in this function [-Werror=maybe-uninitialized] if (err < 0) ^ cc1: all warnings being treated as errors make: *** [staticobjs/btf_dump.o] Error 1 While gcc 4.8.5 is too old to build the upstream kernel, it's possible it could be used to build standalone libbpf which suffers from the same problem. Silence the error by initializing 'err' to 0. The warning/error seems to be a false positive since err is set early in the function. Regardless we shouldn't prevent libbpf from building for this. Fixes: `920d16af9b` ("libbpf: BTF dumper support for typed data") Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/1638180040-8037-1-git-send-email-alan.maguire@oracle.com	2021-11-29 09:36:44 -08:00
Ivan Vecera	754d71be52	selftests: net: bridge: fix typo in vlan_filtering dependency test Prior patch: ]# TESTS=vlmc_filtering_test ./bridge_vlan_mcast.sh TEST: Vlan multicast snooping enable [ OK ] Device "bridge" does not exist. TEST: Disable multicast vlan snooping when vlan filtering is disabled [FAIL] Vlan filtering is disabled but multicast vlan snooping is still enabled After patch: # TESTS=vlmc_filtering_test ./bridge_vlan_mcast.sh TEST: Vlan multicast snooping enable [ OK ] TEST: Disable multicast vlan snooping when vlan filtering is disabled [ OK ] Fixes: `f5a9dd58f4` ("selftests: net: bridge: add test for vlan_filtering dependency") Cc: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2021-11-29 12:49:53 +00:00
Hengqi Chen	baeead213e	selftests/bpf: Test BPF_MAP_TYPE_PROG_ARRAY static initialization Add testcase for BPF_MAP_TYPE_PROG_ARRAY static initialization. Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211128141633.502339-3-hengqi.chen@gmail.com	2021-11-28 22:24:57 -08:00
Hengqi Chen	341ac5ffc4	libbpf: Support static initialization of BPF_MAP_TYPE_PROG_ARRAY Support static initialization of BPF_MAP_TYPE_PROG_ARRAY with a syntax similar to map-in-map initialization ([0]): SEC("socket") int tailcall_1(void ctx) { return 0; } struct { __uint(type, BPF_MAP_TYPE_PROG_ARRAY); __uint(max_entries, 2); __uint(key_size, sizeof(__u32)); __array(values, int (void )); } prog_array_init SEC(".maps") = { .values = { [1] = (void *)&tailcall_1, }, }; Here's the relevant part of libbpf debug log showing what's going on with prog-array initialization: libbpf: sec '.relsocket': collecting relocation for section(3) 'socket' libbpf: sec '.relsocket': relo #0: insn #2 against 'prog_array_init' libbpf: prog 'entry': found map 0 (prog_array_init, sec 4, off 0) for insn #0 libbpf: .maps relo #0: for 3 value 0 rel->r_offset 32 name 53 ('tailcall_1') libbpf: .maps relo #0: map 'prog_array_init' slot [1] points to prog 'tailcall_1' libbpf: map 'prog_array_init': created successfully, fd=5 libbpf: map 'prog_array_init': slot [1] set to prog 'tailcall_1' fd=6 [0] Closes: https://github.com/libbpf/libbpf/issues/354 Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211128141633.502339-2-hengqi.chen@gmail.com	2021-11-28 22:24:52 -08:00
Kuniyuki Iwashima	5ce7ab4961	af_unix: Remove UNIX_ABSTRACT() macro and test sun_path[0] instead. In BSD and abstract address cases, we store sockets in the hash table with keys between 0 and UNIX_HASH_SIZE - 1. However, the hash saved in a socket varies depending on its address type; sockets with BSD addresses always have UNIX_HASH_SIZE in their unix_sk(sk)->addr->hash. This is just for the UNIX_ABSTRACT() macro used to check the address type. The difference of the saved hashes comes from the first byte of the address in the first place. So, we can test it directly. Then we can keep a real hash in each socket and replace unix_table_lock with per-hash locks in the later patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 18:01:56 -08:00
Nikolay Aleksandrov	f5a9dd58f4	selftests: net: bridge: add test for vlan_filtering dependency Add a test for dependency of mcast_vlan_snooping on vlan_filtering. If vlan_filtering gets disabled, then mcast_vlan_snooping must be automatically disabled as well. TEST: Disable multicast vlan snooping when vlan filtering is disabled [ OK ] Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2021-11-26 16:43:17 -08:00

... 52 53 54 55 56 ...

30456 Commits