linux/kernel
Song Liu 39d8f0d102 bpf: Use raw_spin_trylock() for pcpu_freelist_push/pop in NMI
Recent improvements in LOCKDEP highlighted a potential A-A deadlock with
pcpu_freelist in NMI:

./tools/testing/selftests/bpf/test_progs -t stacktrace_build_id_nmi

[   18.984807] ================================
[   18.984807] WARNING: inconsistent lock state
[   18.984808] 5.9.0-rc6-01771-g1466de1330e1 #2967 Not tainted
[   18.984809] --------------------------------
[   18.984809] inconsistent {INITIAL USE} -> {IN-NMI} usage.
[   18.984810] test_progs/1990 [HC2[2]:SC0[0]:HE0:SE1] takes:
[   18.984810] ffffe8ffffc219c0 (&head->lock){....}-{2:2}, at: __pcpu_freelist_pop+0xe3/0x180
[   18.984813] {INITIAL USE} state was registered at:
[   18.984814]   lock_acquire+0x175/0x7c0
[   18.984814]   _raw_spin_lock+0x2c/0x40
[   18.984815]   __pcpu_freelist_pop+0xe3/0x180
[   18.984815]   pcpu_freelist_pop+0x31/0x40
[   18.984816]   htab_map_alloc+0xbbf/0xf40
[   18.984816]   __do_sys_bpf+0x5aa/0x3ed0
[   18.984817]   do_syscall_64+0x2d/0x40
[   18.984818]   entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   18.984818] irq event stamp: 12
[...]
[   18.984822] other info that might help us debug this:
[   18.984823]  Possible unsafe locking scenario:
[   18.984823]
[   18.984824]        CPU0
[   18.984824]        ----
[   18.984824]   lock(&head->lock);
[   18.984826]   <Interrupt>
[   18.984826]     lock(&head->lock);
[   18.984827]
[   18.984828]  *** DEADLOCK ***
[   18.984828]
[   18.984829] 2 locks held by test_progs/1990:
[...]
[   18.984838]  <NMI>
[   18.984838]  dump_stack+0x9a/0xd0
[   18.984839]  lock_acquire+0x5c9/0x7c0
[   18.984839]  ? lock_release+0x6f0/0x6f0
[   18.984840]  ? __pcpu_freelist_pop+0xe3/0x180
[   18.984840]  _raw_spin_lock+0x2c/0x40
[   18.984841]  ? __pcpu_freelist_pop+0xe3/0x180
[   18.984841]  __pcpu_freelist_pop+0xe3/0x180
[   18.984842]  pcpu_freelist_pop+0x17/0x40
[   18.984842]  ? lock_release+0x6f0/0x6f0
[   18.984843]  __bpf_get_stackid+0x534/0xaf0
[   18.984843]  bpf_prog_1fd9e30e1438d3c5_oncpu+0x73/0x350
[   18.984844]  bpf_overflow_handler+0x12f/0x3f0

This is because pcpu_freelist_head.lock is accessed in both NMI and
non-NMI context. Fix this issue by using raw_spin_trylock() in NMI.

Since NMI interrupts non-NMI context, when NMI context tries to lock the
raw_spinlock, non-NMI context of the same CPU may already have locked a
lock and is blocked from unlocking the lock. For a system with N CPUs,
there could be N NMIs at the same time, and they may block N non-NMI
raw_spinlocks. This is tricky for pcpu_freelist_push(), where unlike
_pop(), failing _push() means leaking memory. This issue is more likely to
trigger in non-SMP system.

Fix this issue with an extra list, pcpu_freelist.extralist. The extralist
is primarily used to take _push() when raw_spin_trylock() failed on all
the per CPU lists. It should be empty most of the time. The following
table summarizes the behavior of pcpu_freelist in NMI and non-NMI:

non-NMI pop(): 	use _lock(); check per CPU lists first;
                if all per CPU lists are empty, check extralist;
                if extralist is empty, return NULL.

non-NMI push(): use _lock(); only push to per CPU lists.

NMI pop():    use _trylock(); check per CPU lists first;
              if all per CPU lists are locked or empty, check extralist;
              if extralist is locked or empty, return NULL.

NMI push():   use _trylock(); check per CPU lists first;
              if all per CPU lists are locked; try push to extralist;
              if extralist is also locked, keep trying on per CPU lists.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20201005165838.3735218-1-songliubraving@fb.com
2020-10-06 00:04:11 +02:00
..
bpf bpf: Use raw_spin_trylock() for pcpu_freelist_push/pop in NMI 2020-10-06 00:04:11 +02:00
cgroup for-5.9/block-20200802 2020-08-03 11:57:03 -07:00
configs compiler: remove CONFIG_OPTIMIZE_INLINING entirely 2020-04-07 10:43:42 -07:00
debug treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
dma dma-pool: Fix an uninitialized variable bug in atomic_pool_expand() 2020-08-27 09:22:56 +02:00
entry core/entry: Report syscall correctly for trace and audit 2020-09-14 22:49:51 +02:00
events treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
gcov gcov: add support for GCC 10.1 2020-09-11 09:33:54 -07:00
irq Three interrupt related fixes for X86: 2020-08-30 12:01:23 -07:00
kcsan Merge branch 'kcsan' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into locking/core 2020-08-01 09:26:27 +02:00
livepatch livepatch: Make klp_apply_object_relocs static 2020-05-11 00:31:38 +02:00
locking locking/percpu-rwsem: Use this_cpu_{inc,dec}() for read_count 2020-09-16 16:26:56 +02:00
power treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
printk Printk changes for 5.9 2020-08-04 22:22:25 -07:00
rcu rcu-tasks: Enclose task-list scan in rcu_read_lock() 2020-09-16 16:32:38 -07:00
sched A set of fixes for lockdep, tracing and RCU: 2020-08-30 11:43:50 -07:00
time treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
trace bpf: Introducte bpf_this_cpu_ptr() 2020-10-02 15:00:49 -07:00
.gitignore .gitignore: add SPDX License Identifier 2020-03-25 11:50:48 +01:00
acct.c mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
async.c treewide: Remove uninitialized_var() usage 2020-07-16 12:35:15 -07:00
audit_fsnotify.c fsnotify: create method handle_inode_event() in fsnotify_operations 2020-07-27 23:25:50 +02:00
audit_tree.c \n 2020-08-06 19:29:51 -07:00
audit_watch.c fsnotify: create method handle_inode_event() in fsnotify_operations 2020-07-27 23:25:50 +02:00
audit.c audit/stable-5.9 PR 20200803 2020-08-04 14:20:26 -07:00
audit.h revert: 1320a4052e ("audit: trigger accompanying records when no rules present") 2020-07-29 10:00:36 -04:00
auditfilter.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
auditsc.c audit/stable-5.9 PR 20200803 2020-08-04 14:20:26 -07:00
backtracetest.c treewide: Replace DECLARE_TASKLET() with DECLARE_TASKLET_OLD() 2020-07-30 11:15:58 -07:00
bounds.c
capability.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
compat.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
configs.c
context_tracking.c context_tracking: Ensure that the critical path cannot be instrumented 2020-06-11 15:14:36 +02:00
cpu_pm.c kernel/cpu_pm: Fix uninitted local in cpu_pm 2020-05-15 11:44:34 -07:00
cpu.c The changes in this cycle are: 2020-06-03 13:06:42 -07:00
crash_core.c kdump: append kernel build-id string to VMCOREINFO 2020-08-12 10:58:01 -07:00
crash_dump.c crash_dump: Remove no longer used saved_max_pfn 2020-04-15 11:21:54 +02:00
cred.c exec: Teach prepare_exec_creds how exec treats uids & gids 2020-05-20 14:44:21 -05:00
delayacct.c
dma.c
elfcore.c
exec_domain.c
exit.c kernel: add a kernel_wait helper 2020-08-12 10:57:59 -07:00
extable.c kernel/extable.c: use address-of operator on section symbols 2020-04-07 10:43:42 -07:00
fail_function.c
fork.c fork: adjust sysctl_max_threads definition to match prototype 2020-09-05 12:14:29 -07:00
freezer.c
futex.c futex: Convert to use the preferred 'fallthrough' macro 2020-08-13 21:02:12 +02:00
gen_kheaders.sh kbuild: add variables for compression tools 2020-06-06 23:42:01 +09:00
groups.c mm: remove the pgprot argument to __vmalloc 2020-06-02 10:59:11 -07:00
hung_task.c kernel/hung_task.c: introduce sysctl to print all traces when a hung task is detected 2020-06-08 11:05:56 -07:00
iomem.c
irq_work.c irq_work, smp: Allow irq_work on call_single_queue 2020-05-28 10:54:15 +02:00
jump_label.c
kallsyms.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
kcmp.c kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve 2020-03-25 10:04:01 -05:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kcov.c kcov: make some symbols static 2020-08-12 10:58:02 -07:00
kexec_core.c
kexec_elf.c
kexec_file.c Misc fixes and small updates all around the place: 2020-08-15 10:38:03 -07:00
kexec_internal.h
kexec.c
kheaders.c
kmod.c kmod: remove redundant "be an" in the comment 2020-08-12 10:58:01 -07:00
kprobes.c Tracing fixes: 2020-09-22 09:08:33 -07:00
ksysfs.c
kthread.c uaccess: add force_uaccess_{begin,end} helpers 2020-08-12 10:57:59 -07:00
latencytop.c sysctl: pass kernel pointers to ->proc_handler 2020-04-27 02:07:40 -04:00
Makefile bpf: Add kernel module with user mode driver that populates bpffs. 2020-08-20 16:02:36 +02:00
module_signature.c
module_signing.c
module-internal.h
module.c Modules updates for v5.9 2020-08-14 11:07:02 -07:00
notifier.c mm: remove vmalloc_sync_(un)mappings() 2020-06-02 10:59:12 -07:00
nsproxy.c nsproxy: support CLONE_NEWTIME with setns() 2020-07-08 11:14:22 +02:00
padata.c padata: fix possible padata_works_lock deadlock 2020-09-04 17:51:55 +10:00
panic.c panic: make print_oops_end_marker() static 2020-08-12 10:58:02 -07:00
params.c
pid_namespace.c pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid 2020-07-19 20:14:42 +02:00
pid.c cap-checkpoint-restore-v5.9 2020-08-04 15:02:07 -07:00
profile.c
ptrace.c
range.c
reboot.c arch: remove unicore32 port 2020-07-01 12:09:13 +03:00
regset.c regset: kill ->get() 2020-07-27 14:31:12 -04:00
relay.c kernel/relay.c: fix memleak on destroy relay channel 2020-08-21 09:52:53 -07:00
resource.c /dev/mem: Revoke mappings when a driver claims the region 2020-05-27 11:10:05 +02:00
rseq.c
scs.c mm: memcontrol: account kernel stack per node 2020-08-07 11:33:25 -07:00
seccomp.c seccomp: don't leave dangling ->notif if file allocation fails 2020-09-08 11:30:16 -07:00
signal.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
smp.c smp: Fix a potential usage of stale nr_cpus 2020-07-22 10:22:04 +02:00
smpboot.c
smpboot.h
softirq.c tasklets API update for v5.9-rc1 2020-08-04 13:40:35 -07:00
stackleak.c stackleak: let stack_erasing_sysctl take a kernel pointer buffer 2020-09-19 13:13:39 -07:00
stacktrace.c uaccess: add force_uaccess_{begin,end} helpers 2020-08-12 10:57:59 -07:00
stop_machine.c
sys_ni.c all arch: remove system call sys_sysctl 2020-08-14 19:56:56 -07:00
sys.c treewide: Use fallthrough pseudo-keyword 2020-08-23 17:36:59 -05:00
sysctl-test.c
sysctl.c mm: allow a controlled amount of unfairness in the page lock 2020-09-17 10:26:41 -07:00
task_work.c task_work: only grab task signal lock when needed 2020-08-13 09:01:38 -06:00
taskstats.c
test_kprobes.c
torture.c torture: Dump ftrace at shutdown only if requested 2020-06-29 12:01:45 -07:00
tracepoint.c
tsacct.c
ucount.c ucount: Make sure ucounts in /proc/sys/user don't regress again 2020-04-07 21:51:27 +02:00
uid16.c
uid16.h
umh.c kernel: add a kernel_wait helper 2020-08-12 10:57:59 -07:00
up.c
user_namespace.c nsproxy: add struct nsset 2020-05-09 13:57:12 +02:00
user-return-notifier.c
user.c user.c: make uidhash_table static 2020-06-04 19:06:24 -07:00
usermode_driver.c umd: Stop using split_argv 2020-07-07 11:58:59 -05:00
utsname_sysctl.c sysctl: pass kernel pointers to ->proc_handler 2020-04-27 02:07:40 -04:00
utsname.c nsproxy: add struct nsset 2020-05-09 13:57:12 +02:00
watch_queue.c watch_queue: Limit the number of watches a user can hold 2020-08-17 09:39:18 -07:00
watchdog_hld.c
watchdog.c kernel/watchdog.c: convert {soft/hard}lockup boot parameters to sysctl aliases 2020-06-08 11:05:56 -07:00
workqueue_internal.h
workqueue.c maccess: rename probe_kernel_{read,write} to copy_{from,to}_kernel_nofault 2020-06-17 10:57:41 -07:00