linux/kernel
Jan H. Schönherr dd9d384375 sched: Fix cpu_active_mask/cpu_online_mask race
There is a race condition in SMP bootup code, which may result
in

    WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
    workqueue_cpu_up_callback()
or
    kernel BUG at kernel/smpboot.c:135!

It can be triggered with a bit of luck in Linux guests running
on busy hosts.

	CPU0                        CPUn
	====                        ====

	_cpu_up()
	  __cpu_up()
				    start_secondary()
				      set_cpu_online()
					cpumask_set_cpu(cpu,
						   to_cpumask(cpu_online_bits));
	  cpu_notify(CPU_ONLINE)
	    <do stuff, see below>
					cpumask_set_cpu(cpu,
						   to_cpumask(cpu_active_bits));

During the various CPU_ONLINE callbacks CPUn is online but not
active. Several things can go wrong at that point, depending on
the scheduling of tasks on CPU0.

Variant 1:

  cpu_notify(CPU_ONLINE)
    workqueue_cpu_up_callback()
      rebind_workers()
        set_cpus_allowed_ptr()

  This call fails because it requires an active CPU; rebind_workers()
  ends with a warning:

    WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
    workqueue_cpu_up_callback()

Variant 2:

  cpu_notify(CPU_ONLINE)
    smpboot_thread_call()
      smpboot_unpark_threads()
       ..
        __kthread_unpark()
          __kthread_bind()
          wake_up_state()
           ..
            select_task_rq()
              select_fallback_rq()

  The ->wake_cpu of the unparked thread is not allowed, making a call
  to select_fallback_rq() necessary. Then, select_fallback_rq() cannot
  find an allowed, active CPU and promptly resets the allowed CPUs, so
  that the task in question ends up on CPU0.

  When those unparked tasks are eventually executed, they run
  immediately into a BUG:

    kernel BUG at kernel/smpboot.c:135!

Just changing the order in which the online/active bits are set
(and adding some memory barriers), would solve the two issues
above. However, it would change the order of operations back to
the one before commit 6acbfb9697 ("sched: Fix hotplug vs.
set_cpus_allowed_ptr()"), thus, reintroducing that particular
problem.

Going further back into history, we have at least the following
commits touching this topic:
- commit 2baab4e904 ("sched: Fix select_fallback_rq() vs cpu_active/cpu_online")
- commit 5fbd036b55 ("sched: Cleanup cpu_active madness")

Together, these give us the following non-working solutions:

  - secondary CPU sets active before online, because active is assumed to
    be a subset of online;

  - secondary CPU sets online before active, because the primary CPU
    assumes that an online CPU is also active;

  - secondary CPU sets online and waits for primary CPU to set active,
    because it might deadlock.

Commit 875ebe940d ("powerpc/smp: Wait until secondaries are
active & online") introduces an arch-specific solution to this
arch-independent problem.

Now, go for a more general solution without explicit waiting and
simply set active twice: once on the secondary CPU after online
was set and once on the primary CPU after online was seen.

set_cpus_allowed_ptr()")

Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@vger.kernel.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Wilson <msw@amazon.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 6acbfb9697 ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")
Link: http://lkml.kernel.org/r/1439408156-18840-1-git-send-email-jschoenh@amazon.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-08-25 12:14:39 +02:00
..
bpf bpf: allow networking programs to use bpf_trace_printk() for debugging 2015-06-15 15:53:50 -07:00
configs kconfig: add xenconfig defconfig helper 2015-06-16 11:04:29 +01:00
debug debug: prevent entering debug mode on panic/exception. 2015-02-19 12:39:03 -06:00
events perf: Fix PERF_EVENT_IOC_PERIOD migration race 2015-08-12 11:37:22 +02:00
gcov gcov: add support for GCC 5.1 2015-06-30 19:44:57 -07:00
irq genirq: Introduce irq_chip_set_type_parent() helper 2015-08-20 00:25:25 +02:00
livepatch Merge branches 'for-4.1/upstream-fixes', 'for-4.2/kaslr' and 'for-4.2/upstream' into for-linus 2015-06-22 16:26:56 +02:00
locking locking/pvqspinlock: Fix kernel panic in locking-selftest 2015-07-21 10:18:07 +02:00
power Power management and ACPI fixes for v4.2-rc1 2015-07-01 14:17:44 -07:00
printk printk: improve the description of /dev/kmsg line format 2015-06-30 19:44:59 -07:00
rcu This patch series contains several clean ups and even a new trace clock 2015-06-26 14:02:43 -07:00
sched sched: Fix cpu_active_mask/cpu_online_mask race 2015-08-25 12:14:39 +02:00
time timer: Write timer->flags atomically 2015-08-18 15:31:16 +02:00
trace ftrace: Fix breakage of set_ftrace_pid 2015-07-24 13:58:14 -04:00
.gitignore
acct.c acct: check FMODE_CAN_WRITE 2015-04-11 22:27:55 -04:00
async.c
audit_tree.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
audit_watch.c VFS: audit: d_backing_inode() annotations 2015-04-15 15:06:55 -04:00
audit.c Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit 2015-06-27 13:53:16 -07:00
audit.h Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit 2015-04-22 14:49:23 -07:00
auditfilter.c Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit 2015-02-11 20:07:47 -08:00
auditsc.c Fix broken audit tests for exec arg len 2015-07-08 09:33:38 -07:00
backtracetest.c
bounds.c
capability.c kernel: conditionally support non-root users, groups and capabilities 2015-04-15 16:35:22 -07:00
cgroup_freezer.c
cgroup.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2015-07-03 15:20:57 -07:00
compat.c compat: cleanup coding in compat_get_bitmap() and compat_put_bitmap() 2015-06-04 23:57:18 +02:00
configs.c
context_tracking.c context_tracking: Inherit TIF_NOHZ through forks instead of context switches 2015-05-07 12:02:51 +02:00
cpu_pm.c
cpu.c genirq: Revert sparse irq locking around __cpu_up() and move it to x86 for now 2015-07-15 10:39:17 +02:00
cpuset.c cpuset: use trialcs->mems_allowed as a temp variable 2015-08-10 11:18:41 -04:00
crash_dump.c
cred.c kernel: conditionally support non-root users, groups and capabilities 2015-04-15 16:35:22 -07:00
delayacct.c
dma.c
elfcore.c
exec_domain.c Remove rest of exec domains. 2015-04-12 21:03:31 +02:00
exit.c exit,stats: /* obey this comment */ 2015-06-25 17:00:43 -07:00
extable.c
fork.c x86/fpu, sched: Introduce CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT and use it on x86 2015-07-18 03:42:51 +02:00
freezer.c
futex_compat.c
futex.c Merge branch 'sched-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-06-24 14:46:01 -07:00
groups.c kernel: conditionally support non-root users, groups and capabilities 2015-04-15 16:35:22 -07:00
hung_task.c kernel/hung_task.c: change hung_task.c to use for_each_process_thread() 2015-04-15 16:35:22 -07:00
irq_work.c
jump_label.c module, jump_label: Fix module locking 2015-05-27 11:09:50 +09:30
kallsyms.c
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks locking/qrwlock: Rename QUEUE_RWLOCK to QUEUED_RWLOCKS 2015-05-12 09:46:00 +02:00
Kconfig.preempt
kexec.c kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path 2015-06-30 19:44:57 -07:00
kmod.c
kprobes.c kprobes: makes kprobes/enabled works correctly for optimized kprobes. 2015-02-13 21:21:42 -08:00
ksysfs.c
kthread.c kthread: export kthread functions 2015-08-07 04:39:41 +03:00
latencytop.c
Makefile make certificate list change message more useful 2015-07-02 16:42:13 -07:00
module_signing.c
module-internal.h
module.c module: weaken locking assertion for oops path. 2015-07-29 06:13:22 +09:30
notifier.c rcu: Make SRCU optional by using CONFIG_SRCU 2015-01-06 11:04:29 -08:00
nsproxy.c
padata.c padata: use %*pb[l] to print bitmaps including cpumasks and nodemasks 2015-02-13 21:21:38 -08:00
panic.c kernel/panic/kexec: fix "crash_kexec_post_notifiers" option issue in oops path 2015-06-30 19:44:57 -07:00
params.c Minor merge needed, due to function move. 2015-07-01 10:49:25 -07:00
pid_namespace.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-12-16 15:53:03 -08:00
pid.c fork: report pid reservation failure properly 2015-04-17 09:04:06 -04:00
profile.c profile: use %*pb[l] to print bitmaps including cpumasks and nodemasks 2015-02-13 21:21:38 -08:00
ptrace.c ptrace: ptrace_detach() can no longer race with SIGKILL 2015-04-17 09:04:06 -04:00
range.c kernel: avoid overflow in cmp_range 2015-01-17 10:02:23 +13:00
reboot.c kernel/reboot.c: add orderly_reboot for graceful reboot 2015-04-15 16:35:23 -07:00
relay.c kernel/relay.c: use kvfree() in relay_free_page_array() 2015-06-30 19:44:59 -07:00
resource.c mm: Fix bugs in region_is_ram() 2015-07-22 17:20:34 +02:00
seccomp.c seccomp, filter: add and use bpf_prog_create_from_user from seccomp 2015-05-09 17:35:05 -04:00
signal.c signal: fix information leak in copy_siginfo_to_user 2015-08-07 04:39:40 +03:00
smp.c smp: Fix error case handling in smp_call_function_*() 2015-04-19 13:19:23 -07:00
smpboot.c watchdog: add watchdog_cpumask sysctl to assist nohz 2015-06-24 17:49:40 -07:00
smpboot.h
softirq.c Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-02-09 15:24:03 -08:00
stacktrace.c stacktrace: introduce snprint_stack_trace for buffer output 2014-12-13 12:42:48 -08:00
stop_machine.c sched/stop_machine: Fix deadlock between multiple stop_two_cpus() 2015-06-19 10:03:12 +02:00
sys_ni.c kernel: conditionally support non-root users, groups and capabilities 2015-04-15 16:35:22 -07:00
sys.c prctl: more prctl(PR_SET_MM_*) checks 2015-06-25 17:00:37 -07:00
sysctl_binary.c
sysctl.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2015-07-03 15:20:57 -07:00
system_certificates.S
system_keyring.c
task_work.c
taskstats.c netlink: make nlmsg_end() and genlmsg_end() void 2015-01-18 01:03:45 -05:00
test_kprobes.c
torture.c rcu: Convert ACCESS_ONCE() to READ_ONCE() and WRITE_ONCE() 2015-05-27 12:56:15 -07:00
tracepoint.c
tsacct.c
uid16.c
up.c
user_namespace.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2014-12-17 12:31:40 -08:00
user-return-notifier.c
user.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2014-12-17 12:31:40 -08:00
utsname_sysctl.c
utsname.c
watchdog.c watchdog: add watchdog_cpumask sysctl to assist nohz 2015-06-24 17:49:40 -07:00
workqueue_internal.h
workqueue.c Minor merge needed, due to function move. 2015-07-01 10:49:25 -07:00