linux/kernel
Srivatsa S. Bhat 93ae4f978c CPU hotplug: Provide lockless versions of callback registration functions
The following method of CPU hotplug callback registration is not safe
due to the possibility of an ABBA deadlock involving the cpu_add_remove_lock
and the cpu_hotplug.lock.

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

The deadlock is shown below:

          CPU 0                                         CPU 1
          -----                                         -----

   Acquire cpu_hotplug.lock
   [via get_online_cpus()]

                                              CPU online/offline operation
                                              takes cpu_add_remove_lock
                                              [via cpu_maps_update_begin()]

   Try to acquire
   cpu_add_remove_lock
   [via register_cpu_notifier()]

                                              CPU online/offline operation
                                              tries to acquire cpu_hotplug.lock
                                              [via cpu_hotplug_begin()]

                            *** DEADLOCK! ***

The problem here is that callback registration takes the locks in one order
whereas the CPU hotplug operations take the same locks in the opposite order.
To avoid this issue and to provide a race-free method to register CPU hotplug
callbacks (along with initialization of already online CPUs), introduce new
variants of the callback registration APIs that simply register the callbacks
without holding the cpu_add_remove_lock during the registration. That way,
we can avoid the ABBA scenario. However, we will need to hold the
cpu_add_remove_lock throughout the entire critical section, to protect updates
to the callback/notifier chain.

This can be achieved by writing the callback registration code as follows:

	cpu_maps_update_begin(); [ or cpu_notifier_register_begin(); see below ]

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	/* This doesn't take the cpu_add_remove_lock */
	__register_cpu_notifier(&foobar_cpu_notifier);

	cpu_maps_update_done();  [ or cpu_notifier_register_done(); see below ]

Note that we can't use get_online_cpus() here instead of cpu_maps_update_begin()
because the cpu_hotplug.lock is dropped during the invocation of CPU_POST_DEAD
notifiers, and hence get_online_cpus() cannot provide the necessary
synchronization to protect the callback/notifier chains against concurrent
reads and writes. On the other hand, since the cpu_add_remove_lock protects
the entire hotplug operation (including CPU_POST_DEAD), we can use
cpu_maps_update_begin/done() to guarantee proper synchronization.

Also, since cpu_maps_update_begin/done() is like a super-set of
get/put_online_cpus(), the former naturally protects the critical sections
from concurrent hotplug operations.

Since the names cpu_maps_update_begin/done() don't make much sense in CPU
hotplug callback registration scenarios, we'll introduce new APIs named
cpu_notifier_register_begin/done() and map them to cpu_maps_update_begin/done().

In summary, introduce the lockless variants of un/register_cpu_notifier() and
also export the cpu_notifier_register_begin/done() APIs for use by modules.
This way, we provide a race-free way to register hotplug callbacks as well as
perform initialization for the CPUs that are already online.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-20 13:43:40 +01:00
..
cpu sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding 2014-01-13 17:38:55 +01:00
debug kgdb/kdb: Fix no KDB config problem 2014-01-25 08:55:09 +01:00
events perf: Fix hotplug splat 2014-02-27 12:38:03 +01:00
gcov gcov: reuse kbasename helper 2013-11-13 12:09:34 +09:00
irq genirq: Include missing header file in irqdomain.c 2014-02-27 13:29:35 +01:00
locking Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-01-20 10:42:08 -08:00
power arm, pm, vmpressure: add missing slab.h includes 2014-02-03 13:24:01 -05:00
printk printk: fix syslog() overflowing user buffer 2014-02-17 12:24:45 -08:00
rcu Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-01-28 08:38:04 -08:00
sched sched/clock: Prevent tracing recursion in sched_clock_cpu() 2014-03-11 11:33:48 +01:00
time sched_clock: Prevent callers from seeing half-updated data 2014-02-19 17:07:22 +01:00
trace tracing: Do not add event files for modules that fail tracepoints 2014-03-03 21:11:05 -05:00
.gitignore Ignore generated file kernel/x509_certificate_list 2013-12-10 18:21:34 +00:00
acct.c fs: Fix hang with BSD accounting on frozen filesystem 2013-05-04 14:57:58 -04:00
async.c
audit_tree.c inotify: Fix reporting of cookies for inotify events 2014-02-18 11:17:17 +01:00
audit_watch.c inotify: Fix reporting of cookies for inotify events 2014-02-18 11:17:17 +01:00
audit.c audit: Update kdoc for audit_send_reply and audit_list_rules_send 2014-03-08 15:31:54 -08:00
audit.h audit: Use struct net not pid_t to remember the network namespce to reply in 2014-02-28 04:04:33 -08:00
auditfilter.c audit: Update kdoc for audit_send_reply and audit_list_rules_send 2014-03-08 15:31:54 -08:00
auditsc.c execve: use 'struct filename *' for executable name passing 2014-02-05 12:54:53 -08:00
backtracetest.c
bounds.c mm: do not allocate page->ptl dynamically, if spinlock_t fits to long 2013-12-20 12:25:45 -08:00
capability.c audit: Simplify and correct audit_log_capset 2014-01-13 22:26:48 -05:00
cgroup_freezer.c cgroup: replace cftype->read_seq_string() with cftype->seq_show() 2013-12-05 12:28:04 -05:00
cgroup.c cgroup: update cgroup_enable_task_cg_lists() to grab siglock 2014-02-18 18:23:18 -05:00
compat.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal 2013-05-01 07:21:43 -07:00
configs.c proc: Supply PDE attribute setting accessor functions 2013-05-01 17:29:18 -04:00
context_tracking.c context_tracking: Wrap static key check into more intuitive function name 2013-12-02 20:43:14 +01:00
cpu_pm.c
cpu.c CPU hotplug: Provide lockless versions of callback registration functions 2014-03-20 13:43:40 +01:00
cpuset.c cpuset: fix a race condition in __cpuset_node_allowed_softwall() 2014-02-27 09:39:54 -05:00
crash_dump.c
cred.c
delayacct.c kernel/delayacct.c: remove redundant checking in __delayacct_add_tsk() 2013-11-13 12:09:12 +09:00
dma.c
elfcore.c switch elf_core_write_extra_phdrs() to dump_emit() 2013-11-09 00:16:23 -05:00
exec_domain.c
exit.c introduce for_each_thread() to replace the buggy while_each_thread() 2014-01-21 16:19:46 -08:00
extable.c kernel/extable: fix address-checks for core_kernel and init areas 2013-11-28 09:49:41 -08:00
fork.c exec: kill task_struct->did_exec 2014-01-23 16:37:02 -08:00
freezer.c libata, freezer: avoid block device removal while system is frozen 2013-12-19 13:50:32 -05:00
futex_compat.c
futex.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-01-20 10:42:08 -08:00
groups.c userns: Kill nsown_capable it makes the wrong thing easy 2013-08-30 23:44:11 -07:00
hrtimer.c sched/deadline: Add SCHED_DEADLINE structures & implementation 2014-01-13 13:41:06 +01:00
hung_task.c hung_task: Display every hung task warning 2014-01-25 12:13:33 +01:00
irq_work.c
itimer.c
jump_label.c static_key: WARN on usage before jump_label_init was called 2013-10-19 19:45:35 -04:00
kallsyms.c
kcmp.c
Kconfig.freezer
Kconfig.hz kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS 2013-11-15 09:32:22 +09:00
Kconfig.locks locking: Fix copy/paste errors of "ARCH_INLINE_*_UNLOCK_BH" 2013-05-28 08:50:00 +02:00
Kconfig.preempt
kexec.c kernel/kexec.c: use vscnprintf() instead of vsnprintf() in vmcoreinfo_append_str() 2014-01-27 21:02:40 -08:00
kmod.c execve: use 'struct filename *' for executable name passing 2014-02-05 12:54:53 -08:00
kprobes.c kprobes: use KSYM_NAME_LEN to size identifier buffers 2013-11-13 12:09:26 +09:00
ksysfs.c kdump: fix exported size of vmcoreinfo note 2014-01-23 16:37:03 -08:00
kthread.c kthread: make kthread_create() killable 2013-11-13 12:08:59 +09:00
latencytop.c
Makefile KEYS: Remove files generated when SYSTEM_TRUSTED_KEYRING=y 2013-12-13 15:59:11 +00:00
module_signing.c keys: change asymmetric keys to use common hash definitions 2013-10-25 17:15:18 -04:00
module-internal.h KEYS: Separate the kernel signature checking keyring from module signing 2013-09-25 17:17:01 +01:00
module.c module: Add missing newline in printk call. 2014-01-21 09:59:16 +10:30
notifier.c
nsproxy.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2013-09-07 14:35:32 -07:00
padata.c padata: Fix wrong usage of rcu_dereference() 2013-12-05 21:28:42 +08:00
panic.c panic: Make panic_timeout configurable 2013-11-26 12:12:26 +01:00
params.c params: improve standard definitions 2013-12-04 14:09:46 +10:30
pid_namespace.c pid_namespace: make freeing struct pid_namespace rcu-delayed 2013-10-24 23:43:29 -04:00
pid.c pidns: fix free_pid() to handle the first fork failure 2013-09-30 14:31:03 -07:00
posix-cpu-timers.c posix-timers: Convert abuses of BUG_ON to WARN_ON 2013-12-09 16:56:29 +01:00
posix-timers.c
profile.c mm: fix GFP_THISNODE callers and clarify 2014-03-10 17:26:19 -07:00
ptrace.c exec/ptrace: fix get_dumpable() incorrect tests 2013-11-13 12:09:33 +09:00
range.c range: Do not add new blank slot with add_range_with_merge 2013-06-18 11:32:10 -05:00
reboot.c kexec: migrate to reboot cpu 2013-12-18 19:04:50 -08:00
relay.c kernel: delete __cpuinit usage from all core kernel files 2013-07-14 19:36:59 -04:00
res_counter.c memcg: reduce function dereference 2013-09-12 15:38:02 -07:00
resource.c kernel/resource.c: remove the unneeded assignment in function __find_resource 2013-07-03 16:08:06 -07:00
seccomp.c
signal.c kernel/signal.c: change do_signal_stop/do_sigaction to use while_each_thread() 2014-01-23 16:37:02 -08:00
smp.c kernel/smp.c: remove cpumask_ipi 2014-01-30 16:56:54 -08:00
smpboot.c kernel: delete __cpuinit usage from all core kernel files 2013-07-14 19:36:59 -04:00
smpboot.h
softirq.c Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-01-31 09:02:51 -08:00
stacktrace.c
stop_machine.c stop_machine: Fix^2 race between stop_two_cpus() and stop_cpus() 2014-03-11 11:33:47 +01:00
sys_ni.c unify compat fanotify_mark(2), switch to COMPAT_SYSCALL_DEFINE 2013-05-09 13:46:38 -04:00
sys.c kernel/sys.c: k_getrusage() can use while_each_thread() 2014-01-23 16:37:02 -08:00
sysctl_binary.c kernel/sysctl_binary.c: use scnprintf() instead of snprintf() 2013-11-13 12:09:33 +09:00
sysctl.c Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-01-31 08:59:46 -08:00
system_certificates.S KEYS: correct alignment of system_certificate_list content in assembly file 2013-12-10 18:25:28 +00:00
system_keyring.c KEYS: correct alignment of system_certificate_list content in assembly file 2013-12-10 18:25:28 +00:00
task_work.c task_work: documentation 2013-09-11 15:58:27 -07:00
taskstats.c genetlink: only pass array to genl_register_family_with_ops() 2013-11-19 16:39:05 -05:00
test_kprobes.c
time.c sched: Rename sched.c as sched/core.c in comments and Documentation 2013-06-19 12:58:42 +02:00
timeconst.bc
timer.c timer: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...) 2013-11-19 14:59:50 +01:00
tracepoint.c tracing: Do not add event files for modules that fail tracepoints 2014-03-03 21:11:05 -05:00
tsacct.c
uid16.c userns: Kill nsown_capable it makes the wrong thing easy 2013-08-30 23:44:11 -07:00
up.c kernel: provide a __smp_call_function_single stub for !CONFIG_SMP 2013-11-15 09:32:22 +09:00
user_namespace.c user_namespace.c: Remove duplicated word in comment 2014-02-20 11:58:35 -08:00
user-return-notifier.c
user.c KEYS: fix uninitialized persistent_keyring_register_sem 2013-12-13 15:59:11 +00:00
utsname_sysctl.c
utsname.c userns: Kill nsown_capable it makes the wrong thing easy 2013-08-30 23:44:11 -07:00
watchdog.c watchdog: update watchdog_thresh properly 2013-09-24 17:00:25 -07:00
workqueue_internal.h sched: Rename sched.c as sched/core.c in comments and Documentation 2013-06-19 12:58:42 +02:00
workqueue.c workqueue: ensure @task is valid across kthread_stop() 2014-02-18 16:35:20 -05:00