linux/kernel
Gautham R Shenoy 8558f8f816 rcu: fix hotplug vs rcu race
Dhaval Giani reported this warning during cpu hotplug stress-tests:

| On running kernel compiles in parallel with cpu hotplug:
|
| WARNING: at arch/x86/kernel/smp.c:118
| native_smp_send_reschedule+0x21/0x36()
| Modules linked in:
| Pid: 27483, comm: cc1 Not tainted 2.6.26-rc7 #1
| [...]
|  [<c0110355>] native_smp_send_reschedule+0x21/0x36
|  [<c014fe8f>] force_quiescent_state+0x47/0x57
|  [<c014fef0>] call_rcu+0x51/0x6d
|  [<c01713b3>] __fput+0x130/0x158
|  [<c0171231>] fput+0x17/0x19
|  [<c016fd99>] filp_close+0x4d/0x57
|  [<c016fdff>] sys_close+0x5c/0x97

IMHO the warning is a spurious one.

cpu_online_map is updated by the _cpu_down() using stop_machine_run().
Since force_quiescent_state is invoked from irqs disabled section,
stop_machine_run() won't be executing while a cpu is executing
force_quiescent_state(). Hence the cpu_online_map is stable while we're
in the irq disabled section.

However, a cpu might have been offlined _just_ before we disabled irqs
while entering force_quiescent_state(). And rcu subsystem might not yet
have handled the CPU_DEAD notification, leading to the offlined cpu's
bit being set in the rcp->cpumask.

Hence cpumask = (rcp->cpumask & cpu_online_map) to prevent sending
smp_reschedule() to an offlined CPU.

Here's the timeline:

CPU_A						 CPU_B
--------------------------------------------------------------
cpu_down():					.
.					   	.
.						.
stop_machine(): /* disables preemption,		.
		 * and irqs */			.
.						.
.						.
take_cpu_down();				.
.						.
.						.
.						.
cpu_disable(); /*this removes cpu 		.
		*from cpu_online_map 		.
		*/				.
.						.
.						.
restart_machine(); /* enables irqs */		.
------WINDOW DURING WHICH rcp->cpumask is stale ---------------
.						call_rcu();
.						/* disables irqs here */
.						.force_quiescent_state();
.CPU_DEAD:					.for_each_cpu(rcp->cpumask)
.						.   smp_send_reschedule();
.						.
.						.   WARN_ON() for offlined CPU!
.
.
.
rcu_cpu_notify:
.
-------- WINDOW ENDS ------------------------------------------
rcu_offline_cpu() /* Which calls cpu_quiet()
		   * which removes
		   * cpu from rcp->cpumask.
		   */

If a new batch was started just before calling stop_machine_run(), the
"tobe-offlined" cpu is still present in rcp-cpumask.

During a cpu-offline, from take_cpu_down(), we queue an rt-prio idle
task as the next task to be picked by the scheduler. We also call
cpu_disable() which will disable any further interrupts and remove the
cpu's bit from the cpu_online_map.

Once the stop_machine_run() successfully calls take_cpu_down(), it calls
schedule(). That's the last time a schedule is called on the offlined
cpu, and hence the last time when rdp->passed_quiesc will be set to 1
through rcu_qsctr_inc().

But the cpu_quiet() will be on this cpu will be called only when the
next RCU_SOFTIRQ occurs on this CPU. So at this time, the offlined CPU
is still set in rcp->cpumask.

Now coming back to the idle_task which truely offlines the CPU, it does
check for a pending RCU and raises the softirq, since it will find
rdp->passed_quiesc to be 0 in this case. However, since the cpu is
offline I am not sure if the softirq will trigger on the CPU.

Even if it doesn't the rcu_offline_cpu() will find that rcp->completed
is not the same as rcp->cur, which means that our cpu could be holding
up the grace period progression. Hence we call cpu_quiet() and move
ahead.

But because of the window explained in the timeline, we could still have
a call_rcu() before the RCU subsystem executes it's CPU_DEAD
notification, and we send smp_send_reschedule() to offlined cpu while
trying to force the quiescent states. The appended patch adds comments
and prevents checking for offlined cpu everytime.

cpu_online_map is updated by the _cpu_down() using stop_machine_run().
Since force_quiescent_state is invoked from irqs disabled section,
stop_machine_run() won't be executing while a cpu is executing
force_quiescent_state(). Hence the cpu_online_map is stable while we're
in the irq disabled section.

Reported-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Rusty Russel <rusty@rustcorp.com.au>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-07-01 09:27:17 +02:00
..
irq genirq: reenable a nobody cared disabled irq when a new driver arrives 2008-05-02 13:40:34 +02:00
power Merge branches 'release', 'acpica', 'bugzilla-10224', 'bugzilla-9772', 'bugzilla-9916', 'ec', 'eeepc', 'idle', 'misc', 'pm-legacy', 'sysfs-links-2.6.26', 'thermal', 'thinkpad' and 'video' into release 2008-04-30 13:58:00 -04:00
time clocksource: allow read access to available/current_clocksource 2008-05-03 18:11:48 +02:00
.gitignore Update kernel/.gitignore with new auto-generated files 2008-02-09 23:27:01 -08:00
acct.c bsd_acct: using task_struct->tgid is not right in pid-namespaces 2008-03-24 19:22:20 -07:00
audit_tree.c [PATCH] list_for_each_rcu must die: audit 2008-05-17 03:30:23 -04:00
audit.c [patch 1/1] audit_send_reply(): fix error-path memory leak 2008-05-17 03:30:22 -04:00
audit.h [PATCH 1/2] audit: move extern declarations to audit.h 2008-04-28 06:28:04 -04:00
auditfilter.c Merge branch 'audit.b50' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current 2008-04-29 11:41:22 -07:00
auditsc.c [PATCH] new predicate - AUDIT_FILETYPE 2008-04-28 06:28:37 -04:00
backtracetest.c x86: add a simple backtrace test module 2008-01-30 13:33:08 +01:00
bounds.c Add kbuild.h that contains common definitions for kbuild users 2008-04-29 08:06:29 -07:00
capability.c capabilities: remain source compatible with 32-bit raw legacy capability support. 2008-05-31 16:36:16 -07:00
cgroup_debug.c CGroup API files: move "releasable" to cgroup_debug subsystem 2008-04-29 08:06:09 -07:00
cgroup.c cgroups: remove node_ prefix_from ns subsystem 2008-05-24 09:56:14 -07:00
compat.c ntp: support for TAI 2008-05-01 08:03:59 -07:00
configs.c kernel: use non-racy method for proc entries creation 2008-04-29 08:06:22 -07:00
cpu.c kernel: replace remaining __FUNCTION__ occurrences 2008-04-30 08:29:54 -07:00
cpuset.c Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2008-06-20 12:37:13 -07:00
delayacct.c Add scaled time to taskstats based process accounting 2007-10-18 14:37:28 -07:00
dma.c kernel: use non-racy method for proc entries creation 2008-04-29 08:06:22 -07:00
exec_domain.c
exit.c signals: fix sigqueue_free() vs __exit_signal() race 2008-05-24 09:56:10 -07:00
extable.c module: Don't report discarded init pages as kernel text. 2008-01-29 17:13:18 +11:00
fork.c [PATCH] dup_fd() fixes, part 1 2008-05-16 17:22:26 -04:00
futex_compat.c futex_compat __user annotation 2008-03-30 14:18:41 -07:00
futex.c futexes: fix fault handling in futex_lock_pi 2008-06-23 13:31:15 +02:00
hrtimer.c hrtimer: remove duplicate helper function 2008-05-03 18:11:48 +02:00
itimer.c ITIMER_REAL: convert to use struct pid 2008-02-08 09:22:29 -08:00
kallsyms.c kernel: use non-racy method for proc entries creation 2008-04-29 08:06:22 -07:00
Kconfig.hz sched: high-res preemption tick 2008-01-25 21:08:29 +01:00
Kconfig.preempt rcu: move PREEMPT_RCU config option back under PREEMPT 2008-03-10 18:01:20 -07:00
kexec.c kexec: make extended crashkernel= syntax less confusing 2008-05-01 08:04:00 -07:00
kfifo.c
kgdb.c kgdb: use common ascii helpers and put_unaligned_be32 helper 2008-05-28 12:49:56 -05:00
kmod.c [PATCH] split linux/file.h 2008-05-01 13:08:16 -04:00
kprobes.c kprobes: fix error checking of batch registration 2008-06-12 18:05:40 -07:00
ksysfs.c Kobject: convert remaining kobject_unregister() to kobject_put() 2008-01-24 20:40:40 -08:00
kthread.c Deprecate find_task_by_pid() 2008-04-30 08:29:48 -07:00
latencytop.c kernel: use non-racy method for proc entries creation 2008-04-29 08:06:22 -07:00
lockdep_internals.h
lockdep_proc.c kernel: use non-racy method for proc entries creation 2008-04-29 08:06:22 -07:00
lockdep.c Subject: lockdep: include all lock classes in all_lock_classes 2008-02-25 23:03:02 +01:00
Makefile sched: add optional support for CONFIG_HAVE_UNSTABLE_SCHED_CLOCK 2008-05-05 23:56:18 +02:00
marker.c make marker_debug static 2008-04-30 08:29:49 -07:00
module.c modules: proper cleanup of kobject without CONFIG_SYSFS 2008-05-23 13:09:33 +10:00
mutex-debug.c kernel: remove fastcall in kernel/* 2008-02-08 09:22:31 -08:00
mutex-debug.h
mutex.c kernel: remove fastcall in kernel/* 2008-02-08 09:22:31 -08:00
mutex.h
notifier.c ipc: re-enable msgmni automatic recomputing msgmni if set to negative 2008-04-29 08:06:13 -07:00
ns_cgroup.c cgroups: kernel/ns_cgroup.c should #include <linux/nsproxy.h> 2008-04-29 08:06:07 -07:00
nsproxy.c ipc: sysvsem: refuse clone(CLONE_SYSVSEM|CLONE_NEWIPC) 2008-04-29 08:06:14 -07:00
panic.c Taint kernel after WARN_ON(condition) 2008-04-29 08:05:59 -07:00
params.c Add new string functions strict_strto* and convert kernel params to use them 2008-02-08 09:22:41 -08:00
pid_namespace.c pidns: make pid->level and pid_ns->level unsigned 2008-04-30 08:29:49 -07:00
pid.c pids: introduce change_pid() helper 2008-04-30 08:29:48 -07:00
pm_qos_params.c pm qos infrastructure and interface 2008-02-05 09:44:22 -08:00
posix-cpu-timers.c remove div_long_long_rem 2008-05-01 08:03:58 -07:00
posix-timers.c signals: join send_sigqueue() with send_group_sigqueue() 2008-04-30 08:29:36 -07:00
printk.c printk: don't read beyond string arguments' terminating zero 2008-04-30 08:29:52 -07:00
profile.c kernel: use non-racy method for proc entries creation 2008-04-29 08:06:22 -07:00
ptrace.c make generic sys_ptrace unconditional 2008-05-01 10:21:54 -07:00
rcuclassic.c rcu: fix hotplug vs rcu race 2008-07-01 09:27:17 +02:00
rcupdate.c rcupdate: fix comment 2008-02-13 16:21:18 -08:00
rcupreempt_trace.c Preempt-RCU: implementation 2008-01-25 21:08:24 +01:00
rcupreempt.c rcupreempt: remove export of rcu_batches_completed_bh 2008-06-19 09:45:37 +02:00
rcutorture.c kernel: explicitly include required header files under kernel/ 2008-04-29 08:06:04 -07:00
relay.c splice: fix sendfile() issue with relay 2008-05-28 14:49:27 +02:00
res_counter.c memcgroup: add the max_usage member on the res_counter 2008-04-29 08:06:10 -07:00
resource.c kernel: use non-racy method for proc entries creation 2008-04-29 08:06:22 -07:00
rtmutex_common.h Don't operate with pid_t in rtmutex tester 2008-02-08 09:22:41 -08:00
rtmutex-debug.c Don't operate with pid_t in rtmutex tester 2008-02-08 09:22:41 -08:00
rtmutex-debug.h
rtmutex-tester.c Driver core: change sysdev classes to use dynamic kobject names 2008-01-24 20:40:40 -08:00
rtmutex.c hrtimer: more hrtimer_init_sleeper() fallout. 2008-02-13 15:45:36 +01:00
rtmutex.h
rwsem.c sched: mark rwsem functions as __sched for wchan/profiling 2007-12-18 15:21:13 +01:00
sched_clock.c sched: fix sched_clock_cpu() 2008-05-29 11:29:19 +02:00
sched_debug.c revert ("sched: fair-group: SMP-nice for group scheduling") 2008-05-29 11:28:57 +02:00
sched_fair.c sched: stop wake_affine from causing serious imbalance 2008-05-29 11:29:20 +02:00
sched_features.h sched: /debug/sched_features 2008-04-19 19:45:00 +02:00
sched_idletask.c sched: make rt_sched_class, idle_sched_class static 2008-05-05 23:56:17 +02:00
sched_rt.c sched: rt-group: fix RR buglet 2008-06-19 09:06:59 +02:00
sched_stats.h sched, delay accounting: fix incorrect delay time when constantly waiting on runqueue 2008-06-19 14:15:28 +02:00
sched.c Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2008-06-20 12:37:13 -07:00
seccomp.c
semaphore.c Revert "semaphore: fix" 2008-05-10 20:43:22 -07:00
signal.c posix timers: discard SI_TIMER signals on exec 2008-05-26 10:37:07 -07:00
softirq.c Fix cpu hotplug problem in softirq code 2008-05-01 08:03:58 -07:00
softlockup.c softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression 2008-06-19 09:45:38 +02:00
spinlock.c spinlock: lockbreak cleanup 2008-01-30 13:31:20 +01:00
srcu.c make srcu_readers_active() static 2008-02-06 10:41:02 -08:00
stacktrace.c
stop_machine.c stop_machine: make stop_machine_run more virtualization friendly 2008-05-23 13:09:34 +10:00
sys_ni.c timerfd: new timerfd API 2008-02-05 09:44:07 -08:00
sys.c sys_prctl(): fix return of uninitialized value 2008-05-24 09:56:13 -07:00
sysctl_check.c constify tables in kernel/sysctl_check.c 2008-02-08 09:22:31 -08:00
sysctl.c [PATCH] avoid multiplication overflows and signedness issues for max_fds 2008-05-16 17:22:52 -04:00
taskstats.c Use find_task_by_vpid in taskstats 2008-04-30 08:29:48 -07:00
test_kprobes.c kprobes: kretprobe user entry-handler 2008-02-06 10:41:11 -08:00
time.c Make constants in kernel/timeconst.h fixed 64 bits 2008-05-02 16:18:42 -07:00
timeconst.pl Make constants in kernel/timeconst.h fixed 64 bits 2008-05-02 16:18:42 -07:00
timer.c debugobjects: add timer specific object debugging code 2008-04-30 08:29:53 -07:00
tsacct.c Add scaled time to taskstats based process accounting 2007-10-18 14:37:28 -07:00
uid16.c asmlinkage_protect replaces prevent_tail_call 2008-04-10 17:28:26 -07:00
user_namespace.c eCryptfs: make key module subsystem respect namespaces 2008-04-29 08:06:07 -07:00
user.c alloc_uid: cleanup 2008-04-30 08:29:53 -07:00
utsname_sysctl.c Isolate the UTS namespace's domainname and hostname back 2007-11-29 09:24:53 -08:00
utsname.c kernel: explicitly include required header files under kernel/ 2008-04-29 08:06:04 -07:00
wait.c kernel: remove fastcall in kernel/* 2008-02-08 09:22:31 -08:00
workqueue.c workqueue: remove redundant function invocation 2008-05-01 08:04:02 -07:00