linux/kernel
Linus Torvalds fa490cfd15 Fix possible runqueue lock starvation in wait_task_inactive()
Miklos Szeredi reported very long pauses (several seconds, sometimes
more) on his T60 (with a Core2Duo) which he managed to track down to
wait_task_inactive()'s open-coded busy-loop.

He observed that an interrupt on one core tries to acquire the
runqueue-lock but does not succeed in doing so for a very long time -
while wait_task_inactive() on the other core loops waiting for the first
core to deschedule a task (which it wont do while spinning in an
interrupt handler).

This rewrites wait_task_inactive() to do all its waiting optimistically
without any locks taken at all, and then just double-check the end
result with the proper runqueue lock held over just a very short
section.  If there were races in the optimistic wait, of a preemption
event scheduled the process away, we simply re-synchronize, and start
over.

So the code now looks like this:

	repeat:
		/* Unlocked, optimistic looping! */
		rq = task_rq(p);
		while (task_running(rq, p))
			cpu_relax();

		/* Get the *real* values */
		rq = task_rq_lock(p, &flags);
		running = task_running(rq, p);
		array = p->array;
		task_rq_unlock(rq, &flags);

		/* Check them.. */
		if (unlikely(running)) {
			cpu_relax();
			goto repeat;
		}

		/* Preempted away? Yield if so.. */
		if (unlikely(array)) {
			yield();
			goto repeat;
		}

Basically, that first "while()" loop is done entirely without any
locking at all (and doesn't check for the case where the target process
might have been preempted away), and so it's possibly "incorrect", but
we don't really care.  Both the runqueue used, and the "task_running()"
check might be the wrong tests, but they won't oops - they just mean
that we could possibly get the wrong results due to lack of locking and
exit the loop early in the case of a race condition.

So once we've exited the loop, we then get the proper (and careful) rq
lock, and check the running/runnable state _safely_.  And if it turns
out that our quick-and-dirty and unsafe loop was wrong after all, we
just go back and try it all again.

(The patch also adds a lot of comments, which is the actual bulk of it
all, to make it more obvious why we can do these things without holding
the locks).

Thanks to Miklos for all the testing and tracking it down.

Tested-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-18 11:52:55 -07:00
..
irq Fix crash with irqpoll due to the IRQF_IRQPOLL flag testing 2007-05-24 08:37:14 -07:00
power swsusp: Fix userland interface 2007-06-16 13:16:15 -07:00
time timer stats: speedups 2007-06-01 08:18:30 -07:00
.gitignore
acct.c [PATCH] kernel: change uses of f_{dentry, vfsmnt} to use f_path 2006-12-08 08:28:42 -08:00
audit.c audit: add spaces on either side of case "..." operator. 2007-05-08 11:15:09 -07:00
audit.h [PATCH] audit signal recipients 2007-05-11 05:38:25 -04:00
auditfilter.c audit_match_signal() and friends are used only if CONFIG_AUDITSYSCALL is set 2007-05-15 18:56:37 -07:00
auditsc.c [PATCH] Abnormal End of Processes 2007-05-11 05:38:26 -04:00
capability.c [PATCH] pid: replace do/while_each_task_pid with do/while_each_pid_task 2007-02-12 09:48:32 -08:00
compat.c signal/timer/event: timerfd compat code 2007-05-11 08:29:36 -07:00
configs.c use simple_read_from_buffer in kernel/ 2007-05-09 12:30:49 -07:00
cpu.c microcode: use suspend-related CPU hotplug notifications 2007-05-09 12:30:56 -07:00
cpuset.c cpuset: zero malloc - fix for old cpusets 2007-06-16 13:16:15 -07:00
delayacct.c KMEM_CACHE(): simplify slab cache creation 2007-05-07 12:12:55 -07:00
die_notifier.c move die notifier handling to common code 2007-05-08 11:15:04 -07:00
dma.c [PATCH] struct seq_operations and struct file_operations constification 2006-12-07 08:39:46 -08:00
exec_domain.c Remove obsolete #include <linux/config.h> 2006-06-30 19:25:36 +02:00
exit.c pi-futex: fix exit races and locking problems 2007-06-08 17:23:34 -07:00
extable.c
fork.c freezer: fix vfork problem 2007-05-23 20:14:11 -07:00
futex_compat.c Revert "futex_requeue_pi optimization" 2007-06-18 09:48:41 -07:00
futex.c Revert "futex_requeue_pi optimization" 2007-06-18 09:48:41 -07:00
hrtimer.c Add suspend-related notifications for CPU hotplug 2007-05-09 12:30:56 -07:00
itimer.c The scheduled -EINVAL for invalid timevals in setitimer 2007-05-08 11:15:13 -07:00
kallsyms.c fix possible null ptr deref in kallsyms_lookup 2007-05-30 10:51:38 -07:00
Kconfig.hz [PATCH] HZ: 300Hz support 2006-12-07 08:39:36 -08:00
Kconfig.preempt Fix trivial typos in Kconfig* files 2007-05-09 07:12:20 +02:00
kexec.c kdump/kexec: calculate note size at compile time 2007-05-08 11:15:07 -07:00
kfifo.c [PATCH] Numerous fixes to kernel-doc info in source files. 2007-02-11 10:51:32 -08:00
kmod.c wait_for_helper: remove unneeded do_sigaction() 2007-05-09 12:30:53 -07:00
kprobes.c Kprobes: The ON/OFF knob thru debugfs 2007-05-08 11:15:19 -07:00
ksysfs.c remove "struct subsystem" as it is no longer needed 2007-05-02 18:57:59 -07:00
kthread.c freezer: fix kthread_create vs freezer theoretical race 2007-05-23 20:14:11 -07:00
latency.c [PATCH] severing module.h->sched.h 2006-12-04 02:00:22 -05:00
lockdep_internals.h [PATCH] lockdep: more chains 2006-12-07 08:39:43 -08:00
lockdep_proc.c [PATCH] remove many unneeded #includes of sched.h 2007-02-14 08:09:54 -08:00
lockdep.c lockdep: removed unused ip argument in mark_lock & mark_held_locks 2007-05-08 11:15:13 -07:00
Makefile move die notifier handling to common code 2007-05-08 11:15:04 -07:00
module.c Fix minor typoes in kernel/module.c 2007-05-09 07:26:28 +02:00
mutex-debug.c [PATCH] remove many unneeded #includes of sched.h 2007-02-14 08:09:54 -08:00
mutex-debug.h [PATCH] lockdep: better lock debugging 2006-07-03 15:27:01 -07:00
mutex.c wrap access to thread_info 2007-05-09 12:30:56 -07:00
mutex.h [PATCH] lockdep: prove mutex locking correctness 2006-07-03 15:27:04 -07:00
nsproxy.c Merge sys_clone()/sys_unshare() nsproxy and namespace handling 2007-05-08 11:15:00 -07:00
panic.c [PATCH] Add TAINT_USER and ability to set taint flags from userspace 2007-02-11 10:51:29 -08:00
params.c kernel/params.c: fix lying comment for param_array() 2007-05-08 11:15:08 -07:00
pid.c statically initialize struct pid for swapper 2007-05-11 08:29:35 -07:00
posix-cpu-timers.c Introduce a handy list_first_entry macro 2007-05-08 11:15:11 -07:00
posix-timers.c header cleaning: don't include smp_lock.h when not used 2007-05-08 11:15:07 -07:00
printk.c header cleaning: don't include smp_lock.h when not used 2007-05-08 11:15:07 -07:00
profile.c Detach sched.h from mm.h 2007-05-21 09:18:19 -07:00
ptrace.c [PATCH] auditing ptrace 2007-05-11 05:38:25 -04:00
rcupdate.c Add suspend-related notifications for CPU hotplug 2007-05-09 12:30:56 -07:00
rcutorture.c rcutorture: Remove redundant assignment to cur_ops in for loop 2007-05-08 11:15:17 -07:00
relay.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial 2007-05-09 12:54:17 -07:00
resource.c libata/IDE: remove combined mode quirk 2007-04-28 14:15:59 -04:00
rtmutex_common.h Revert "futex_requeue_pi optimization" 2007-06-18 09:48:41 -07:00
rtmutex-debug.c Remove all inclusions of <linux/config.h> 2006-10-04 03:38:54 -04:00
rtmutex-debug.h [PATCH] lockdep: better lock debugging 2006-07-03 15:27:01 -07:00
rtmutex-tester.c [PATCH] Add include/linux/freezer.h and move definitions from sched.h 2006-12-07 08:39:27 -08:00
rtmutex.c Revert "futex_requeue_pi optimization" 2007-06-18 09:48:41 -07:00
rtmutex.h [PATCH] lockdep: better lock debugging 2006-07-03 15:27:01 -07:00
rwsem.c Lockdep treats down_write_trylock like regular down_write 2007-05-08 11:15:09 -07:00
sched.c Fix possible runqueue lock starvation in wait_task_inactive() 2007-06-18 11:52:55 -07:00
seccomp.c
signal.c Fix signalfd interaction with thread-private signals 2007-06-18 10:18:32 -07:00
softirq.c Add suspend-related notifications for CPU hotplug 2007-05-09 12:30:56 -07:00
softlockup.c Add suspend-related notifications for CPU hotplug 2007-05-09 12:30:56 -07:00
spinlock.c [PATCH] lockdep: spin_lock_irqsave_nested() 2006-11-25 13:28:34 -08:00
srcu.c [PATCH] SRCU: report out-of-memory errors 2006-10-04 07:55:30 -07:00
stacktrace.c [PATCH] lockdep: stacktrace subsystem, core 2006-07-03 15:27:02 -07:00
stop_machine.c stop_machine() now uses hard_irq_disable 2007-05-11 08:29:34 -07:00
sys_ni.c compat signalfd and timerfd are cond syscalls 2007-05-12 10:55:40 -07:00
sys.c attach_pid() with struct pid parameter 2007-05-11 08:29:35 -07:00
sysctl.c make sysctl/kernel/core_pattern and fs/exec.c agree on maximum core filename size 2007-05-17 05:23:05 -07:00
taskstats.c KMEM_CACHE(): simplify slab cache creation 2007-05-07 12:12:55 -07:00
time.c header cleaning: don't include smp_lock.h when not used 2007-05-08 11:15:07 -07:00
timer.c NOHZ: prevent multiplication overflow - stop timer for huge timeouts 2007-05-29 18:11:10 -07:00
tsacct.c [PATCH] time: x86_64: split x86_64/kernel/time.c up 2007-02-16 08:14:00 -08:00
uid16.c header cleaning: don't include smp_lock.h when not used 2007-05-08 11:15:07 -07:00
user.c [PATCH] slab: remove kmem_cache_t 2006-12-07 08:39:25 -08:00
utsname_sysctl.c [PATCH] sysctl: remove insert_at_head from register_sysctl 2007-02-14 08:09:59 -08:00
utsname.c Merge sys_clone()/sys_unshare() nsproxy and namespace handling 2007-05-08 11:15:00 -07:00
wait.c Fix occurrences of "the the " 2007-05-09 08:57:56 +02:00
workqueue.c simplify cleanup_workqueue_thread() 2007-05-23 20:14:13 -07:00