linux/kernel
Linus Torvalds 00845eb968 sched: don't cause task state changes in nested sleep debugging
Commit 8eb23b9f35 ("sched: Debug nested sleeps") added code to report
on nested sleep conditions, which we generally want to avoid because the
inner sleeping operation can re-set the thread state to TASK_RUNNING,
which then causes the outer sleep loop to not actually sleep when it
calls schedule().

However, that's actually valid traditional behavior, with the inner
sleep being some fairly rare case (like taking a sleeping lock that
normally doesn't actually need to sleep).

And the debug code would actually change the state of the task to
TASK_RUNNING internally, which makes that kind of traditional and
working code not work at all, because now the nested sleep doesn't just
sometimes cause the outer one to not block, but will cause it to happen
every time.
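
The difference between "sometimes" and "every time" can be sketched with a
small user-space model. This is only an illustration, not kernel code: the
sim_* names, the counters and the "contended_every" parameter are all
hypothetical, and the real scheduler is far more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel task states. */
enum sim_state { SIM_TASK_RUNNING, SIM_TASK_INTERRUPTIBLE };

struct sim_task {
	enum sim_state state;
	int blocked;	/* schedule() calls that actually slept */
	int spun;	/* schedule() calls that returned immediately */
};

/* Model of schedule(): a task in the running state does not block. */
static void sim_schedule(struct sim_task *t)
{
	if (t->state == SIM_TASK_RUNNING) {
		t->spun++;
		return;
	}
	t->blocked++;
	t->state = SIM_TASK_RUNNING;	/* as if woken up later */
}

/*
 * Model of an outer sleep loop with a nested sleeping operation.
 * The inner operation really sleeps only on every contended_every-th
 * pass (0 means never). If debug_forces_running is set, the state is
 * reset on every pass, which models the behavior this commit removes.
 */
static void sim_outer_loop(struct sim_task *t, int iterations,
			   int contended_every, bool debug_forces_running)
{
	assert(iterations >= 0);

	for (int i = 1; i <= iterations; i++) {
		t->state = SIM_TASK_INTERRUPTIBLE;

		bool inner_slept = contended_every > 0 &&
				   i % contended_every == 0;
		if (inner_slept || debug_forces_running)
			t->state = SIM_TASK_RUNNING;

		sim_schedule(t);
	}
}
```

In this model, a loop whose inner operation rarely sleeps blocks on most
passes, while the same loop with the forced reset never blocks and just
spins.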

In particular, it will cause the cardbus kernel daemon (pccardd) to
basically busy-loop doing scheduling, converting a laptop into a heater,
as reported by Bruno Prémont.  But there may be other legacy uses of
that nested sleep model in other drivers that are also likely to never
get converted to the new model.

This fixes both cases:

 - don't set TASK_RUNNING when the nested condition happens (note: even
   if WARN_ONCE() only _warns_ once, the return value isn't whether the
   warning happened, but whether the condition for the warning was true.
   So despite the warning only happening once, the "if (WARN_ON(..))"
   would trigger for every nested sleep).

 - in the cases where we knowingly disable the warning by using
   "sched_annotate_sleep()", don't change the task state (that is used
   for all core scheduling decisions), instead use '->task_state_change'
   that is used for the debugging decision itself.
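
The WARN_ONCE() point in the first item can be illustrated with a minimal
user-space sketch. The helper below is a hypothetical stand-in written for
this example, not the kernel macro; it only mirrors the property described
above, namely that the message is printed once but the return value follows
the condition on every call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Number of warnings actually printed by the stand-in below. */
static int warnings_printed;

/*
 * Hypothetical stand-in for WARN_ONCE(): prints at most once, but
 * always returns whether the condition was true.
 */
static bool warn_once_sim(bool cond, const char *msg)
{
	static bool warned;

	if (cond && !warned) {
		warned = true;
		warnings_printed++;
		fprintf(stderr, "WARNING: %s\n", msg);
	}
	return cond;
}

/* Count how often an "if (warn_once_sim(...))" body runs in n passes. */
static int count_body_runs(int n)
{
	int runs = 0;

	assert(n >= 0);

	for (int i = 0; i < n; i++) {
		if (warn_once_sim(true, "nested sleep"))
			runs++;	/* runs on every pass, not just the first */
	}
	return runs;
}
```

With five passes, the body guarded by the check runs five times even though
only one warning is printed, which is why any side effect placed in that
body happens on every nested sleep.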

(Credit for the second part of the fix goes to Oleg Nesterov: "Can't we
avoid this subtle change in behaviour DEBUG_ATOMIC_SLEEP adds?" with the
suggested change to use 'task_state_change' as part of the test)

Reported-and-bisected-by: Bruno Prémont <bonbons@linux-vserver.org>
Tested-by: Rafael J Wysocki <rjw@rjwysocki.net>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ilya Dryomov <ilya.dryomov@inktank.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-01 12:23:32 -08:00
bpf Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-01-27 13:55:36 -08:00
configs
debug Surprising number of fixes this merge window :( 2015-01-23 06:40:36 +12:00
events perf: Tighten (and fix) the grouping condition 2015-01-28 13:17:35 +01:00
gcov gcov: enable GCOV_PROFILE_ALL from ARCH Kconfigs 2014-12-13 12:42:51 -08:00
irq genirq: Prevent proc race against freeing of irq descriptors 2014-12-13 13:33:07 +01:00
locking mutex: Always clear owner field upon mutex_unlock() 2015-01-09 11:20:39 +01:00
power PM: Eliminate CONFIG_PM_RUNTIME 2014-12-19 22:55:06 +01:00
printk This code is a fork from the trace-3.19 pull as it needed the trace_seq 2014-12-13 14:04:41 -08:00
rcu Merge branches 'torture.2014.11.03a', 'cpu.2014.11.03a', 'doc.2014.11.13a', 'fixes.2014.11.13a', 'signal.2014.10.29a' and 'rt.2014.10.29a' into HEAD 2014-11-13 10:39:04 -08:00
sched sched: don't cause task state changes in nested sleep debugging 2015-02-01 12:23:32 -08:00
time Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-01-25 17:47:34 -08:00
trace This holds a few fixes to the ftrace infrastructure as well as 2015-01-17 07:55:52 +13:00
.gitignore
acct.c acct: eliminate compile warning 2014-10-09 22:26:04 -04:00
async.c kernel/async.c: switch to pr_foo() 2014-10-09 22:26:04 -04:00
audit_tree.c fsnotify: unify inode and mount marks handling 2014-12-13 12:42:53 -08:00
audit_watch.c audit: invalid op= values for rules 2014-09-23 16:37:53 -04:00
audit.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-12-30 10:45:47 -08:00
audit.h audit: reduce scope of audit_log_fcaps 2014-09-23 16:37:51 -04:00
auditfilter.c Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit 2014-12-23 18:13:16 -08:00
auditsc.c Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit 2014-12-31 14:52:18 -08:00
backtracetest.c
bounds.c
capability.c
cgroup_freezer.c
cgroup.c cgroup: prevent mount hang due to memory controller lifetime 2015-01-22 10:26:43 -05:00
compat.c compat: nanosleep: Clarify error handling 2014-09-06 12:58:18 +02:00
configs.c
context_tracking.c sched: stop the unbound recursion in preempt_schedule_context() 2014-10-28 10:46:05 +01:00
cpu_pm.c
cpu.c cpu: Avoid puts_pending overflow 2014-11-03 19:21:01 -08:00
cpuset.c Merge branch 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2014-12-11 18:57:19 -08:00
crash_dump.c crash_dump: Make is_kdump_kernel() accessible from modules 2014-08-25 15:42:19 -07:00
cred.c
delayacct.c
dma.c
elfcore.c
exec_domain.c
exit.c exit: fix race between wait_consider_task() and wait_task_zombie() 2015-01-08 15:10:51 -08:00
extable.c ftrace/x86/extable: Add is_ftrace_trampoline() function 2014-11-19 15:25:26 -05:00
fork.c mm: use new helper functions around the i_mmap_mutex 2014-12-13 12:42:45 -08:00
freezer.c freezer: remove obsolete comments in __thaw_task() 2014-10-21 23:44:20 +02:00
futex_compat.c
futex.c futex: Fix a race condition between REQUEUE_PI and task death 2014-10-26 16:16:18 +01:00
groups.c userns: Don't allow setgroups until a gid mapping has been setablished 2014-12-09 16:58:40 -06:00
hung_task.c
irq_work.c percpu: Convert remaining __get_cpu_var uses in 3.18-rcX 2014-10-29 11:18:18 -04:00
jump_label.c
kallsyms.c kernel/kallsyms.c: use __seq_open_private() 2014-10-14 02:18:16 +02:00
kcmp.c kcmp: fix standard comparison bug 2014-09-10 15:42:12 -07:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kexec.c kexec: remove unnecessary KERN_ERR from kexec.c 2014-12-13 12:42:51 -08:00
kmod.c usermodehelper: kill the kmod_thread_locker logic 2014-12-10 17:41:17 -08:00
kprobes.c module: remove mod arg from module_free, rename module_memfree(). 2015-01-20 11:38:33 +10:30
ksysfs.c
kthread.c kernel/kthread.c: partial revert of 81c98869fa ("kthread: ensure locality of task_struct allocations") 2014-10-09 22:25:51 -04:00
latencytop.c
Makefile kernel: res_counter: remove the unused API 2014-12-10 17:41:04 -08:00
module_signing.c
module-internal.h
module.c module: make module_refcount() a signed integer. 2015-01-22 11:15:54 +10:30
notifier.c
nsproxy.c bury struct proc_ns in fs/proc 2014-12-04 14:34:54 -05:00
padata.c
panic.c kernel: add panic_on_warn 2014-12-10 17:41:10 -08:00
params.c param: fix uninitialized read with CONFIG_DEBUG_LOCK_ALLOC 2015-01-20 11:38:31 +10:30
pid_namespace.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-12-16 15:53:03 -08:00
pid.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-12-16 15:53:03 -08:00
profile.c
ptrace.c exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent() 2014-12-10 17:41:10 -08:00
range.c kernel: avoid overflow in cmp_range 2015-01-17 10:02:23 +13:00
reboot.c kernel: add support for kernel restart handler call chain 2014-09-26 00:00:06 -07:00
relay.c
resource.c x86: optimize resource lookups for ioremap 2014-10-14 02:18:22 +02:00
seccomp.c Merge branch 'x86-seccomp-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-10-14 02:27:06 +02:00
signal.c Merge branch 'x86-mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-12-10 09:34:43 -08:00
smp.c Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2014-10-15 07:48:18 +02:00
smpboot.c sched, smp: Correctly deal with nested sleeps 2014-10-28 10:56:24 +01:00
smpboot.h
softirq.c rcu: Remove "cpu" argument to rcu_note_context_switch() 2014-11-03 19:20:34 -08:00
stacktrace.c stacktrace: introduce snprint_stack_trace for buffer output 2014-12-13 12:42:48 -08:00
stop_machine.c
sys_ni.c syscalls: implement execveat() system call 2014-12-13 12:42:51 -08:00
sys.c x86, mpx: Strictly enforce empty prctl() args 2015-01-22 21:11:06 +01:00
sysctl_binary.c kernel: add panic_on_warn 2014-12-10 17:41:10 -08:00
sysctl.c As the merge window is still open, and this code was not as complex 2014-12-16 12:53:59 -08:00
system_certificates.S
system_keyring.c
task_work.c
taskstats.c kill f_dentry uses 2014-11-19 13:01:25 -05:00
test_kprobes.c
torture.c torture: Address race in module cleanup 2014-09-16 13:41:06 -07:00
tracepoint.c
tsacct.c
uid16.c groups: Consolidate the setgroups permission checks 2014-12-05 17:19:27 -06:00
up.c
user_namespace.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2014-12-17 12:31:40 -08:00
user-return-notifier.c scheduler: Replace __get_cpu_var with this_cpu_ptr 2014-08-26 13:45:45 -04:00
user.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2014-12-17 12:31:40 -08:00
utsname_sysctl.c
utsname.c copy address of proc_ns_ops into ns_common 2014-12-04 14:34:47 -05:00
watchdog.c Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2014-10-15 07:48:18 +02:00
workqueue_internal.h
workqueue.c workqueue: fix subtle pool management issue which can stall whole worker_pool 2015-01-16 14:21:16 -05:00