linux/drivers/acpi
Ingo Molnar 0888f06ac9 [PATCH] sched: fix bad missed wakeups in the i386, x86_64, ia64, ACPI and APM idle code
Fernando Lopez-Lezcano reported frequent scheduling latencies and audio
xruns starting at the 2.6.18-rt kernel, and those problems persisted all
the way up to current -rt kernels. The latencies were serious and
unjustified by
system load, often in the milliseconds range.

After a patient and heroic multi-month effort by Fernando, in which he
tested dozens of kernels, tried various configs, boot options and
test-patches of mine, and provided latency traces of those incidents, he
captured the following 'smoking gun' trace:

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
  IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup (try_to_wake_up)
  IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup <<...>-5856> (37 0)
  IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup (c01262ba 0 0)
  IRQ_19-1479  1D..1    0us : resched_task (try_to_wake_up)
  IRQ_19-1479  1D..1    0us : __spin_unlock_irqrestore (try_to_wake_up)
  ...
  <idle>-0     1...1   11us!: default_idle (cpu_idle)
  ...
  <idle>-0     0Dn.1  602us : smp_apic_timer_interrupt (c0103baf 1 0)
  ...
   <...>-5856  0D..2  618us : __switch_to (__schedule)
   <...>-5856  0D..2  618us : __schedule <<idle>-0> (20 162)
   <...>-5856  0D..2  619us : __spin_unlock_irq (__schedule)
   <...>-5856  0...1  619us : trace_stop_sched_switched (__schedule)
   <...>-5856  0D..1  619us : trace_stop_sched_switched <<...>-5856> (37 0)

what is visible in this trace is that CPU#1 ran try_to_wake_up() for
PID:5856, placed PID:5856 on CPU#0's runqueue and ran resched_task()
for CPU#0. But it decided not to send an IPI to that CPU, due to
TS_POLLING. CPU#0 then never woke up after its NEED_RESCHED bit was set,
and only rescheduled to PID:5856 upon the next lapic timer IRQ. The
result was a 600+ usecs latency and a missed wakeup!

the bug turned out to be an idle-wakeup bug introduced into the mainline
kernel this summer via an optimization in the x86_64 tree:

    commit 495ab9c045
    Author: Andi Kleen <ak@suse.de>
    Date:   Mon Jun 26 13:59:11 2006 +0200

    [PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status

    During some profiling I noticed that default_idle causes a lot of
    memory traffic. I think that is caused by the atomic operations
    to clear/set the polling flag in thread_info. There is actually
    no reason to make this atomic - only the idle thread does it
    to itself, other CPUs only read it. So I moved it into ti->status.

the problem is this type of change:

        if (!hlt_counter && boot_cpu_data.hlt_works_ok) {
-               clear_thread_flag(TIF_POLLING_NRFLAG);
+               current_thread_info()->status &= ~TS_POLLING;
                smp_mb__after_clear_bit();
                while (!need_resched()) {
                        local_irq_disable();

this changes clear_thread_flag() to an explicit clearing of TS_POLLING.
clear_thread_flag() is defined as:

        clear_bit(flag, &ti->flags);

and clear_bit() is a LOCK-ed atomic instruction on all x86 platforms:

  static inline void clear_bit(int nr, volatile unsigned long * addr)
  {
          __asm__ __volatile__( LOCK_PREFIX
                  "btrl %1,%0"

a LOCK-prefixed instruction already acts as a full memory barrier on
x86, hence smp_mb__after_clear_bit() is defined as a simple compiler
barrier:

  #define smp_mb__after_clear_bit()       barrier()

but the explicit TS_POLLING clearing introduced by the patch:

+               current_thread_info()->status &= ~TS_POLLING;

is not an atomic op! It is a plain read-modify-write with no implied
memory barrier, so the clearing of the TS_POLLING bit is freely
reorderable with the reading of the NEED_RESCHED bit, since the two now
reside at different memory addresses.

CPU idle wakeup very much depends on ordered memory ops: the clearing of
the TS_POLLING flag must always be done before we test need_resched()
and hit the idle instruction(s). [Symmetrically, the wakeup code needs
to set NEED_RESCHED before it tests the TS_POLLING flag, so memory
ordering is paramount.]

Fernando's dual-core Athlon64 system reorders memory operations
aggressively enough that it triggered this scenario very often.

( And it also turned out that the reason why these latencies never
  triggered on my test systems is that I routinely use idle=poll, which
  was the only idle variant not affected by this bug. )

The fix is to change the smp_mb__after_clear_bit() to an smp_mb(), to
act as an absolute barrier between the TS_POLLING write and the
NEED_RESCHED read. This affects almost all idling methods (default,
ACPI, APM), on all three affected architectures: i386, x86_64, ia64.
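
The ordering requirement can be sketched in portable C11 rather than
with the kernel's own primitives. Everything below is an illustrative
assumption, not kernel code: struct idle_cpu, idle_may_halt() and
waker_must_send_ipi() are hypothetical names, the two booleans stand in
for TS_POLLING and NEED_RESCHED, and atomic_thread_fence() with
memory_order_seq_cst plays the role that smp_mb() plays in the fix.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-CPU state; stands in for TS_POLLING and NEED_RESCHED. */
struct idle_cpu {
	atomic_bool polling;      /* written only by the idle CPU itself */
	atomic_bool need_resched; /* set by a remote waker */
};

static void idle_cpu_init(struct idle_cpu *c)
{
	atomic_init(&c->polling, true);
	atomic_init(&c->need_resched, false);
}

/* Idle side: clear "polling", full barrier, then test need_resched.
 * Returns true if the CPU may enter the idle instruction. */
static bool idle_may_halt(struct idle_cpu *c)
{
	atomic_store_explicit(&c->polling, false, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst); /* analogue of smp_mb() */
	return !atomic_load_explicit(&c->need_resched, memory_order_relaxed);
}

/* Waker side: set need_resched, full barrier, then test "polling".
 * Returns true if the target is not polling and needs an IPI. */
static bool waker_must_send_ipi(struct idle_cpu *c)
{
	atomic_store_explicit(&c->need_resched, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst); /* analogue of smp_mb() */
	return !atomic_load_explicit(&c->polling, memory_order_relaxed);
}

/* Single-threaded walk through the two sequential orders. */
static void walkthrough(void)
{
	struct idle_cpu a, b;

	idle_cpu_init(&a);
	assert(!waker_must_send_ipi(&a)); /* target still polling: no IPI */
	assert(!idle_may_halt(&a));       /* wakeup is visible: do not halt */

	idle_cpu_init(&b);
	assert(idle_may_halt(&b));        /* nothing pending: may halt */
	assert(waker_must_send_ipi(&b));  /* target not polling: send IPI */
}
```

With a full fence between the store and the load on both sides, at
least one side is guaranteed to observe the other's store, so the
combination "idle CPU halts" and "waker skips the IPI" cannot occur.
Removing the fence on the idle side would permit exactly that outcome,
which is the missed wakeup described above.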

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22 08:55:51 -08:00
dispatcher ACPI: ACPICA 20060707 2006-07-09 15:15:40 -04:00
events Pull bugzilla-5534 into test branch 2006-10-14 02:26:42 -04:00
executer Pull acpica-20060707 into test branch 2006-07-10 02:39:41 -04:00
hardware ACPI: Allow setting SCI_EN bit in PM1_CONTROL register 2006-10-10 17:14:44 -07:00
namespace ACPI: ACPICA 20060707 2006-07-09 15:15:40 -04:00
parser ACPI: acpi_os_allocate() fixes 2006-07-10 02:37:22 -04:00
resources ACPI: ACPICA 20060526 2006-06-14 02:44:35 -04:00
sleep ACPI: add 'const' to several ACPI file_operations 2006-07-10 00:04:29 -04:00
tables ACPI: fix printk format warnings 2006-10-14 01:59:46 -04:00
utilities Pull acpi_os_allocate into test branch 2006-07-10 02:39:47 -04:00
ac.c ACPI: fix boot with acpi=off 2006-08-15 23:16:43 -04:00
acpi_memhotplug.c [PATCH] acpi memory hotplug: remove strange add_memory fail message 2006-10-20 10:26:38 -07:00
asus_acpi.c ACPI: asus_acpi: don't printk on writing garbage to proc files 2006-10-14 02:03:49 -04:00
battery.c ACPI: check battery status on resume for un/plug events during sleep 2006-10-14 02:22:51 -04:00
blacklist.c [PATCH] x86_64: Clean up and tweak ACPI blacklist year code 2006-03-25 09:10:54 -08:00
bus.c ACPI: add message if firmware_register() init fails 2006-08-15 23:27:38 -04:00
button.c ACPI: add 'const' to several ACPI file_operations 2006-07-10 00:04:29 -04:00
cm_sbs.c [PATCH] acpi NULL noise removal 2006-10-10 15:37:22 -07:00
container.c ACPI: delete acpi_os_free(), use kfree() directly 2006-06-30 03:19:10 -04:00
debug.c ACPI: delete tracing macros from drivers/acpi/*.c 2006-06-27 00:41:40 -04:00
dock.c [PATCH] severing fs.h, radix-tree.h -> sched.h 2006-12-04 02:00:24 -05:00
ec.c ACPI: EC: export ec_transaction() for msi-laptop driver 2006-10-14 00:49:56 -04:00
event.c ACPI: add 'const' to several ACPI file_operations 2006-07-10 00:04:29 -04:00
fan.c ACPI: add 'const' to several ACPI file_operations 2006-07-10 00:04:29 -04:00
glue.c ACPI: Change ACPI to use dev_archdata instead of firmware_data 2006-12-01 14:52:01 -08:00
hotkey.c ACPI: hotkey.c fixes, fix for potential crash of hotkey.c 2006-08-16 18:08:06 -04:00
i2c_ec.c i2c: Constify i2c_algorithm declarations, part 2 2006-09-26 15:38:52 -07:00
i2c_ec.h ACPI: add support for Smart Battery 2006-07-01 16:36:14 -04:00
ibm_acpi.c Driver core: proper prototype for drivers/base/init.c:driver_init() 2006-12-20 10:56:45 -08:00
Kconfig fix drivers/acpi/Kconfig typos 2006-10-03 22:24:43 +02:00
Makefile Revert "Revert "ACPI: dock driver"" 2006-07-09 17:22:28 -04:00
motherboard.c ACPI: update comments in motherboard.c 2006-10-14 01:56:27 -04:00
numa.c ACPI: remove function tracing macros from drivers/acpi/*.c 2006-07-01 16:48:23 -04:00
osl.c WorkStruct: Pass the work_struct pointer instead of context data 2006-11-22 14:55:48 +00:00
pci_bind.c ACPI: delete tracing macros from drivers/acpi/*.c 2006-06-27 00:41:40 -04:00
pci_irq.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial 2006-06-30 15:39:30 -07:00
pci_link.c ACPI: acpi_pci_link_set() can allocate with either GFP_ATOMIC or GFP_KERNEL 2006-10-14 01:54:21 -04:00
pci_root.c ACPI: pci_root: Remove unneeded acpi_handle from driver. 2006-06-30 02:51:34 -04:00
power.c ACPI: fix potential OOPS in power driver with CONFIG_ACPI_DEBUG 2006-10-14 01:54:21 -04:00
processor_core.c ACPI: fix section for CPU init functions 2006-10-14 01:58:38 -04:00
processor_idle.c [PATCH] sched: fix bad missed wakeups in the i386, x86_64, ia64, ACPI and APM idle code 2006-12-22 08:55:51 -08:00
processor_perflib.c [PATCH] Correct bound checking from the value returned from _PPC method. 2006-11-23 09:18:55 -08:00
processor_thermal.c ACPI: delete tracing macros from drivers/acpi/*.c 2006-06-27 00:41:40 -04:00
processor_throttling.c ACPI: delete tracing macros from drivers/acpi/*.c 2006-06-27 00:41:40 -04:00
sbs.c ACPI: sbs: fix module_param() initializers 2006-10-14 00:34:00 -04:00
scan.c ACPI: verbose on kset/kobject_register errors 2006-08-15 23:32:24 -04:00
system.c ACPI: add 'const' to several ACPI file_operations 2006-07-10 00:04:29 -04:00
tables.c Remove obsolete #include <linux/config.h> 2006-06-30 19:25:36 +02:00
thermal.c Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 2006-07-10 15:14:38 -07:00
toshiba_acpi.c [ACPI] Lindent all ACPI files 2005-08-05 00:45:14 -04:00
utils.c ACPI: avoid irqrouter_resume might_sleep oops on resume from S4 2006-08-16 19:23:00 -04:00
video.c Pull acpi_device_handle_cleanup into release branch 2006-07-01 17:19:34 -04:00