linux/arch/powerpc
Nysal Jan K.A. 734ad0af36 powerpc/qspinlock: Fix deadlock in MCS queue
If an interrupt occurs in queued_spin_lock_slowpath() after we increment
qnodesp->count and before node->lock is initialized, another CPU might
see stale lock values in get_tail_qnode(). If the stale lock value happens
to match the lock on that CPU, then we write to the "next" pointer of
the wrong qnode. This causes a deadlock as the former CPU, once it becomes
the head of the MCS queue, will spin indefinitely until it's "next" pointer
is set by its successor in the queue.

Running stress-ng on a 16 core (16EC/16VP) shared LPAR, results in
occasional lockups similar to the following:

   $ stress-ng --all 128 --vm-bytes 80% --aggressive \
               --maximize --oomable --verify  --syslog \
               --metrics  --times  --timeout 5m

   watchdog: CPU 15 Hard LOCKUP
   ......
   NIP [c0000000000b78f4] queued_spin_lock_slowpath+0x1184/0x1490
   LR [c000000001037c5c] _raw_spin_lock+0x6c/0x90
   Call Trace:
    0xc000002cfffa3bf0 (unreliable)
    _raw_spin_lock+0x6c/0x90
    raw_spin_rq_lock_nested.part.135+0x4c/0xd0
    sched_ttwu_pending+0x60/0x1f0
    __flush_smp_call_function_queue+0x1dc/0x670
    smp_ipi_demux_relaxed+0xa4/0x100
    xive_muxed_ipi_action+0x20/0x40
    __handle_irq_event_percpu+0x80/0x240
    handle_irq_event_percpu+0x2c/0x80
    handle_percpu_irq+0x84/0xd0
    generic_handle_irq+0x54/0x80
    __do_irq+0xac/0x210
    __do_IRQ+0x74/0xd0
    0x0
    do_IRQ+0x8c/0x170
    hardware_interrupt_common_virt+0x29c/0x2a0
   --- interrupt: 500 at queued_spin_lock_slowpath+0x4b8/0x1490
   ......
   NIP [c0000000000b6c28] queued_spin_lock_slowpath+0x4b8/0x1490
   LR [c000000001037c5c] _raw_spin_lock+0x6c/0x90
   --- interrupt: 500
    0xc0000029c1a41d00 (unreliable)
    _raw_spin_lock+0x6c/0x90
    futex_wake+0x100/0x260
    do_futex+0x21c/0x2a0
    sys_futex+0x98/0x270
    system_call_exception+0x14c/0x2f0
    system_call_vectored_common+0x15c/0x2ec

The following code flow illustrates how the deadlock occurs.
For the sake of brevity, assume that both locks (A and B) are
contended and we call the queued_spin_lock_slowpath() function.

        CPU0                                   CPU1
        ----                                   ----
  spin_lock_irqsave(A)                          |
  spin_unlock_irqrestore(A)                     |
    spin_lock(B)                                |
         |                                      |
         ▼                                      |
   id = qnodesp->count++;                       |
  (Note that nodes[0].lock == A)                |
         |                                      |
         ▼                                      |
      Interrupt                                 |
  (happens before "nodes[0].lock = B")          |
         |                                      |
         ▼                                      |
  spin_lock_irqsave(A)                          |
         |                                      |
         ▼                                      |
   id = qnodesp->count++                        |
   nodes[1].lock = A                            |
         |                                      |
         ▼                                      |
  Tail of MCS queue                             |
         |                             spin_lock_irqsave(A)
         ▼                                      |
  Head of MCS queue                             ▼
         |                             CPU0 is previous tail
         ▼                                      |
   Spin indefinitely                            ▼
  (until "nodes[1].next != NULL")      prev = get_tail_qnode(A, CPU0)
                                                |
                                                ▼
                                       prev == &qnodes[CPU0].nodes[0]
                                     (as qnodes[CPU0].nodes[0].lock == A)
                                                |
                                                ▼
                                       WRITE_ONCE(prev->next, node)
                                                |
                                                ▼
                                        Spin indefinitely
                                     (until nodes[0].locked == 1)

Thanks to Saket Kumar Bhaskar for help with recreating the issue

Fixes: 84990b1695 ("powerpc/qspinlock: add mcs queueing for contended waiters")
Cc: stable@vger.kernel.org # v6.2+
Reported-by: Geetika Moolchandani <geetika@linux.ibm.com>
Reported-by: Vaishnavi Bhat <vaish123@in.ibm.com>
Reported-by: Jijo Varghese <vargjijo@in.ibm.com>
Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240829022830.1164355-1-nysal@linux.ibm.com
2024-08-29 15:12:51 +10:00
..
boot powerpc: Remove 40x leftovers 2024-07-12 11:43:41 +10:00
configs powerpc/configs: Update defconfig with now user-visible CONFIG_FSL_IFC 2024-07-04 23:10:40 +10:00
crypto This update includes the following changes: 2024-07-19 08:52:58 -07:00
include powerpc/mm: Fix return type of pgd_val() 2024-08-22 22:52:38 +10:00
kernel powerpc/vdso: Don't discard rela sections 2024-08-22 22:52:29 +10:00
kexec powerpc updates for 6.11 2024-07-19 21:00:33 -07:00
kvm a couple of leaks on failure exits missing fdput() 2024-07-26 10:26:33 -07:00
lib powerpc/qspinlock: Fix deadlock in MCS queue 2024-08-29 15:12:51 +10:00
math-emu
mm powerpc/64e: Define mmu_pte_psize static 2024-08-22 22:52:29 +10:00
net powerpc updates for 6.11 2024-07-19 21:00:33 -07:00
perf powerpc/perf: Set cpumode flags using sample address 2024-06-17 22:47:16 +10:00
platforms Driver core changes for 6.11-rc1 2024-07-25 10:42:22 -07:00
purgatory Makefile: remove redundant tool coverage variables 2024-05-14 23:35:48 +09:00
sysdev of: remove internal arguments from of_property_for_each_u32() 2024-07-25 06:53:47 -05:00
tools powerpc/tools: Pass -mabi=elfv2 to gcc-check-mprofile-kernel.sh 2023-10-20 17:46:33 +11:00
xmon powerpc/xmon: Fix disassembly CPU feature checks 2024-07-11 17:31:54 +10:00
Kbuild powerpc: Fix fatal warnings flag for LLVM's integrated assembler 2024-04-08 16:06:41 +10:00
Kconfig Kbuild updates for v6.11 2024-07-23 14:32:21 -07:00
Kconfig.debug powerpc: Remove 40x from Kconfig and defconfig 2024-06-28 22:28:47 +10:00
Makefile powerpc: Remove 40x from Kconfig and defconfig 2024-06-28 22:28:47 +10:00
Makefile.postlink kbuild: remove ARCH_POSTLINK from module builds 2023-10-28 21:10:08 +09:00