linux/kernel/locking
Thomas Gleixner dbb26055de locking/rtmutex: Prevent dequeue vs. unlock race
David reported a futex/rtmutex state corruption. It's caused by the
following problem:

CPU0		CPU1		CPU2

l->owner=T1
		rt_mutex_lock(l)
		lock(l->wait_lock)
		l->owner = T1 | HAS_WAITERS;
		enqueue(T2)
		boost()
		  unlock(l->wait_lock)
		schedule()

				rt_mutex_lock(l)
				lock(l->wait_lock)
				l->owner = T1 | HAS_WAITERS;
				enqueue(T3)
				boost()
				  unlock(l->wait_lock)
				schedule()
		signal(->T2)	signal(->T3)
		lock(l->wait_lock)
		dequeue(T2)
		deboost()
		  unlock(l->wait_lock)
				lock(l->wait_lock)
				dequeue(T3)
				  ===> wait list is now empty
				deboost()
				 unlock(l->wait_lock)
		lock(l->wait_lock)
		fixup_rt_mutex_waiters()
		  if (wait_list_empty(l)) {
		    owner = l->owner & ~HAS_WAITERS;
		    l->owner = owner
		     ==> l->owner = T1
		  }

				lock(l->wait_lock)
rt_mutex_unlock(l)		fixup_rt_mutex_waiters()
				  if (wait_list_empty(l)) {
				    owner = l->owner & ~HAS_WAITERS;
cmpxchg(l->owner, T1, NULL)
 ===> Success (l->owner = NULL)
				    l->owner = owner
				     ==> l->owner = T1
				  }

That means the problem is caused by fixup_rt_mutex_waiters() which does the
RMW to clear the waiters bit unconditionally when there are no waiters in
the rtmutexes rbtree.

This can be fatal: A concurrent unlock can release the rtmutex in the
fastpath because the waiters bit is not set. If the cmpxchg() gets in the
middle of the RMW operation then the previous owner, which just unlocked
the rtmutex is set as the owner again when the write takes place after the
successfull cmpxchg().

The solution is rather trivial: verify that the owner member of the rtmutex
has the waiters bit set before clearing it. This does not require a
cmpxchg() or other atomic operations because the waiters bit can only be
set and cleared with the rtmutex wait_lock held. It's also safe against the
fast path unlock attempt. The unlock attempt via cmpxchg() will either see
the bit set and take the slowpath or see the bit cleared and release it
atomically in the fastpath.

It's remarkable that the test program provided by David triggers on ARM64
and MIPS64 really quick, but it refuses to reproduce on x86-64, while the
problem exists there as well. That refusal might explain that this got not
discovered earlier despite the bug existing from day one of the rtmutex
implementation more than 10 years ago.

Thanks to David for meticulously instrumenting the code and providing the
information which allowed to decode this subtle problem.

Reported-by: David Daney <ddaney@caviumnetworks.com>
Tested-by: David Daney <david.daney@cavium.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: stable@vger.kernel.org
Fixes: 23f78d4a03 ("[PATCH] pi-futex: rt mutex core")
Link: http://lkml.kernel.org/r/20161130210030.351136722@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-02 11:13:26 +01:00
..
lockdep_internals.h lockdep: Limit static allocations if PROVE_LOCKING_SMALL is defined 2016-11-18 11:33:19 -08:00
lockdep_proc.c lockdep: Fix lock_chain::base size 2016-04-23 13:53:03 +02:00
lockdep_states.h
lockdep.c locking/lockdep: Use __jhash_mix() for iterate_chain_key() 2016-06-08 14:22:00 +02:00
locktorture.c lcoking/locktorture: Simplify the torture_runnable computation 2016-04-28 10:57:51 +02:00
Makefile locking/lglock: Remove lglock implementation 2016-09-22 15:25:56 +02:00
mcs_spinlock.h locking/mcs: Fix mcs_spin_lock() ordering 2016-02-29 10:02:41 +01:00
mutex-debug.c locking: avoid passing around 'thread_info' in mutex debugging code 2016-06-23 12:11:17 -07:00
mutex-debug.h Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2016-07-25 12:41:29 -07:00
mutex.c locking: avoid passing around 'thread_info' in mutex debugging code 2016-06-23 12:11:17 -07:00
mutex.h Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2016-07-25 12:41:29 -07:00
osq_lock.c locking/osq: Fix ordering of node initialisation in osq_lock 2015-12-17 11:40:29 -08:00
percpu-rwsem.c locking/percpu-rwsem: Optimize readers and reduce global impact 2016-08-10 14:34:01 +02:00
qrwlock.c locking/atomic, arch/qrwlock: Employ atomic_fetch_add_acquire() 2016-06-16 10:48:34 +02:00
qspinlock_paravirt.h locking/pv-qspinlock: Use cmpxchg_release() in __pv_queued_spin_unlock() 2016-09-22 15:25:51 +02:00
qspinlock_stat.h locking/pvstat: Separate wait_again and spurious wakeup stats 2016-08-10 14:16:02 +02:00
qspinlock.c locking/qspinlock: Use __this_cpu_dec() instead of full-blown this_cpu_dec() 2016-06-27 11:37:41 +02:00
rtmutex_common.h rtmutex: Delete scriptable tester 2015-07-20 11:45:45 +02:00
rtmutex-debug.c rtmutex: Cleanup deadlock detector debug logic 2014-06-21 22:05:30 +02:00
rtmutex-debug.h rtmutex: Cleanup deadlock detector debug logic 2014-06-21 22:05:30 +02:00
rtmutex.c locking/rtmutex: Prevent dequeue vs. unlock race 2016-12-02 11:13:26 +01:00
rtmutex.h rtmutex: Cleanup deadlock detector debug logic 2014-06-21 22:05:30 +02:00
rwsem-spinlock.c locking/rwsem: Introduce basis for down_write_killable() 2016-04-13 10:42:20 +02:00
rwsem-xadd.c locking/rwsem: Scan the wait_list for readers only once 2016-08-18 15:37:11 +02:00
rwsem.c locking/rwsem: Add reader-owned state to the owner field 2016-06-08 15:16:59 +02:00
rwsem.h locking/rwsem: Protect all writes to owner by WRITE_ONCE() 2016-06-08 15:16:59 +02:00
semaphore.c locking/semaphore: Resolve some shadow warnings 2014-09-04 07:17:24 +02:00
spinlock_debug.c locking: Move the spinlock code to kernel/locking/ 2013-11-06 07:55:21 +01:00
spinlock.c spinlock: Add spin_lock_bh_nested() 2015-01-03 14:32:57 -05:00