Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking and misc x86 updates from Ingo Molnar:
"Lots of changes in this cycle - in part because locking/core attracted
a number of related x86 low level work which was easier to handle in a
single tree:
- Linux Kernel Memory Consistency Model updates (Alan Stern, Paul E.
McKenney, Andrea Parri)
- lockdep scalability improvements and micro-optimizations (Waiman
Long)
- rwsem improvements (Waiman Long)
- spinlock micro-optimization (Matthew Wilcox)
- qspinlocks: Provide a liveness guarantee (more fairness) on x86
(Peter Zijlstra)
- Add support for relative references in jump tables on arm64, x86
and s390 to optimize jump labels (Ard Biesheuvel, Heiko Carstens)
- Be a lot less permissive on weird (kernel address) uaccess faults
on x86: BUG() when uaccess helpers fault on kernel addresses (Jann
Horn)
- Macrofy x86 asm statements to un-confuse the GCC inliner (Nadav
Amit)
- ... and a handful of other smaller changes as well"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
locking/lockdep: Make global debug_locks* variables read-mostly
locking/lockdep: Fix debug_locks off performance problem
locking/pvqspinlock: Extend node size when pvqspinlock is configured
locking/qspinlock_stat: Count instances of nested lock slowpaths
locking/qspinlock, x86: Provide liveness guarantee
x86/asm: 'Simplify' GEN_*_RMWcc() macros
locking/qspinlock: Rework some comments
locking/qspinlock: Re-order code
locking/lockdep: Remove duplicated 'lock_class_ops' percpu array
x86/defconfig: Enable CONFIG_USB_XHCI_HCD=y
futex: Replace spin_is_locked() with lockdep
locking/lockdep: Make class->ops a percpu counter and move it under CONFIG_DEBUG_LOCKDEP=y
x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs
x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs
x86/extable: Macrofy inline assembly code to work around GCC inlining bugs
x86/paravirt: Work around GCC inlining bugs when compiling paravirt ops
x86/bug: Macrofy the BUG table section handling, to work around GCC inlining bugs
x86/alternatives: Macrofy lock prefixes to work around GCC inlining bugs
x86/refcount: Work around GCC inlining bug
x86/objtool: Use asm macros to work around GCC inlining bugs
...
@@ -28,7 +28,8 @@ Explanation of the Linux-Kernel Memory Consistency Model
   20. THE HAPPENS-BEFORE RELATION: hb
   21. THE PROPAGATES-BEFORE RELATION: pb
   22. RCU RELATIONS: rcu-link, gp, rscs, rcu-fence, and rb
-  23. ODDS AND ENDS
+  23. LOCKING
+  24. ODDS AND ENDS
@@ -1067,28 +1068,6 @@ allowing out-of-order writes like this to occur. The model avoided
 violating the write-write coherence rule by requiring the CPU not to
 send the W write to the memory subsystem at all!)
 
-There is one last example of preserved program order in the LKMM: when
-a load-acquire reads from an earlier store-release. For example:
-
-	smp_store_release(&x, 123);
-	r1 = smp_load_acquire(&x);
-
-If the smp_load_acquire() ends up obtaining the 123 value that was
-stored by the smp_store_release(), the LKMM says that the load must be
-executed after the store; the store cannot be forwarded to the load.
-This requirement does not arise from the operational model, but it
-yields correct predictions on all architectures supported by the Linux
-kernel, although for differing reasons.
-
-On some architectures, including x86 and ARMv8, it is true that the
-store cannot be forwarded to the load. On others, including PowerPC
-and ARMv7, smp_store_release() generates object code that starts with
-a fence and smp_load_acquire() generates object code that ends with a
-fence. The upshot is that even though the store may be forwarded to
-the load, it is still true that any instruction preceding the store
-will be executed before the load or any following instructions, and
-the store will be executed before any instruction following the load.
-
 
 AND THEN THERE WAS ALPHA
 ------------------------
@@ -1766,6 +1745,147 @@ before it does, and the critical section in P2 both starts after P1's
 grace period does and ends after it does.
 
 
+LOCKING
+-------
+
+The LKMM includes locking. In fact, there is special code for locking
+in the formal model, added in order to make tools run faster.
+However, this special code is intended to be more or less equivalent
+to concepts we have already covered. A spinlock_t variable is treated
+the same as an int, and spin_lock(&s) is treated almost the same as:
+
+	while (cmpxchg_acquire(&s, 0, 1) != 0)
+		cpu_relax();
+
+This waits until s is equal to 0 and then atomically sets it to 1,
+and the read part of the cmpxchg operation acts as an acquire fence.
+An alternate way to express the same thing would be:
+
+	r = xchg_acquire(&s, 1);
+
+along with a requirement that at the end, r = 0. Similarly,
+spin_trylock(&s) is treated almost the same as:
+
+	return !cmpxchg_acquire(&s, 0, 1);
+
+which atomically sets s to 1 if it is currently equal to 0 and returns
+true if it succeeds (the read part of the cmpxchg operation acts as an
+acquire fence only if the operation is successful). spin_unlock(&s)
+is treated almost the same as:
+
+	smp_store_release(&s, 0);
+
+The "almost" qualifiers above need some explanation. In the LKMM, the
+store-release in a spin_unlock() and the load-acquire which forms the
+first half of the atomic rmw update in a spin_lock() or a successful
+spin_trylock() -- we can call these things lock-releases and
+lock-acquires -- have two properties beyond those of ordinary releases
+and acquires.
+
+First, when a lock-acquire reads from a lock-release, the LKMM
+requires that every instruction po-before the lock-release must
+execute before any instruction po-after the lock-acquire. This would
+naturally hold if the release and acquire operations were on different
+CPUs, but the LKMM says it holds even when they are on the same CPU.
+For example:
+
+	int x, y;
+	spinlock_t s;
+
+	P0()
+	{
+		int r1, r2;
+
+		spin_lock(&s);
+		r1 = READ_ONCE(x);
+		spin_unlock(&s);
+		spin_lock(&s);
+		r2 = READ_ONCE(y);
+		spin_unlock(&s);
+	}
+
+	P1()
+	{
+		WRITE_ONCE(y, 1);
+		smp_wmb();
+		WRITE_ONCE(x, 1);
+	}
+
+Here the second spin_lock() reads from the first spin_unlock(), and
+therefore the load of x must execute before the load of y. Thus we
+cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
+MP pattern).
+
+This requirement does not apply to ordinary release and acquire
+fences, only to lock-related operations. For instance, suppose P0()
+in the example had been written as:
+
+	P0()
+	{
+		int r1, r2, r3;
+
+		r1 = READ_ONCE(x);
+		smp_store_release(&s, 1);
+		r3 = smp_load_acquire(&s);
+		r2 = READ_ONCE(y);
+	}
+
+Then the CPU would be allowed to forward the s = 1 value from the
+smp_store_release() to the smp_load_acquire(), executing the
+instructions in the following order:
+
+	r3 = smp_load_acquire(&s);	// Obtains r3 = 1
+	r2 = READ_ONCE(y);
+	r1 = READ_ONCE(x);
+	smp_store_release(&s, 1);	// Value is forwarded
+
+and thus it could load y before x, obtaining r2 = 0 and r1 = 1.
+
+Second, when a lock-acquire reads from a lock-release, and some other
+stores W and W' occur po-before the lock-release and po-after the
+lock-acquire respectively, the LKMM requires that W must propagate to
+each CPU before W' does. For example, consider:
+
+	int x, y;
+	spinlock_t s;
+
+	P0()
+	{
+		spin_lock(&s);
+		WRITE_ONCE(x, 1);
+		spin_unlock(&s);
+	}
+
+	P1()
+	{
+		int r1;
+
+		spin_lock(&s);
+		r1 = READ_ONCE(x);
+		WRITE_ONCE(y, 1);
+		spin_unlock(&s);
+	}
+
+	P2()
+	{
+		int r2, r3;
+
+		r2 = READ_ONCE(y);
+		smp_rmb();
+		r3 = READ_ONCE(x);
+	}
+
+If r1 = 1 at the end then the spin_lock() in P1 must have read from
+the spin_unlock() in P0. Hence the store to x must propagate to P2
+before the store to y does, so we cannot have r2 = 1 and r3 = 0.
+
+These two special requirements for lock-release and lock-acquire do
+not arise from the operational model. Nevertheless, kernel developers
+have come to expect and rely on them because they do hold on all
+architectures supported by the Linux kernel, albeit for various
+differing reasons.
+
+
 ODDS AND ENDS
 -------------
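
A cross-check, not part of this commit: the first LOCKING example above
can be recast as a litmus test in the tree's own format (the test name
here is hypothetical). Under the new po-unlock-rf-lock-po ordering,
herd7 should report the exists clause as never satisfied.

	C MP-samecpu-locks-sketch

	(*
	 * Hypothetical sketch of explanation.txt's first LOCKING example.
	 * Expected result: Never, because the second spin_lock() reads
	 * from the first spin_unlock() even though both are on one CPU.
	 *)

	{}

	P0(int *x, int *y, spinlock_t *s)
	{
		int r1;
		int r2;

		spin_lock(s);
		r1 = READ_ONCE(*x);
		spin_unlock(s);
		spin_lock(s);
		r2 = READ_ONCE(*y);
		spin_unlock(s);
	}

	P1(int *x, int *y)
	{
		WRITE_ONCE(*y, 1);
		smp_wmb();
		WRITE_ONCE(*x, 1);
	}

	exists (0:r1=1 /\ 0:r2=0)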
@@ -1831,26 +1951,6 @@ they behave as follows:
 	events and the events preceding them against all po-later
 	events.
 
-The LKMM includes locking. In fact, there is special code for locking
-in the formal model, added in order to make tools run faster.
-However, this special code is intended to be exactly equivalent to
-concepts we have already covered. A spinlock_t variable is treated
-the same as an int, and spin_lock(&s) is treated the same as:
-
-	while (cmpxchg_acquire(&s, 0, 1) != 0)
-		cpu_relax();
-
-which waits until s is equal to 0 and then atomically sets it to 1,
-and where the read part of the atomic update is also an acquire fence.
-An alternate way to express the same thing would be:
-
-	r = xchg_acquire(&s, 1);
-
-along with a requirement that at the end, r = 0. spin_unlock(&s) is
-treated the same as:
-
-	smp_store_release(&s, 0);
-
 Interestingly, RCU and locking each introduce the possibility of
 deadlock. When faced with code sequences such as:
@@ -311,7 +311,7 @@ The smp_wmb() macro orders prior stores against later stores, and the
 smp_rmb() macro orders prior loads against later loads. Therefore, if
 the final value of r0 is 1, the final value of r1 must also be 1.
 
-The the xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
+The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains
 the following write-side code fragment:
 
 	log->l_curr_block -= log->l_logBBsize;
@@ -171,6 +171,12 @@ The Linux-kernel memory model has the following limitations:
 	particular, the "THE PROGRAM ORDER RELATION: po AND po-loc"
 	and "A WARNING" sections).
 
+	Note that this limitation in turn limits LKMM's ability to
+	accurately model address, control, and data dependencies.
+	For example, if the compiler can deduce the value of some variable
+	carrying a dependency, then the compiler can break that dependency
+	by substituting a constant of that value.
+
 2.	Multiple access sizes for a single variable are not supported,
 	and neither are misaligned or partially overlapping accesses.
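
A minimal illustration of that dependency-breaking point (a sketch,
not from the kernel tree): once the compiler can prove the value
loaded into r1, the data dependency to the store disappears from the
object code, so neither the hardware nor LKMM has anything to enforce.

	int r1;

	r1 = READ_ONCE(*x);	/* compiler proves *x can only be 1 */
	WRITE_ONCE(*y, r1);	/* may be emitted as WRITE_ONCE(*y, 1),
				 * leaving no data dependency from the
				 * load to the store */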
@@ -190,6 +196,36 @@ The Linux-kernel memory model has the following limitations:
 	However, a substantial amount of support is provided for these
 	operations, as shown in the linux-kernel.def file.
 
+	a.	When rcu_assign_pointer() is passed NULL, the Linux
+		kernel provides no ordering, but LKMM models this
+		case as a store release.
+
+	b.	The "unless" RMW operations are not currently modeled:
+		atomic_long_add_unless(), atomic_add_unless(),
+		atomic_inc_unless_negative(), and
+		atomic_dec_unless_positive(). These can be emulated
+		in litmus tests, for example, by using atomic_cmpxchg().
+
+	c.	The call_rcu() function is not modeled. It can be
+		emulated in litmus tests by adding another process that
+		invokes synchronize_rcu() and the body of the callback
+		function, with (for example) a release-acquire from
+		the site of the emulated call_rcu() to the beginning
+		of the additional process.
+
+	d.	The rcu_barrier() function is not modeled. It can be
+		emulated in litmus tests emulating call_rcu() via
+		(for example) a release-acquire from the end of each
+		additional call_rcu() process to the site of the
+		emulated rcu-barrier().
+
+	e.	Sleepable RCU (SRCU) is not modeled. It can be
+		emulated, but perhaps not simply.
+
+	f.	Reader-writer locking is not modeled. It can be
+		emulated in litmus tests using atomic read-modify-write
+		operations.
+
 The "herd7" tool has some additional limitations of its own, apart from
 the memory model:
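
Items (b) and (c) above suggest emulations; the two sketches below
(hypothetical, not files from the tree) show one way each might look.
First, a single attempt of atomic_add_unless(v, 1, 42) emulated with
atomic_cmpxchg(). herd7 cannot express the retry loop, so a racing
failure simply abandons the update; with no other process racing, the
cmpxchg should always succeed and the exists clause should always hold.

	C emulate-add-unless-sketch

	(*
	 * Hypothetical sketch: one attempt of atomic_add_unless(v, 1, 42),
	 * emulated with atomic_cmpxchg() as item (b) suggests.
	 *)

	{}

	P0(atomic_t *v)
	{
		int r0;
		int r1;

		r0 = atomic_read(v);
		if (r0 != 42) {
			r1 = atomic_cmpxchg(v, r0, r0 + 1);
		}
	}

	exists (v=1)

Second, the call_rcu() emulation from item (c): P1 is the added
process that runs the grace period and the callback body, and the
release-acquire pair on c stands in for the ordering from the
emulated call_rcu() site to the start of that process. Under the
model's RCU guarantee, the exists clause should never be satisfied.

	C emulate-call-rcu-sketch

	(*
	 * Hypothetical sketch: P0 updates x and then "invokes call_rcu()"
	 * (the store-release to c); P1 waits for a grace period and runs
	 * the callback body (the store to y); P2 is an RCU reader.
	 *)

	{}

	P0(int *x, int *c)
	{
		WRITE_ONCE(*x, 1);
		smp_store_release(c, 1);
	}

	P1(int *y, int *c)
	{
		int r0;

		r0 = smp_load_acquire(c);
		synchronize_rcu();
		WRITE_ONCE(*y, 1);
	}

	P2(int *x, int *y)
	{
		int r1;
		int r2;

		rcu_read_lock();
		r1 = READ_ONCE(*y);
		r2 = READ_ONCE(*x);
		rcu_read_unlock();
	}

	exists (1:r0=1 /\ 2:r1=1 /\ 2:r2=0)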
@@ -204,3 +240,6 @@ the memory model:
 Some of these limitations may be overcome in the future, but others are
 more likely to be addressed by incorporating the Linux-kernel memory model
 into other tools.
+
+Finally, please note that LKMM is subject to change as hardware, use cases,
+and compilers evolve.
@@ -38,7 +38,7 @@ let strong-fence = mb | gp
 (* Release Acquire *)
 let acq-po = [Acquire] ; po ; [M]
 let po-rel = [M] ; po ; [Release]
-let rfi-rel-acq = [Release] ; rfi ; [Acquire]
+let po-unlock-rf-lock-po = po ; [UL] ; rf ; [LKR] ; po
 
 (**********************************)
 (* Fundamental coherence ordering *)
@@ -60,13 +60,13 @@ let dep = addr | data
 let rwdep = (dep | ctrl) ; [W]
 let overwrite = co | fr
 let to-w = rwdep | (overwrite & int)
-let to-r = addr | (dep ; rfi) | rfi-rel-acq
+let to-r = addr | (dep ; rfi)
 let fence = strong-fence | wmb | po-rel | rmb | acq-po
-let ppo = to-r | to-w | fence
+let ppo = to-r | to-w | fence | (po-unlock-rf-lock-po & int)
 
 (* Propagation: Ordering from release operations and strong fences. *)
 let A-cumul(r) = rfe? ; r
-let cumul-fence = A-cumul(strong-fence | po-rel) | wmb
+let cumul-fence = A-cumul(strong-fence | po-rel) | wmb | po-unlock-rf-lock-po
 let prop = (overwrite & ext)? ; cumul-fence* ; rfe?
 
 (*
@@ -1,11 +1,10 @@
 C ISA2+pooncelock+pooncelock+pombonce
 
 (*
- * Result: Sometimes
+ * Result: Never
  *
- * This test shows that the ordering provided by a lock-protected S
- * litmus test (P0() and P1()) are not visible to external process P2().
- * This is likely to change soon.
+ * This test shows that write-write ordering provided by locks
+ * (in P0() and P1()) is visible to external process P2().
 *)
 
 {}
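
For reference, the shape of this test can be reconstructed from its
name (ISA2 pattern; two lock-protected processes, "pooncelock"; one
process whose accesses are separated by smp_mb(), "pombonce") and from
the second LOCKING example in explanation.txt above. The sketch below
follows that reading and may differ in detail from the file in the
tree; under the updated model its exists clause should never be
satisfied, matching the new "Result: Never".

	P0(int *x, int *y, spinlock_t *mylock)
	{
		spin_lock(mylock);
		WRITE_ONCE(*x, 1);
		WRITE_ONCE(*y, 1);
		spin_unlock(mylock);
	}

	P1(int *y, int *z, spinlock_t *mylock)
	{
		int r0;

		spin_lock(mylock);
		r0 = READ_ONCE(*y);
		WRITE_ONCE(*z, 1);
		spin_unlock(mylock);
	}

	P2(int *x, int *z)
	{
		int r1;
		int r2;

		r1 = READ_ONCE(*z);
		smp_mb();
		r2 = READ_ONCE(*x);
	}

	exists (1:r0=1 /\ 2:r1=1 /\ 2:r2=0)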
@@ -1,4 +1,6 @@
-This directory contains the following litmus tests:
+============
+LITMUS TESTS
+============
 
 CoRR+poonceonce+Once.litmus
 	Test of read-read coherence, that is, whether or not two
@@ -36,7 +38,7 @@ IRIW+poonceonces+OnceOnce.litmus
 ISA2+pooncelock+pooncelock+pombonce.litmus
 	Tests whether the ordering provided by a lock-protected S
 	litmus test is visible to an external process whose accesses are
-	separated by smp_mb(). This addition of an external process to
+	separated by smp_mb().  This addition of an external process to
 	S is otherwise known as ISA2.
 
 ISA2+poonceonces.litmus
@@ -151,3 +153,101 @@ Z6.0+pooncerelease+poacquirerelease+fencembonceonce.litmus
 A great many more litmus tests are available here:
 
 	https://github.com/paulmckrcu/litmus
+
+==================
+LITMUS TEST NAMING
+==================
+
+Litmus tests are usually named based on their contents, which means that
+looking at the name tells you what the litmus test does. The naming
+scheme covers litmus tests having a single cycle that passes through
+each process exactly once, so litmus tests not fitting this description
+are named on an ad-hoc basis.
+
+The structure of a litmus-test name is the litmus-test class, a plus
+sign ("+"), and one string for each process, separated by plus signs.
+The end of the name is ".litmus".
+
+The litmus-test classes may be found in the infamous test6.pdf:
+
+	https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
+
+Each class defines the pattern of accesses and of the variables accessed.
+For example, if the one process writes to a pair of variables, and
+the other process reads from these same variables, the corresponding
+litmus-test class is "MP" (message passing), which may be found on the
+left-hand end of the second row of tests on page one of test6.pdf.
+
+The strings used to identify the actions carried out by each process are
+complex due to a desire to have short(er) names. Thus, there is a tool to
+generate these strings from a given litmus test's actions. For example,
+consider the processes from SB+rfionceonce-poonceonces.litmus:
+
+	P0(int *x, int *y)
+	{
+		int r1;
+		int r2;
+
+		WRITE_ONCE(*x, 1);
+		r1 = READ_ONCE(*x);
+		r2 = READ_ONCE(*y);
+	}
+
+	P1(int *x, int *y)
+	{
+		int r3;
+		int r4;
+
+		WRITE_ONCE(*y, 1);
+		r3 = READ_ONCE(*y);
+		r4 = READ_ONCE(*x);
+	}
+
+The next step is to construct a space-separated list of descriptors,
+interleaving descriptions of the relation between a pair of consecutive
+accesses with descriptions of the second access in the pair.
+
+P0()'s WRITE_ONCE() is read by its first READ_ONCE(), which is a
+reads-from link (rf) and internal to the P0() process. This is
+"rfi", which is an abbreviation for "reads-from internal". Because
+some of the tools string these abbreviations together with space
+characters separating processes, the first character is capitalized,
+resulting in "Rfi".
+
+P0()'s second access is a READ_ONCE(), as opposed to (for example)
+smp_load_acquire(), so next is "Once". Thus far, we have "Rfi Once".
+
+P0()'s third access is also a READ_ONCE(), but to y rather than x.
+This is related to P0()'s second access by program order ("po"),
+to a different variable ("d"), and both accesses are reads ("RR").
+The resulting descriptor is "PodRR". Because P0()'s third access is
+READ_ONCE(), we add another "Once" descriptor.
+
+A from-read ("fre") relation links P0()'s third to P1()'s first
+access, and the resulting descriptor is "Fre". P1()'s first access is
+WRITE_ONCE(), which as before gives the descriptor "Once". The string
+thus far is thus "Rfi Once PodRR Once Fre Once".
+
+The remainder of P1() is similar to P0(), which means we add
+"Rfi Once PodRR Once". Another fre links P1()'s last access to
+P0()'s first access, which is WRITE_ONCE(), so we add "Fre Once".
+The full string is thus:
+
+	Rfi Once PodRR Once Fre Once Rfi Once PodRR Once Fre Once
+
+This string can be given to the "norm7" and "classify7" tools to
+produce the name:
+
+	$ norm7 -bell linux-kernel.bell \
+		Rfi Once PodRR Once Fre Once Rfi Once PodRR Once Fre Once | \
+		sed -e 's/:.*//g'
+	SB+rfionceonce-poonceonces
+
+Adding the ".litmus" suffix: SB+rfionceonce-poonceonces.litmus
+
+The descriptors that describe connections between consecutive accesses
+within the cycle through a given litmus test can be provided by the herd
+tool (Rfi, Po, Fre, and so on) or by the linux-kernel.bell file (Once,
+Release, Acquire, and so on).
+
+To see the full list of descriptors, execute the following command:
+
+	$ diyone7 -bell linux-kernel.bell -show edges
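
The walkthrough above stops at the process bodies. For completeness,
the full test presumably also carries an initializer and an exists
clause describing the named cycle: the two Fre links force both
cross-variable reads to zero, while the Rfi links force the
same-variable reads to one. A reconstruction (not necessarily the
exact clause in the tree):

	{}

	(* P0() and P1() exactly as shown in the walkthrough above *)

	exists (0:r1=1 /\ 0:r2=0 /\ 1:r3=1 /\ 1:r4=0)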
@@ -30,9 +30,9 @@
 #define EX_ORIG_OFFSET		0
 #define EX_NEW_OFFSET		4
 
-#define JUMP_ENTRY_SIZE		24
+#define JUMP_ENTRY_SIZE		16
 #define JUMP_ORIG_OFFSET	0
-#define JUMP_NEW_OFFSET		8
+#define JUMP_NEW_OFFSET		4
 
 #define ALT_ENTRY_SIZE		13
 #define ALT_ORIG_OFFSET		0
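
These constants line up with the "relative references in jump tables"
work mentioned in the pull-request text: on x86-64 a jump-table entry
shrinks from three 8-byte pointers (24 bytes) to two 4-byte relative
offsets plus a pointer-sized key (16 bytes). A sketch of the layout
the new offsets assume; the field names here are illustrative, not
necessarily the kernel's:

	struct jump_entry {
		s32 code;	/* JUMP_ORIG_OFFSET = 0: the jump/nop site */
		s32 target;	/* JUMP_NEW_OFFSET  = 4: the branch target */
		long key;	/* static_key reference, 8 bytes on 64-bit */
	};			/* total: JUMP_ENTRY_SIZE = 16 */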