linux/Documentation
Paul E. McKenney 23b5c8fa01 rcu: Decrease memory-barrier usage based on semi-formal proof
(Note: this was reverted, and is now being re-applied in pieces, with
this being the fifth and final piece.  See below for the reason that
it is now felt to be safe to re-apply this.)

Commit d09b62d fixed grace-period synchronization, but left in place some
smp_mb() invocations in rcu_process_callbacks() that are no longer needed;
sheer paranoia prevented them from being removed.  This commit removes
them and provides a proof of correctness in their absence.  It also adds
a memory barrier to rcu_report_qs_rsp() immediately before the update to
rsp->completed in order to handle the theoretical possibility that the
compiler or CPU might move massive quantities of code into a lock-based
critical section.  This also proves that the sheer paranoia was not
entirely unjustified, at least from a theoretical point of view.

In addition, the old dyntick-idle synchronization depended on the fact
that grace periods were many milliseconds in duration, so that it could
be assumed that no dyntick-idle CPU could reorder a memory reference
across an entire grace period.  Unfortunately for this design, the
addition of expedited grace periods breaks this assumption, which has
the unfortunate side-effect of requiring atomic operations in the
functions that track dyntick-idle state for RCU.  (There is some hope
that the algorithms used in user-level RCU might be applied here, but
some work is required to handle the NMIs that user-space applications
can happily ignore.  For the short term, better safe than sorry.)

This proof assumes that neither compiler nor CPU will allow a lock
acquisition and release to be reordered, as doing so can result in
deadlock.  The proof is as follows:

1.	A given CPU declares a quiescent state under the protection of
	its leaf rcu_node's lock.

2.	If there is more than one level of rcu_node hierarchy, the
	last CPU to declare a quiescent state will also acquire the
	->lock of the next rcu_node up in the hierarchy, but only
	after releasing the lower level's lock.  The acquisition of this
	lock clearly cannot occur prior to the acquisition of the leaf
	node's lock.

3.	Step 2 repeats until we reach the root rcu_node structure.
	Please note again that only one lock is held at a time through
	this process.  The acquisition of the root rcu_node's ->lock
	must occur after the release of that of the leaf rcu_node.

4.	At this point, we set the ->completed field in the rcu_state
	structure in rcu_report_qs_rsp().  However, if the rcu_node
	hierarchy contains only one rcu_node, then in theory the code
	preceding the quiescent state could leak into the critical
	section.  We therefore precede the update of ->completed with a
	memory barrier.  All CPUs will therefore agree that any updates
	preceding any report of a quiescent state will have happened
	before the update of ->completed.

5.	Regardless of whether a new grace period is needed, rcu_start_gp()
	will propagate the new value of ->completed to all of the leaf
	rcu_node structures, under the protection of each rcu_node's ->lock.
	If a new grace period is needed immediately, this propagation
	will occur in the same critical section that ->completed was
	set in, but courtesy of the memory barrier in #4 above, is still
	seen to follow any pre-quiescent-state activity.

6.	When a given CPU invokes __rcu_process_gp_end(), it becomes
	aware of the end of the old grace period and therefore makes
	any RCU callbacks that were waiting on that grace period eligible
	for invocation.

	If this CPU is the same one that detected the end of the grace
	period, and if there is but a single rcu_node in the hierarchy,
	we will still be in the single critical section.  In this case,
	the memory barrier in step #4 guarantees that all callbacks will
	be seen to execute after each CPU's quiescent state.

	On the other hand, if this is a different CPU, it will acquire
	the leaf rcu_node's ->lock, and will again be serialized after
	each CPU's quiescent state for the old grace period.
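
The locking pattern in steps 1-4 can be sketched in user-space C as follows.
This is only an illustrative model, not the kernel code: the names
node_model, state_model, node_lock(), node_unlock(), report_qs() and
locks_held are hypothetical, a C11 atomic_flag stands in for each rcu_node
->lock, and atomic_thread_fence() stands in for smp_mb().  The locks_held
counter is a plain int, so the check that only one lock is held at a time
is meaningful only in a single-threaded demonstration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for an rcu_node structure. */
struct node_model {
	atomic_flag lock;		/* models ->lock */
	unsigned long qsmask;		/* children still owing a quiescent state */
	unsigned long grpmask;		/* this node's bit in its parent's qsmask */
	struct node_model *parent;	/* NULL for the root */
};

/* Hypothetical stand-in for the rcu_state structure. */
struct state_model {
	unsigned long gpnum;		/* current grace-period number */
	atomic_ulong completed;		/* last completed grace period */
};

int locks_held;	/* demonstration-only check: never more than one */

void node_lock(struct node_model *n)
{
	while (atomic_flag_test_and_set_explicit(&n->lock,
						 memory_order_acquire))
		;	/* spin */
	locks_held++;
	assert(locks_held == 1);
}

void node_unlock(struct node_model *n)
{
	locks_held--;
	atomic_flag_clear_explicit(&n->lock, memory_order_release);
}

/* Report a quiescent state; returns 1 if it ended the grace period. */
int report_qs(struct state_model *sp, struct node_model *leaf,
	      unsigned long mask)
{
	struct node_model *n = leaf;

	for (;;) {
		node_lock(n);			/* step 1, then step 2 */
		n->qsmask &= ~mask;
		if (n->qsmask != 0) {		/* others still pending */
			node_unlock(n);
			return 0;
		}
		if (n->parent == NULL)
			break;			/* step 3: root reached */
		mask = n->grpmask;
		node_unlock(n);			/* release before going up */
		n = n->parent;
	}
	/* Step 4: full barrier before publishing the new ->completed. */
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&sp->completed, sp->gpnum,
			      memory_order_relaxed);
	node_unlock(n);
	return 1;
}
```

With two leaves under one root, the first report_qs() call returns 0 and
leaves completed untouched; the second walks up to the root, executes the
fence, and sets completed to gpnum.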

On the strength of this proof, this commit therefore removes the memory
barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp().
The effect is to reduce the number of memory barriers by one and to
reduce the frequency of execution from about once per scheduling tick
per CPU to once per grace period.

This was reverted due to hangs found during testing by Yinghai Lu and
Ingo Molnar.  Frederic Weisbecker supplied Yinghai with tracing that
located the underlying problem, and Frederic also provided the fix.

The underlying problem was that the HARDIRQ_ENTER() macro from
lib/locking-selftest.c invoked irq_enter(), which in turn invokes
rcu_irq_enter(), but HARDIRQ_EXIT() invoked __irq_exit(), which
does not invoke rcu_irq_exit().  This situation resulted in calls
to rcu_irq_enter() that were not balanced by the required calls to
rcu_irq_exit().  Therefore, after these locking selftests completed,
RCU's dyntick-idle nesting count was a large number (for example,
72), which caused RCU to conclude that the affected CPU was not in
dyntick-idle mode when in fact it was.

RCU would therefore incorrectly wait for this dyntick-idle CPU, resulting
in hangs.
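
The effect of the imbalance can be illustrated with a small user-space
model.  This is a sketch under simplifying assumptions, not the kernel's
implementation: cpu_model, model_rcu_irq_enter(), model_selftests(),
model_enter_idle() and model_rcu_sees_idle() are hypothetical names, the
counter starts at 1 for a running task, and a value of zero is taken to
mean dyntick-idle.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU dyntick-idle nesting counter. */
struct cpu_model {
	int nesting;	/* 0 means "dyntick-idle" in this model */
};

/* Models rcu_irq_enter(): one more level of interrupt nesting. */
void model_rcu_irq_enter(struct cpu_model *c)
{
	c->nesting++;
}

/*
 * Models n HARDIRQ_ENTER()/HARDIRQ_EXIT() pairs from the selftests.
 * Before the fix, entry informs RCU but exit (via __irq_exit()) does
 * not, so nothing ever decrements the counter.  After the fix, neither
 * side informs RCU.
 */
void model_selftests(struct cpu_model *c, int n, bool fixed)
{
	for (int i = 0; i < n; i++) {
		if (!fixed)
			model_rcu_irq_enter(c);
		/* exit path: no matching decrement in either case */
	}
}

/* Models the CPU going idle: drop the task-level count. */
void model_enter_idle(struct cpu_model *c)
{
	c->nesting--;
}

/* RCU's view in this model: idle only if the count has reached zero. */
bool model_rcu_sees_idle(const struct cpu_model *c)
{
	return c->nesting == 0;
}
```

Starting from a count of 1, running 72 unfixed selftest pairs and then
going idle leaves the count at 72, so the model never sees the CPU as
idle; with the fix, the count returns to zero.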

In contrast, with Frederic's patch, which replaces the irq_enter()
in HARDIRQ_ENTER() with an __irq_enter(), these tests don't ever call
either rcu_irq_enter() or rcu_irq_exit(), which works because the CPU
running the test is already marked as not being in dyntick-idle mode.
This means that the rcu_irq_enter() and rcu_irq_exit() calls stay balanced,
and RCU then has no problem working out which CPUs are in dyntick-idle
mode and which are not.

The reason that the imbalance was not noticed before the barrier patch
was applied is that the old implementation of rcu_enter_nohz() ignored
the nesting depth.  This could still result in delays, but much shorter
ones.  Whenever there was a delay, RCU would IPI the CPU with the
unbalanced nesting level, which would eventually result in rcu_enter_nohz()
being called, which in turn would force RCU to see that the CPU was in
dyntick-idle mode.

The reason that very few people noticed the problem is that the mismatched
irq_enter() vs. __irq_exit() occurred only when the kernel was built with
CONFIG_DEBUG_LOCKING_API_SELFTESTS.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-26 09:42:23 -07:00
ABI Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6 2011-04-07 11:14:49 -07:00
accounting taskstats: pad taskstats netlink response for aligment issues on ia64 2010-12-22 19:43:34 -08:00
acpi ACPI, APEI, Add PCIe AER error information printing support 2011-03-21 22:59:08 -04:00
aoe Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
arm Fix common misspellings 2011-03-31 11:26:23 -03:00
auxdisplay includecheck fix: Documentation, cfag12864b-example.c 2009-09-24 07:20:57 -07:00
blackfin Blackfin: document SPI CS limitations with CPHA=0 2010-08-06 12:55:52 -04:00
block Fix common misspellings 2011-03-31 11:26:23 -03:00
blockdev Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
cdrom Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
cgroups memcg: update documentation to describe usage_in_bytes 2011-04-28 11:28:21 -07:00
connector Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
console doc: fix console doc typo 2010-02-24 13:51:32 +01:00
cpu-freq [CPUFREQ] Add documentation for sampling_down_factor 2011-03-16 17:54:31 -04:00
cpuidle
cris
crypto async_tx: add support for asynchronous RAID6 recovery operations 2009-08-29 19:09:27 -07:00
development-process docs: update the development process document 2011-03-25 14:30:31 -06:00
device-mapper Fix common misspellings 2011-03-31 11:26:23 -03:00
devicetree Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6 2011-04-07 11:14:49 -07:00
DocBook Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6 2011-04-27 15:17:52 -07:00
driver-model driver core: prune docs about device_interface 2010-11-10 16:57:11 -08:00
dvb Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6 2011-04-07 11:14:49 -07:00
early-userspace
fault-injection lkdtm: add debugfs access and loosen KPROBE ties 2010-03-06 11:26:32 -08:00
fb Fix common misspellings 2011-03-31 11:26:23 -03:00
filesystems rcu: move TREE_RCU from softirq to kthread 2011-05-05 23:16:54 -07:00
firmware_class firmware: Update hotplug script 2010-08-05 13:53:34 -07:00
frv
hwmon hwmon: (adm1021) Clarify documentation regarding Xeon processors 2011-04-29 16:33:36 +02:00
i2c Fix common misspellings 2011-03-31 11:26:23 -03:00
i2o Fix common misspellings 2011-03-31 11:26:23 -03:00
ia64 Fix common misspellings 2011-03-31 11:26:23 -03:00
ide
infiniband Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
input Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input 2011-04-18 13:29:03 -07:00
ioctl Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6 2011-03-24 09:50:13 -07:00
isdn Fix common misspellings 2011-03-31 11:26:23 -03:00
ja_JP Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
kbuild Fix common misspellings 2011-03-31 11:26:23 -03:00
kdump kdump: update kexec-tools URL and Vivek's email 2010-11-25 14:36:38 +01:00
ko_KR Docs/Kconfig: Update: osdl.org -> linuxfoundation.org 2010-11-15 23:50:13 +01:00
kvm Fix common misspellings 2011-03-31 11:26:23 -03:00
laptops Documentation: fix minor typos/spelling 2011-04-04 17:51:47 -07:00
leds Documentation: consolidate leds files to leds/ subdir 2011-04-04 17:51:47 -07:00
lguest lguest: document --rng in example Launcher 2011-01-20 21:37:29 +10:30
m68k
make kbuild: introduce HDR_ARCH_LIST for headers_install_all 2010-12-14 22:16:19 +01:00
mips Fix common misspellings 2011-03-31 11:26:23 -03:00
misc-devices Fix common misspellings 2011-03-31 11:26:23 -03:00
mmc mmc: add erase, secure erase, trim and secure trim operations 2010-08-12 08:43:30 -07:00
mn10300
mtd Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
namespaces
netlabel Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
networking Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6 2011-04-07 11:14:49 -07:00
nfc NFC: Driver for NXP Semiconductors PN544 NFC chip. 2011-01-13 08:03:19 -08:00
parisc
PCI Fix common misspellings 2011-03-31 11:26:23 -03:00
pcmcia pcmcia: use autoconfiguration feature for ioports and iomem 2010-09-29 17:20:24 +02:00
power Fix common misspellings 2011-03-31 11:26:23 -03:00
powerpc Fix common misspellings 2011-03-31 11:26:23 -03:00
pps pps: add parallel port PPS signal generator 2011-01-13 08:03:21 -08:00
prctl
rapidio rapidio: add RapidIO documentation 2011-03-23 19:46:41 -07:00
RCU rcu: Decrease memory-barrier usage based on semi-formal proof 2011-05-26 09:42:23 -07:00
s390 Documentation: fix minor typos/spelling 2011-04-04 17:51:47 -07:00
scheduler sched, doc: Beef up load balancing description 2011-03-31 13:00:35 +02:00
scsi Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6 2011-04-07 11:14:49 -07:00
serial Fix common misspellings 2011-03-31 11:26:23 -03:00
sh sh: clkfwk: Kill off unused clk_set_rate_ex(). 2010-11-15 18:25:12 +09:00
sound Merge branch 'fix/hda' into for-linus 2011-04-21 12:44:38 +02:00
sparc
spi Fix common misspellings 2011-03-31 11:26:23 -03:00
sysctl Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2011-03-18 10:37:40 -07:00
target [SCSI] tcm_mod_builder.py: Fix generated *_drop_nodeacl() handler 2011-03-23 11:36:45 -05:00
telephony Fix common misspellings 2011-03-31 11:26:23 -03:00
thermal thermal: Add event notification to thermal framework 2011-01-12 00:08:35 -05:00
timers tree-wide: fix comment/printk typos 2010-11-01 15:38:34 -04:00
trace Fix common misspellings 2011-03-31 11:26:23 -03:00
uml Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
usb USB: usbmon: fix-up docs and text API for sparse ISO 2011-02-04 11:46:57 -08:00
video4linux Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6 2011-04-27 15:17:52 -07:00
vm Fix common misspellings 2011-03-31 11:26:23 -03:00
w1 Fix common misspellings 2011-03-31 11:26:23 -03:00
watchdog Fix common misspellings 2011-03-31 11:26:23 -03:00
wimax
x86 move x86 specific oops=panic to generic code 2011-03-22 17:44:11 -07:00
zh_CN Fix spelling mistakes in Documentation/zh_CN/SubmittingPatches 2011-02-28 19:30:48 -08:00
.gitignore add random binaries to .gitignore 2010-04-08 11:34:34 +02:00
00-INDEX Documentation: consolidate leds files to leds/ subdir 2011-04-04 17:51:47 -07:00
apparmor.txt AppArmor: update Maintainer and Documentation 2010-08-02 15:35:15 +10:00
applying-patches.txt
atomic_ops.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
bad_memory.txt
basic_profiling.txt
binfmt_misc.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
braille-console.txt
bt8xxgpio.txt
btmrvl.txt Bluetooth: Add documentation for Marvell Bluetooth driver 2009-08-22 14:25:32 -07:00
BUG-HUNTING
bus-virt-phys-mapping.txt documentation: fix almost duplicate filenames (IO/io-mapping.txt) 2010-07-20 17:49:30 +00:00
cachetlb.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
Changes Documentation/Changes: minor corrections 2011-03-22 17:44:17 -07:00
circular-buffers.txt Document Linux's circular buffering capabilities 2010-03-24 16:31:22 -07:00
coccinelle.txt scripts/coccinelle: update for compatability with Coccinelle 0.2.4 2010-12-03 12:27:01 +01:00
CodingStyle Documentation/CodingStyle: flesh out if-else examples 2011-03-22 17:44:16 -07:00
cpu-hotplug.txt Fix common misspellings 2011-03-31 11:26:23 -03:00
cpu-load.txt
cputopology.txt topology/sysfs: Provide book id and siblings attributes 2010-09-09 20:41:25 +02:00
credentials.txt CRED: Fix __task_cred()'s lockdep check and banner comment 2010-07-29 15:16:18 -07:00
dcdbas.txt
debugging-modules.txt
debugging-via-ohci1394.txt ieee1394: update URLs in debugging-via-ohci1394.txt 2009-10-03 09:28:11 +02:00
dell_rbu.txt Fix common misspellings 2011-03-31 11:26:23 -03:00
devices.txt Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6 2010-10-28 09:35:11 -07:00
DMA-API-HOWTO.txt Documentation: DMA-API-HOWTO.txt: rename ARCH_KMALLOC_MINALIGN to ARCH_DMA_MINALIGN 2010-08-14 11:56:46 -07:00
DMA-API.txt dma-mapping: remove dma_is_consistent API 2010-08-11 08:59:21 -07:00
DMA-attributes.txt
DMA-ISA-LPC.txt
dmaengine.txt
dontdiff Documentation/dontdiff: add further autogenerated files to ignore list 2011-01-06 09:59:37 -08:00
dynamic-debug-howto.txt Merge branch 'docs-next' of git://git.lwn.net/linux-2.6 2011-03-27 19:46:59 -07:00
edac.txt Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6 2011-04-07 11:14:49 -07:00
eisa.txt Fix common misspellings 2011-03-31 11:26:23 -03:00
email-clients.txt Documentation/email-clients.txt: update Thunderbird docs with wordwrap plugin 2011-01-13 08:03:15 -08:00
feature-removal-schedule.txt asus-laptop: remove removed features from feature-removal-schedule.txt 2011-04-01 14:23:50 -04:00
flexible-arrays.txt Update flex_arrays.txt 2009-10-15 07:25:20 -06:00
futex-requeue-pi.txt
gcov.txt trivial: fix typo in CONFIG_DEBUG_FS in gcov doc 2009-09-21 15:14:56 +02:00
gpio.txt Revert "gpiolib: annotate gpio-intialization with __must_check" 2011-01-13 17:26:46 -08:00
highuid.txt
HOWTO Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
hw_random.txt
hwspinlock.txt drivers: hwspinlock: add framework 2011-02-17 09:52:03 -08:00
init.txt init/main.c: improve usability in case of init binary failure 2010-03-06 11:26:29 -08:00
initrd.txt
intel_txt.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
Intel-IOMMU.txt intel-iommu: Kill DMAR_BROKEN_GFX_WA option. 2009-09-19 09:37:23 -07:00
io_ordering.txt
io-mapping.txt
iostats.txt Documentation/iostats.txt: bit-size reference etc. 2011-03-23 20:44:18 +01:00
IPMI.txt IPMI: Add the document description of ipmi_get_smi_info 2010-12-14 00:22:00 -05:00
IRQ-affinity.txt
IRQ.txt
irqflags-tracing.txt Fix common misspellings 2011-03-31 11:26:23 -03:00
isapnp.txt
java.txt
kernel-doc-nano-HOWTO.txt docbook: warn on unused doc entries 2010-09-11 16:49:21 -07:00
kernel-docs.txt Documentation: update kernel-docs.txt 2011-01-06 09:59:38 -08:00
kernel-parameters.txt Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6 2011-04-07 11:14:49 -07:00
keys-request-key.txt KEYS: Add a new keyctl op to reject a key with a specified error code 2011-03-08 11:17:18 +11:00
keys-trusted-encrypted.txt keys: add new trusted key-type 2010-11-29 08:55:25 +11:00
keys.txt KEYS: Add an iovec version of KEYCTL_INSTANTIATE 2011-03-08 11:17:22 +11:00
kmemcheck.txt kmemcheck: update documentation 2009-07-01 22:36:22 +02:00
kmemleak.txt Documentation: update kmemleak arch. info 2011-04-04 17:51:46 -07:00
kobject.txt kobject: documentation: Update to refer to kset-example.c. 2010-03-19 07:12:20 -07:00
kprobes.txt tree-wide: fix comment/printk typos 2010-11-01 15:38:34 -04:00
kref.txt kref: Fix typo in kref documentation 2011-03-07 13:20:05 -08:00
ldm.txt Documentation: update broken web addresses. 2010-08-04 15:21:40 +02:00
local_ops.txt
lockdep-design.txt lockdep: Fix typos in documentation 2009-08-07 12:03:46 +02:00
lockstat.txt lockstat: Add usage info to Documentation/lockstat.txt 2009-12-06 13:20:02 +01:00
logo.gif
logo.txt
magic-number.txt take coda-private headers out of include/linux 2011-01-12 20:02:48 -05:00
Makefile [media] Remove the old V4L1 v4lgrab.c file 2010-12-29 08:17:12 -02:00
ManagementStyle
mca.txt
md.txt md: Update documentation for sync_min and sync_max entries 2011-04-20 15:40:01 +10:00
media-framework.txt Fix common misspellings 2011-03-31 11:26:23 -03:00
memory-barriers.txt smp: Document transitivity for memory barriers. 2011-03-04 08:05:49 -08:00
memory-hotplug.txt memory hotplug: Allow memory blocks to span multiple memory sections 2011-02-03 16:08:57 -08:00
memory.txt Documentation/memory.txt: remove some very outdated recommendations 2009-09-22 07:17:26 -07:00
mono.txt
mutex-design.txt mutex: Fix annotations to include it in kernel-locking docbook 2010-09-03 08:19:51 +02:00
nmi_watchdog.txt
nommu-mmap.txt nommu: fix malloc performance by adding uninitialized flag 2009-12-15 08:53:24 -08:00
numastat.txt mm: fix NUMA accounting in numastat.txt 2009-09-22 07:17:39 -07:00
oops-tracing.txt panic: Add taint flag TAINT_FIRMWARE_WORKAROUND ('I') 2010-05-19 08:37:43 +01:00
padata.txt Documentation/padata.txt: fix typos etc. 2010-08-11 08:59:18 -07:00
parport-lowlevel.txt
parport.txt
pi-futex.txt
pnp.txt doc: capitalization and other minor fixes in pnp doc 2010-02-05 12:22:44 +01:00
preempt-locking.txt
printk-formats.txt
prio_tree.txt
rbtree.txt Documentation: remove anticipatory scheduler info 2010-11-11 12:09:59 +01:00
rfkill.txt Document the rfkill sysfs ABI 2010-03-10 17:09:33 -05:00
robust-futex-ABI.txt futex: documentation: fix inconsistent description of futex list_op_pending 2009-06-18 13:03:56 -07:00
robust-futexes.txt
rt-mutex-design.txt variable name fix to Documentation/rt-mutex-design.txt 2010-06-05 17:39:09 +02:00
rt-mutex.txt
rtc.txt RTC: Fix up rtc.txt documentation to reflect changes to generic rtc layer 2011-03-09 11:25:10 -08:00
SAK.txt
SecurityBugs Fix common misspellings 2011-03-31 11:26:23 -03:00
SELinux.txt
serial-console.txt
sgi-ioc4.txt
sgi-visws.txt
SM501.txt
Smack.txt Documentation/: it's -> its where appropriate 2010-04-23 02:09:52 +02:00
sparse.txt update email address 2010-07-19 10:56:54 +02:00
spinlocks.txt locking: Remove deprecated lock initializers 2011-01-27 12:30:38 +01:00
stable_api_nonsense.txt
stable_kernel_rules.txt Documentation: -stable rules: upstream commit ID requirement reworded 2010-04-22 15:24:56 -07:00
SubmitChecklist Documentation: update SubmitChecklist for O=objdir and kconfig testing 2010-05-24 07:31:20 -07:00
SubmittingDrivers Fix common misspellings 2011-03-31 11:26:23 -03:00
SubmittingPatches Fix common misspellings 2011-03-31 11:26:23 -03:00
svga.txt
sysfs-rules.txt Fix typos in comments 2010-03-16 11:47:56 +01:00
sysrq.txt documentation: update sysrq.txt magic sysrq keys 2010-10-26 17:32:41 -07:00
tomoyo.txt TOMOYO: Update version to 2.3.0 2010-08-02 15:35:10 +10:00
unaligned-memory-access.txt
unicode.txt
unshare.txt
VGA-softcursor.txt
vgaarbiter.txt vgaarbiter: fix a typo in the vgaarbiter Documentation 2009-12-16 11:28:58 -08:00
video-output.txt
volatile-considered-harmful.txt Documentation/volatile-considered-harmful.txt: correct cpu_relax() documentation 2010-03-24 16:31:20 -07:00
workqueue.txt workqueue: Document debugging tricks 2011-03-31 13:40:42 +02:00
xz.txt decompressors: add XZ decompressor module 2011-01-13 08:03:24 -08:00
zorro.txt