linux/Documentation
Neeraj Upadhyay 683954e55c rcu: Check and report missed fqs timer wakeup on RCU stall
For a new grace period request, the RCU GP kthread transitions through
following states:

a. [RCU_GP_WAIT_GPS] -> [RCU_GP_DONE_GPS]

The RCU_GP_WAIT_GPS state is where the GP kthread waits for a request
for a new GP.  Once it receives a request (for example, when a new RCU
callback is queued), the GP kthread transitions to RCU_GP_DONE_GPS.

b. [RCU_GP_DONE_GPS] -> [RCU_GP_ONOFF]

Grace period initialization starts in rcu_gp_init(), which records the
start of new GP in rcu_state.gp_seq and transitions to RCU_GP_ONOFF.

c. [RCU_GP_ONOFF] -> [RCU_GP_INIT]

The purpose of the RCU_GP_ONOFF state is to apply the online/offline
information that was buffered for any CPUs that recently came online or
went offline.  This state is maintained in per-leaf rcu_node bitmasks,
with the buffered state in ->qsmaskinitnext and the state for the upcoming
GP in ->qsmaskinit.  At the end of this RCU_GP_ONOFF state, each bit in
->qsmaskinit will correspond to a CPU that must pass through a quiescent
state before the upcoming grace period is allowed to complete.

However, a leaf rcu_node structure with an all-zeroes ->qsmaskinit
cannot necessarily be ignored.  In preemptible RCU, there might well be
tasks still in RCU read-side critical sections that were first preempted
while running on one of the CPUs managed by this structure.  Such tasks
will be queued on this structure's ->blkd_tasks list.  Only after this
list fully drains can this leaf rcu_node structure be ignored, and even
then only if none of its CPUs have come back online in the meantime.
Once that happens, the ->qsmaskinit masks further up the tree will be
updated to exclude this leaf rcu_node structure.

Once the ->qsmaskinitnext and ->qsmaskinit fields have been updated
as needed, the GP kthread transitions to RCU_GP_INIT.

d. [RCU_GP_INIT] -> [RCU_GP_WAIT_FQS]

The purpose of the RCU_GP_INIT state is to copy each ->qsmaskinit to
the ->qsmask field within each rcu_node structure.  This copying is done
breadth-first from the root to the leaves.  Why not just copy directly
from ->qsmaskinitnext to ->qsmask?  Because the ->qsmaskinitnext masks
can change in the meantime as additional CPUs come online or go offline.
Such changes would result in inconsistencies in the ->qsmask fields up and
down the tree, which could in turn result in too-short grace periods or
grace-period hangs.  These issues are avoided by snapshotting the leaf
rcu_node structures' ->qsmaskinitnext fields into their ->qsmaskinit
counterparts, generating a consistent set of ->qsmaskinit fields
throughout the tree, and only then copying these consistent ->qsmaskinit
fields to their ->qsmask counterparts.

Once this initialization step is complete, the GP kthread transitions
to RCU_GP_WAIT_FQS, where it waits to do a force-quiescent-state scan
on the one hand or for the end of the grace period on the other.

e. [RCU_GP_WAIT_FQS] -> [RCU_GP_DOING_FQS]

The RCU_GP_WAIT_FQS state waits for one of three things:  (1) An
explicit request to do a force-quiescent-state scan, (2) The end of
the grace period, or (3) A short interval of time, after which it
will do a force-quiescent-state (FQS) scan.  The explicit request can
come from rcutorture or from any CPU that has too many RCU callbacks
queued (see the qhimark kernel parameter and the RCU_GP_FLAG_OVLD
flag).  The aforementioned "short period of time" is specified by the
jiffies_till_first_fqs boot parameter for a given grace period's first
FQS scan and by the jiffies_till_next_fqs for later FQS scans.

Either way, once the wait is over, the GP kthread transitions to
RCU_GP_DOING_FQS.

f. [RCU_GP_DOING_FQS] -> [RCU_GP_CLEANUP]

The RCU_GP_DOING_FQS state performs an FQS scan.  Each such scan carries
out two functions for any CPU whose bit is still set in its leaf rcu_node
structure's ->qsmask field, that is, for any CPU that has not yet reported
a quiescent state for the current grace period:

  i.  Report quiescent states on behalf of CPUs that have been observed
      to be idle (from an RCU perspective) since the beginning of the
      grace period.

  ii. If the current grace period is too old, take various actions to
      encourage holdout CPUs to pass through quiescent states, including
      enlisting the aid of any calls to cond_resched() and might_sleep(),
      and even including IPIing the holdout CPUs.

These checks are skipped for any leaf rcu_node structure with a all-zero
->qsmask field, however such structures are subject to RCU priority
boosting if there are tasks on a given structure blocking the current
grace period.  The end of the grace period is detected when the root
rcu_node structure's ->qsmask is zero and when there are no longer any
preempted tasks blocking the current grace period.  (No, this last check
is not redundant.  To see this, consider an rcu_node tree having exactly
one structure that serves as both root and leaf.)

Once the end of the grace period is detected, the GP kthread transitions
to RCU_GP_CLEANUP.

g. [RCU_GP_CLEANUP] -> [RCU_GP_CLEANED]

The RCU_GP_CLEANUP state marks the end of grace period by updating the
rcu_state structure's ->gp_seq field and also all rcu_node structures'
->gp_seq field.  As before, the rcu_node tree is traversed in breadth
first order.  Once this update is complete, the GP kthread transitions
to the RCU_GP_CLEANED state.

i. [RCU_GP_CLEANED] -> [RCU_GP_INIT]

Once in the RCU_GP_CLEANED state, the GP kthread immediately transitions
into the RCU_GP_INIT state.

j. The role of timers.

If there is at least one idle CPU, and if timers are not firing, the
transition from RCU_GP_DOING_FQS to RCU_GP_CLEANUP will never happen.
Timers can fail to fire for a number of reasons, including issues in
timer configuration, issues in the timer framework, and failure to handle
softirqs (for example, when there is a storm of interrupts).  Whatever the
reason, if the timers fail to fire, the GP kthread will never be awakened,
resulting in RCU CPU stall warnings and eventually in OOM.

However, an RCU CPU stall warning has a large number of potential causes,
as documented in Documentation/RCU/stallwarn.rst.  This commit therefore
adds analysis to the RCU CPU stall-warning code to emit an additional
message if the cause of the stall is likely to be timer failure.

Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2021-01-06 16:54:11 -08:00
..
ABI More power management updates for 5.11-rc1 2020-12-22 14:12:10 -08:00
accounting
admin-guide A small set of late-arriving, small documentation fixes. 2020-12-24 14:20:33 -08:00
arm ARM updates for 5.11: 2020-12-22 13:34:27 -08:00
arm64 ARM: 2020-12-20 10:44:05 -08:00
block
bpf
cdrom
core-api Generic interrupt and irqchips subsystem: 2020-12-15 15:03:31 -08:00
cpu-freq
crypto
dev-tools Merge branch 'akpm' (patches from Andrew) 2020-12-22 13:38:17 -08:00
devicetree Devicetree fixes for v5.11, take 1: 2020-12-24 12:09:48 -08:00
doc-guide kernel-doc: Fix example in Nested structs/unions 2020-12-08 10:23:17 -07:00
driver-api RTC for 5.11 2020-12-20 10:12:06 -08:00
fault-injection
fb
features ARM updates for 5.11: 2020-12-22 13:34:27 -08:00
filesystems Various bug fixes and cleanups for ext4; no new features this cycle. 2020-12-24 14:16:02 -08:00
firmware_class
firmware-guide Documentation: ACPI: enumeration: add PCI hierarchy representation 2020-11-23 17:59:49 +01:00
fpga
gpu drm/docs: Fix todo.rst 2020-11-18 11:51:58 +01:00
hid Merge branch 'for-5.11/core' into for-linus 2020-12-16 11:38:38 +01:00
hwmon hwmon: (sbtsi) Add documentation 2020-12-12 08:34:29 -08:00
i2c
ia64 docs: archis: add a per-architecture features list 2020-12-03 15:10:15 -07:00
ide
iio
infiniband
input Input: document inhibiting 2020-12-02 22:10:37 -08:00
isdn
kbuild Kconfig updates for v5.11 2020-12-22 14:04:25 -08:00
kernel-hacking
leds Changes for 5.11-rc1. Small cleanups/fixes mostly thanks to Marek, 2020-12-16 14:56:29 -08:00
litmus-tests
livepatch
locking Documentation: seqlock: s/LOCKTYPE/LOCKNAME/g 2020-12-09 17:08:49 +01:00
m68k docs: archis: add a per-architecture features list 2020-12-03 15:10:15 -07:00
maintainer
mhi
mips docs: archis: add a per-architecture features list 2020-12-03 15:10:15 -07:00
misc-devices Documentation: remove mic/index from misc-devices/index.rst 2020-11-04 11:38:32 +01:00
netlabel
networking ARM: SoC drivers for v5.11 2020-12-16 16:38:41 -08:00
nios2 docs: nios2: add missing ReST file 2020-12-07 08:35:21 -07:00
nvdimm
openrisc docs: archis: add a per-architecture features list 2020-12-03 15:10:15 -07:00
parisc docs: archis: add a per-architecture features list 2020-12-03 15:10:15 -07:00
PCI
pcmcia
power PM: EM: Update Energy Model with new flag indicating power scale 2020-11-10 20:29:28 +01:00
powerpc docs: archis: add a per-architecture features list 2020-12-03 15:10:15 -07:00
process A small set of late-arriving, small documentation fixes. 2020-12-24 14:20:33 -08:00
RCU rcu: Check and report missed fqs timer wakeup on RCU stall 2021-01-06 16:54:11 -08:00
riscv docs: archis: add a per-architecture features list 2020-12-03 15:10:15 -07:00
s390 docs: archis: add a per-architecture features list 2020-12-03 15:10:15 -07:00
scheduler Power management updates for 5.11-rc1 2020-12-15 16:30:31 -08:00
scsi
security
sh docs: archis: add a per-architecture features list 2020-12-03 15:10:15 -07:00
sound ALSA: usb-audio: Add implicit_fb module option 2020-11-23 15:17:24 +01:00
sparc docs: archis: add a per-architecture features list 2020-12-03 15:10:15 -07:00
sphinx Kbuild updates for v5.11 2020-12-22 14:02:39 -08:00
sphinx-static
spi
staging
target tweewide: Fix most Shebang lines 2020-12-08 23:30:04 +09:00
timers
trace Kbuild updates for v5.11 2020-12-22 14:02:39 -08:00
translations Networking updates for 5.11 2020-12-15 13:22:29 -08:00
usb
userspace-api "Intel SGX is new hardware functionality that can be used by 2020-12-14 13:14:57 -08:00
virt ARM: 2020-12-20 10:44:05 -08:00
vm mm/lru: revise the comments of lru_lock 2020-12-15 14:48:04 -08:00
w1 w1: w1_therm: Rename conflicting sysfs attribute 'eeprom' to 'eeprom_cmd' 2020-11-12 08:50:13 +01:00
watchdog
x86 A much quieter cycle for documentation (happily), with, one hopes, the bulk 2020-12-14 16:55:54 -08:00
xtensa A much quieter cycle for documentation (happily), with, one hopes, the bulk 2020-12-14 16:55:54 -08:00
.gitignore
asm-annotations.rst
atomic_bitops.txt
atomic_t.txt
Changes
CodingStyle
conf.py docs: Note that sphinx 1.7 will be required soon 2020-12-11 13:53:38 -07:00
COPYING-logo
docutils.conf
dontdiff
index.rst docs: archis: add a per-architecture features list 2020-12-03 15:10:15 -07:00
Kconfig
logo.gif
Makefile
memory-barriers.txt docs/memory-barriers.txt: Fix a typo in CPU MEMORY BARRIERS section 2020-11-06 17:24:51 -08:00
SubmittingPatches
watch_queue.rst