rcu: Make rcu_read_unlock_special() safe for rq/pi locks
The scheduler is currently required to hold rq/pi locks across the entire
RCU read-side critical section or not at all. This is inconvenient and
leaves traps for the unwary, including the author of this commit.

But now that excessively long grace periods enable scheduling-clock
interrupts for holdout nohz_full CPUs, the nohz_full rescue logic in
rcu_read_unlock_special() can be dispensed with. In other words, the
rcu_read_unlock_special() function can refrain from doing wakeups unless
such wakeups are guaranteed safe.

This commit therefore avoids unsafe wakeups, freeing the scheduler to hold
rq/pi locks across rcu_read_unlock() even if the corresponding RCU
read-side critical section might have been preempted. This commit also
updates RCU's requirements documentation.

This commit is inspired by a patch from Lai Jiangshan:
https://lore.kernel.org/lkml/20191102124559.1135-2-laijs@linux.alibaba.com

This commit is further intended to be a step towards his goal of permitting
the inlining of RCU-preempt's rcu_read_lock() and rcu_read_unlock().

Cc: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
@@ -1943,56 +1943,27 @@ invoked from a CPU-hotplug notifier.
 Scheduler and RCU
 ~~~~~~~~~~~~~~~~~
 
-RCU depends on the scheduler, and the scheduler uses RCU to protect some
-of its data structures. The preemptible-RCU ``rcu_read_unlock()``
-implementation must therefore be written carefully to avoid deadlocks
-involving the scheduler's runqueue and priority-inheritance locks. In
-particular, ``rcu_read_unlock()`` must tolerate an interrupt where the
-interrupt handler invokes both ``rcu_read_lock()`` and
-``rcu_read_unlock()``. This possibility requires ``rcu_read_unlock()``
-to use negative nesting levels to avoid destructive recursion via
-interrupt handler's use of RCU.
-
-This scheduler-RCU requirement came as a `complete
-surprise <https://lwn.net/Articles/453002/>`__.
-
-As noted above, RCU makes use of kthreads, and it is necessary to avoid
-excessive CPU-time accumulation by these kthreads. This requirement was
-no surprise, but RCU's violation of it when running context-switch-heavy
-workloads when built with ``CONFIG_NO_HZ_FULL=y`` `did come as a
-surprise
+RCU makes use of kthreads, and it is necessary to avoid excessive CPU-time
+accumulation by these kthreads. This requirement was no surprise, but
+RCU's violation of it when running context-switch-heavy workloads when
+built with ``CONFIG_NO_HZ_FULL=y`` `did come as a surprise
 [PDF] <http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf>`__.
 RCU has made good progress towards meeting this requirement, even for
 context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is
 room for further improvement.
 
-It is forbidden to hold any of scheduler's runqueue or
-priority-inheritance spinlocks across an ``rcu_read_unlock()`` unless
-interrupts have been disabled across the entire RCU read-side critical
-section, that is, up to and including the matching ``rcu_read_lock()``.
-Violating this restriction can result in deadlocks involving these
-scheduler spinlocks. There was hope that this restriction might be
-lifted when interrupt-disabled calls to ``rcu_read_unlock()`` started
-deferring the reporting of the resulting RCU-preempt quiescent state
-until the end of the corresponding interrupts-disabled region.
-Unfortunately, timely reporting of the corresponding quiescent state to
-expedited grace periods requires a call to ``raise_softirq()``, which
-can acquire these scheduler spinlocks. In addition, real-time systems
-using RCU priority boosting need this restriction to remain in effect
-because deferred quiescent-state reporting would also defer deboosting,
-which in turn would degrade real-time latencies.
-
-In theory, if a given RCU read-side critical section could be guaranteed
-to be less than one second in duration, holding a scheduler spinlock
-across that critical section's ``rcu_read_unlock()`` would require only
-that preemption be disabled across the entire RCU read-side critical
-section, not interrupts. Unfortunately, given the possibility of vCPU
-preemption, long-running interrupts, and so on, it is not possible in
-practice to guarantee that a given RCU read-side critical section will
-complete in less than one second. Therefore, as noted above, if
-scheduler spinlocks are held across a given call to
-``rcu_read_unlock()``, interrupts must be disabled across the entire RCU
-read-side critical section.
+There is no longer any prohibition against holding any of
+scheduler's runqueue or priority-inheritance spinlocks across an
+``rcu_read_unlock()``, even if interrupts and preemption were enabled
+somewhere within the corresponding RCU read-side critical section.
+Therefore, it is now perfectly legal to execute ``rcu_read_lock()``
+with preemption enabled, acquire one of the scheduler locks, and hold
+that lock across the matching ``rcu_read_unlock()``.
+
+Similarly, the RCU flavor consolidation has removed the need for negative
+nesting. The fact that interrupt-disabled regions of code act as RCU
+read-side critical sections implicitly avoids earlier issues that used
+to result in destructive recursion via interrupt handler's use of RCU.
 
 Tracing and RCU
 ~~~~~~~~~~~~~~~
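The pattern described by the updated documentation can be illustrated with a
minimal sketch that is not part of this patch; demo_lock and demo_reader()
are hypothetical names, with an ordinary raw spinlock standing in for a
scheduler runqueue or priority-inheritance lock (only the scheduler itself
takes the real ones):

#include <linux/rcupdate.h>
#include <linux/spinlock.h>

/* Hypothetical stand-in for a scheduler rq/pi lock. */
static DEFINE_RAW_SPINLOCK(demo_lock);

static void demo_reader(void)
{
        unsigned long flags;

        rcu_read_lock();                /* Preemption may be enabled here. */
        /* ... the reader may even be preempted in this region ... */
        raw_spin_lock_irqsave(&demo_lock, flags);
        rcu_read_unlock();              /* Now legal while holding the lock. */
        raw_spin_unlock_irqrestore(&demo_lock, flags);
}

Before this change, such a sequence was permitted only if interrupts were
disabled across the entire critical section.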
@@ -615,19 +615,18 @@ static void rcu_read_unlock_special(struct task_struct *t)
        struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
        struct rcu_node *rnp = rdp->mynode;
 
-       exp = (t->rcu_blocked_node && t->rcu_blocked_node->exp_tasks) ||
-             (rdp->grpmask & READ_ONCE(rnp->expmask)) ||
-             tick_nohz_full_cpu(rdp->cpu);
+       exp = (t->rcu_blocked_node &&
+              READ_ONCE(t->rcu_blocked_node->exp_tasks)) ||
+             (rdp->grpmask & READ_ONCE(rnp->expmask));
        // Need to defer quiescent state until everything is enabled.
-       if (irqs_were_disabled && use_softirq &&
-           (in_interrupt() ||
-            (exp && !t->rcu_read_unlock_special.b.deferred_qs))) {
-               // Using softirq, safe to awaken, and we get
-               // no help from enabling irqs, unlike bh/preempt.
+       if (use_softirq && (in_irq() || (exp && !irqs_were_disabled))) {
+               // Using softirq, safe to awaken, and either the
+               // wakeup is free or there is an expedited GP.
                raise_softirq_irqoff(RCU_SOFTIRQ);
        } else {
                // Enabling BH or preempt does reschedule, so...
-               // Also if no expediting or NO_HZ_FULL, slow is OK.
+               // Also if no expediting, slow is OK.
+               // Plus nohz_full CPUs eventually get tick enabled.
                set_tsk_need_resched(current);
                set_preempt_need_resched();
                if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&
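For reference, the new wakeup-safety test in the hunk above can be restated
as a standalone predicate. This is a plain-C paraphrase with hypothetical
parameter names, not code from the patch:

#include <stdbool.h>

/*
 * Paraphrase of the updated condition in rcu_read_unlock_special():
 * raise RCU_SOFTIRQ immediately only when softirq is in use and either
 * we are already in a hard interrupt handler (so raising it is free) or
 * an expedited grace period needs this quiescent state and interrupts
 * were enabled at rcu_read_unlock() time, in which case the caller
 * cannot be holding an irq-disabled scheduler lock and a wakeup is
 * considered safe.  Otherwise the quiescent state is deferred.
 */
static bool raise_softirq_now(bool use_softirq, bool in_hardirq,
                              bool exp_gp_needed, bool irqs_were_disabled)
{
        return use_softirq &&
               (in_hardirq || (exp_gp_needed && !irqs_were_disabled));
}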