linux/kernel/sched
John Keeping 401e4963bf sched/core: Always flush pending blk_plug
With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
normal priority tasks (SCHED_OTHER, nice level zero):

	INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
	      Not tainted 5.15.49-rt46 #1
	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	task:kworker/u8:0    state:D stack:    0 pid:    8 ppid:     2 flags:0x00000000
	Workqueue: writeback wb_workfn (flush-7:0)
	[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
	[<c08a3d84>] (schedule) from [<c08a65a0>] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174)
	[<c08a65a0>] (rt_mutex_slowlock_block.constprop.0) from [<c08a6708>]
	+(rt_mutex_slowlock.constprop.0+0xac/0x174)
	[<c08a6708>] (rt_mutex_slowlock.constprop.0) from [<c0374d60>] (fat_write_inode+0x34/0x54)
	[<c0374d60>] (fat_write_inode) from [<c0297304>] (__writeback_single_inode+0x354/0x3ec)
	[<c0297304>] (__writeback_single_inode) from [<c0297998>] (writeback_sb_inodes+0x250/0x45c)
	[<c0297998>] (writeback_sb_inodes) from [<c0297c20>] (__writeback_inodes_wb+0x7c/0xb8)
	[<c0297c20>] (__writeback_inodes_wb) from [<c0297f24>] (wb_writeback+0x2c8/0x2e4)
	[<c0297f24>] (wb_writeback) from [<c0298c40>] (wb_workfn+0x1a4/0x3e4)
	[<c0298c40>] (wb_workfn) from [<c0138ab8>] (process_one_work+0x1fc/0x32c)
	[<c0138ab8>] (process_one_work) from [<c0139120>] (worker_thread+0x22c/0x2d8)
	[<c0139120>] (worker_thread) from [<c013e6e0>] (kthread+0x16c/0x178)
	[<c013e6e0>] (kthread) from [<c01000fc>] (ret_from_fork+0x14/0x38)
	Exception stack(0xc10e3fb0 to 0xc10e3ff8)
	3fa0:                                     00000000 00000000 00000000 00000000
	3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
	3fe0: 00000000 00000000 00000000 00000000 00000013 00000000

	INFO: task tar:2083 blocked for more than 491 seconds.
	      Not tainted 5.15.49-rt46 #1
	"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
	task:tar             state:D stack:    0 pid: 2083 ppid:  2082 flags:0x00000000
	[<c08a3a10>] (__schedule) from [<c08a3d84>] (schedule+0xdc/0x134)
	[<c08a3d84>] (schedule) from [<c08a41b0>] (io_schedule+0x14/0x24)
	[<c08a41b0>] (io_schedule) from [<c08a455c>] (bit_wait_io+0xc/0x30)
	[<c08a455c>] (bit_wait_io) from [<c08a441c>] (__wait_on_bit_lock+0x54/0xa8)
	[<c08a441c>] (__wait_on_bit_lock) from [<c08a44f4>] (out_of_line_wait_on_bit_lock+0x84/0xb0)
	[<c08a44f4>] (out_of_line_wait_on_bit_lock) from [<c0371fb0>] (fat_mirror_bhs+0xa0/0x144)
	[<c0371fb0>] (fat_mirror_bhs) from [<c0372a68>] (fat_alloc_clusters+0x138/0x2a4)
	[<c0372a68>] (fat_alloc_clusters) from [<c0370b14>] (fat_alloc_new_dir+0x34/0x250)
	[<c0370b14>] (fat_alloc_new_dir) from [<c03787c0>] (vfat_mkdir+0x58/0x148)
	[<c03787c0>] (vfat_mkdir) from [<c0277b60>] (vfs_mkdir+0x68/0x98)
	[<c0277b60>] (vfs_mkdir) from [<c027b484>] (do_mkdirat+0xb0/0xec)
	[<c027b484>] (do_mkdirat) from [<c0100060>] (ret_fast_syscall+0x0/0x1c)
	Exception stack(0xc2e1bfa8 to 0xc2e1bff0)
	bfa0:                   01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000
	bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190
	bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc

Here the kworker is waiting on msdos_sb_info::s_lock which is held by
tar which is in turn waiting for a buffer which is locked waiting to be
flushed, but this operation is plugged in the kworker.

The lock is a normal struct mutex, so tsk_is_pi_blocked() will always
return false on !RT and thus the behaviour changes for RT.

It seems that the intent here is to skip blk_flush_plug() in the case
where a non-preemptible lock (such as a spinlock) has been converted to
a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT
schedule flag.  But sched_submit_work() is only called from schedule()
which is never called in this scenario, so the check can simply be
deleted.

Looking at the history of the -rt patchset, in fact this change was
present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part
of a larger patch [1] most of which was replaced by commit b4bfa3fcfe
("sched/core: Rework the __schedule() preempt argument").

As described in [1]:

   The schedule process must distinguish between blocking on a regular
   sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock
   and rwlock):
   - rwsem and mutex must flush block requests (blk_schedule_flush_plug())
     even if blocked on a lock. This can not deadlock because this also
     happens for non-RT.
     There should be a warning if the scheduling point is within a RCU read
     section.

   - spinlock and rwlock must not flush block requests. This will deadlock
     if the callback attempts to acquire a lock which is already acquired.
     Similarly to being preempted, there should be no warning if the
     scheduling point is within a RCU read section.

and with the tsk_is_pi_blocked() in the scheduler path, we hit the first
issue.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches

Signed-off-by: John Keeping <john@metanate.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20220708162702.1758865-1-john@metanate.com
2022-07-13 11:29:17 +02:00
..
autogroup.c sched/autogroup: Fix sysctl move 2022-05-30 12:36:36 +02:00
autogroup.h sched/headers: Add header guard to kernel/sched/stats.h and kernel/sched/autogroup.h 2022-02-23 08:22:00 +01:00
build_policy.c sched: Fix missing prototype warnings 2022-05-01 10:03:43 +02:00
build_utility.c sched: Fix missing prototype warnings 2022-05-01 10:03:43 +02:00
clock.c sched/clock: Use try_cmpxchg64 in sched_clock_{local,remote} 2022-05-19 23:46:09 +02:00
completion.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
core_sched.c sched/core: add forced idle accounting for cgroups 2022-07-04 09:23:07 +02:00
core.c sched/core: Always flush pending blk_plug 2022-07-13 11:29:17 +02:00
cpuacct.c Merge branch 'sched/fast-headers' into sched/core 2022-03-15 09:05:05 +01:00
cpudeadline.c sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
cpudeadline.h
cpufreq_schedutil.c sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util() 2022-06-28 09:17:46 +02:00
cpufreq.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
cpupri.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
cpupri.h sched/cpupri: Add CPUPRI_HIGHER 2020-10-29 11:00:30 +01:00
cputime.c sched/core: add forced idle accounting for cgroups 2022-07-04 09:23:07 +02:00
deadline.c sched/deadline: Use proc_douintvec_minmax() limit minimum value 2022-06-13 10:30:00 +02:00
debug.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
fair.c sched/fair: fix case with reduced capacity CPU 2022-07-13 11:29:17 +02:00
features.h sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg 2022-06-28 09:08:30 +02:00
idle.c Scheduler changes in this cycle were: 2022-05-24 11:11:13 -07:00
isolation.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
loadavg.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
Makefile sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
membarrier.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
pelt.c sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
pelt.h sched/fair: Decay task PELT values during wakeup migration 2022-06-28 09:17:46 +02:00
psi.c sched/psi: report zeroes for CPU full at the system level 2022-04-22 12:14:08 +02:00
rt.c sysctl changes for v5.19-rc1 2022-05-26 16:57:20 -07:00
sched-pelt.h sched/fair: Fix "runnable_avg_yN_inv" not used warnings 2019-06-17 12:15:58 +02:00
sched.h sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util() 2022-06-28 09:17:46 +02:00
smp.h smp: Rename flush_smp_call_function_from_idle() 2022-05-01 10:03:43 +02:00
stats.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
stats.h sched/headers: Reorganize, clean up and optimize kernel/sched/sched.h dependencies 2022-02-23 10:58:34 +01:00
stop_task.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
swait.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
topology.c sched/numa: Adjust imb_numa_nr to a better approximation of memory channels 2022-06-13 10:30:00 +02:00
wait_bit.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
wait.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00