linux/kernel/sched
Munehisa Kamata c2dbe32d5d sched/psi: Fix use-after-free in ep_remove_wait_queue()
If a non-root cgroup gets removed when there is a thread that registered
trigger and is polling on a pressure file within the cgroup, the polling
waitqueue gets freed in the following path:

 do_rmdir
   cgroup_rmdir
     kernfs_drain_open_files
       cgroup_file_release
         cgroup_pressure_release
           psi_trigger_destroy

However, the polling thread still has a reference to the pressure file and
will access the freed waitqueue when the file is closed or upon exit:

 fput
   ep_eventpoll_release
     ep_free
       ep_remove_wait_queue
         remove_wait_queue

This results in use-after-free as pasted below.

The fundamental problem here is that cgroup_file_release() (and
consequently waitqueue's lifetime) is not tied to the file's real lifetime.
Using wake_up_pollfree() here might be less than ideal, but it is in line
with the comment at commit 42288cb44c ("wait: add wake_up_pollfree()")
since the waitqueue's lifetime is not tied to file's one and can be
considered as another special case. While this would be fixable by somehow
making cgroup_file_release() be tied to the fput(), it would require
sizable refactoring at cgroups or higher layer which might be more
justifiable if we identify more cases like this.

  BUG: KASAN: use-after-free in _raw_spin_lock_irqsave+0x60/0xc0
  Write of size 4 at addr ffff88810e625328 by task a.out/4404

	CPU: 19 PID: 4404 Comm: a.out Not tainted 6.2.0-rc6 #38
	Hardware name: Amazon EC2 c5a.8xlarge/, BIOS 1.0 10/16/2017
	Call Trace:
	<TASK>
	dump_stack_lvl+0x73/0xa0
	print_report+0x16c/0x4e0
	kasan_report+0xc3/0xf0
	kasan_check_range+0x2d2/0x310
	_raw_spin_lock_irqsave+0x60/0xc0
	remove_wait_queue+0x1a/0xa0
	ep_free+0x12c/0x170
	ep_eventpoll_release+0x26/0x30
	__fput+0x202/0x400
	task_work_run+0x11d/0x170
	do_exit+0x495/0x1130
	do_group_exit+0x100/0x100
	get_signal+0xd67/0xde0
	arch_do_signal_or_restart+0x2a/0x2b0
	exit_to_user_mode_prepare+0x94/0x100
	syscall_exit_to_user_mode+0x20/0x40
	do_syscall_64+0x52/0x90
	entry_SYSCALL_64_after_hwframe+0x63/0xcd
	</TASK>

 Allocated by task 4404:

	kasan_set_track+0x3d/0x60
	__kasan_kmalloc+0x85/0x90
	psi_trigger_create+0x113/0x3e0
	pressure_write+0x146/0x2e0
	cgroup_file_write+0x11c/0x250
	kernfs_fop_write_iter+0x186/0x220
	vfs_write+0x3d8/0x5c0
	ksys_write+0x90/0x110
	do_syscall_64+0x43/0x90
	entry_SYSCALL_64_after_hwframe+0x63/0xcd

 Freed by task 4407:

	kasan_set_track+0x3d/0x60
	kasan_save_free_info+0x27/0x40
	____kasan_slab_free+0x11d/0x170
	slab_free_freelist_hook+0x87/0x150
	__kmem_cache_free+0xcb/0x180
	psi_trigger_destroy+0x2e8/0x310
	cgroup_file_release+0x4f/0xb0
	kernfs_drain_open_files+0x165/0x1f0
	kernfs_drain+0x162/0x1a0
	__kernfs_remove+0x1fb/0x310
	kernfs_remove_by_name_ns+0x95/0xe0
	cgroup_addrm_files+0x67f/0x700
	cgroup_destroy_locked+0x283/0x3c0
	cgroup_rmdir+0x29/0x100
	kernfs_iop_rmdir+0xd1/0x140
	vfs_rmdir+0xfe/0x240
	do_rmdir+0x13d/0x280
	__x64_sys_rmdir+0x2c/0x30
	do_syscall_64+0x43/0x90
	entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: 0e94682b73 ("psi: introduce psi monitor")
Signed-off-by: Munehisa Kamata <kamatam@amazon.com>
Signed-off-by: Mengchi Cheng <mengcc@amazon.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/20230106224859.4123476-1-kamatam@amazon.com/
Link: https://lore.kernel.org/r/20230214212705.4058045-1-kamatam@amazon.com
2023-02-15 14:19:16 +01:00
..
autogroup.c sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE() 2022-08-12 11:25:10 +02:00
autogroup.h sched/headers: Add header guard to kernel/sched/stats.h and kernel/sched/autogroup.h 2022-02-23 08:22:00 +01:00
build_policy.c sched: Fix missing prototype warnings 2022-05-01 10:03:43 +02:00
build_utility.c sched: Fix missing prototype warnings 2022-05-01 10:03:43 +02:00
clock.c sched/clock: Use try_cmpxchg64 in sched_clock_{local,remote} 2022-05-19 23:46:09 +02:00
completion.c sched/completion: Add wait_for_completion_state() 2022-09-07 21:53:49 +02:00
core_sched.c sched: Rename task_running() to task_on_cpu() 2022-09-07 21:53:47 +02:00
core.c sched/core: Fix a missed update of user_cpus_ptr 2023-02-13 16:36:14 +01:00
cpuacct.c Merge branch 'sched/fast-headers' into sched/core 2022-03-15 09:05:05 +01:00
cpudeadline.c sched/core: Introduce sched_asym_cpucap_active() 2022-08-02 12:32:45 +02:00
cpudeadline.h
cpufreq_schedutil.c Revert "cpufreq: schedutil: Move max CPU capacity to sugov_policy" 2022-11-22 19:56:52 +01:00
cpufreq.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
cpupri.c sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE() 2022-08-12 11:25:10 +02:00
cpupri.h sched/cpupri: Add CPUPRI_HIGHER 2020-10-29 11:00:30 +01:00
cputime.c sched/core: add forced idle accounting for cgroups 2022-07-04 09:23:07 +02:00
deadline.c sched: Introduce affinity_context 2022-10-27 11:01:21 +02:00
debug.c - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in 2022-10-10 17:53:04 -07:00
fair.c sched/fair: Fixes for capacity inversion detection 2023-01-13 11:40:21 +01:00
features.h sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg 2022-06-28 09:08:30 +02:00
idle.c context_tracking: Take idle eqs entrypoints over RCU 2022-07-05 13:32:16 -07:00
isolation.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
loadavg.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
Makefile sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
membarrier.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
pelt.c sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
pelt.h sched/fair: Decay task PELT values during wakeup migration 2022-06-28 09:17:46 +02:00
psi.c sched/psi: Fix use-after-free in ep_remove_wait_queue() 2023-02-15 14:19:16 +01:00
rt.c sched: Introduce struct balance_callback to avoid CFI mismatches 2022-10-17 16:41:25 +02:00
sched-pelt.h sched/fair: Fix "runnable_avg_yN_inv" not used warnings 2019-06-17 12:15:58 +02:00
sched.h sched: Enforce user requested affinity 2022-10-27 11:01:22 +02:00
smp.h smp: Rename flush_smp_call_function_from_idle() 2022-05-01 10:03:43 +02:00
stats.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
stats.h sched/psi: Use task->psi_flags to clear in CPU migration 2022-10-30 10:12:15 +01:00
stop_task.c sched: Add update_current_exec_runtime helper 2022-08-27 00:05:35 +02:00
swait.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
topology.c sched/numa: Adjust imb_numa_nr to a better approximation of memory channels 2022-06-13 10:30:00 +02:00
wait_bit.c wait_on_bit: add an acquire memory barrier 2022-08-26 09:30:25 -07:00
wait.c wait: Return number of exclusive waiters awaken 2022-11-16 11:33:03 -07:00