linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-22 20:22:09 +00:00


Mel Gorman f169c62ff7 sched/numa: Complete scanning of inactive VMAs when there is no alternative

VMAs are skipped if there is no recent fault activity but this represents
a chicken-and-egg problem as there may be no fault activity if the PTEs
are never updated to trap NUMA hints. There is an indirect reliance on
scanning to be forced early in the lifetime of a task but this may fail
to detect changes in phase behaviour. Force inactive VMAs to be scanned
when all other eligible VMAs have been updated within the same scan
sequence.

Test results in general look good with some changes in performance, both
negative and positive, depending on whether the additional scanning and
faulting was beneficial or not to the workload. The autonuma benchmark
workload NUMA01_THREADLOCAL was picked for closer examination. The workload
creates two processes with numerous threads and thread-local storage that
is zero-filled in a loop. It exercises the corner case where unrelated
threads may skip VMAs that are thread-local to another thread and still
has some VMAs that are inactive while the workload executes.

The VMA skipping activity frequency with and without the patch:

  6.6.0-rc2-sched-numabtrace-v1
  =============================
        649 reason=scan_delay
      9,094 reason=unsuitable
     48,915 reason=shared_ro
    143,919 reason=inaccessible
    193,050 reason=pid_inactive

  6.6.0-rc2-sched-numabselective-v1
  =============================
        146 reason=seq_completed
        622 reason=ignore_pid_inactive
        624 reason=scan_delay
      6,570 reason=unsuitable
     16,101 reason=shared_ro
     27,608 reason=inaccessible
     41,939 reason=pid_inactive

Note that with the patch applied, the PID activity is ignored
(ignore_pid_inactive) to ensure a VMA with some activity is completely
scanned. In addition, a small number of VMAs are scanned when no other
eligible VMA is available during a single scan window (seq_completed).
The number of times a VMA is skipped due to no PID activity from the
scanning task (pid_inactive) drops dramatically. It is expected that
this will increase the number of PTEs updated for NUMA hinting faults
as well as hinting faults but these represent PTEs that would otherwise
have been missed. The tradeoff is scan+fault overhead versus improving
locality due to migration.

On a 2-socket Cascade Lake test machine, the time to complete the
workload is as follows;

                                             6.6.0-rc2                6.6.0-rc2
                                   sched-numabtrace-v1  sched-numabselective-v1
  Min       elsp-NUMA01_THREADLOCAL      174.22 (   0.00%)      117.64 (  32.48%)
  Amean     elsp-NUMA01_THREADLOCAL      175.68 (   0.00%)      123.34 *  29.79%*
  Stddev    elsp-NUMA01_THREADLOCAL        1.20 (   0.00%)        4.06 (-238.20%)
  CoeffVar  elsp-NUMA01_THREADLOCAL        0.68 (   0.00%)        3.29 (-381.70%)
  Max       elsp-NUMA01_THREADLOCAL      177.18 (   0.00%)      128.03 (  27.74%)

The time to complete the workload is reduced by almost 30%:

                               6.6.0-rc2                6.6.0-rc2
                     sched-numabtrace-v1  sched-numabselective-v1
  Duration User                 91201.80                 63506.64
  Duration System                2015.53                  1819.78
  Duration Elapsed               1234.77                   868.37

In this specific case, system CPU time was not increased but it's not
universally true.

From vmstat, the NUMA scanning and fault activity is as follows;

                                                6.6.0-rc2                6.6.0-rc2
                                      sched-numabtrace-v1  sched-numabselective-v1
  Ops NUMA base-page range updates               64272.00              26374386.00
  Ops NUMA PTE updates                           36624.00                 55538.00
  Ops NUMA PMD updates                              54.00                 51404.00
  Ops NUMA hint faults                           15504.00                 75786.00
  Ops NUMA hint local faults %                   14860.00                 56763.00
  Ops NUMA hint local percent                       95.85                    74.90
  Ops NUMA pages migrated                         1629.00               6469222.00

Both the number of PTE updates and hint faults is dramatically
increased. While this is superficially unfortunate, it represents
ranges that were simply skipped without the patch. As a result of
the scanning and hinting faults, many more pages were also migrated
but as the time to completion is reduced, the overhead is offset by
the gain.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Link: https://lore.kernel.org/r/20231010083143.19593-7-mgorman@techsingularity.net

2023-10-10 23:42:15 +02:00
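
The skip-then-force behaviour described in the commit message can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the code in fair.c: struct toy_vma, toy_scan_sequence() and the forced_scans counter are hypothetical names invented for the example, and the model ignores the per-pass scan budget, scan delays and the other skip reasons listed above. It shows one scan sequence that first skips VMAs with no recent activity from the scanning task, then retries with the activity check disabled once every other eligible VMA has already been covered in the same sequence.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA; not the kernel's struct vm_area_struct. */
struct toy_vma {
    bool accessed_by_task;  /* recent fault activity from the scanning task */
    int prev_scan_seq;      /* last scan sequence that covered this VMA */
};

/*
 * Model of one scan sequence (assumption: no scan budget, so a pass always
 * reaches the end of the VMA list). Returns the number of VMAs scanned and
 * stores in *forced_scans how many were scanned only because the activity
 * check was ignored.
 */
static size_t toy_scan_sequence(struct toy_vma *vmas, size_t n, int seq,
                                size_t *forced_scans)
{
    size_t scanned = 0;
    bool forced = false;

    *forced_scans = 0;

    for (;;) {
        bool skipped_inactive = false;

        for (size_t i = 0; i < n; i++) {
            struct toy_vma *vma = &vmas[i];

            /* Already covered in this sequence: never scan twice. */
            if (vma->prev_scan_seq == seq)
                continue;

            /* Skip inactive VMAs unless the scan is being forced. */
            if (!forced && !vma->accessed_by_task) {
                skipped_inactive = true;
                continue;
            }

            vma->prev_scan_seq = seq;
            scanned++;
            if (forced)
                (*forced_scans)++;
        }

        /*
         * Nothing else is left to scan in this sequence but inactive
         * VMAs were skipped: retry once with the activity check off.
         */
        if (skipped_inactive && !forced)
            forced = true;
        else
            break;
    }

    return scanned;
}
```

With three VMAs of which only the middle one is inactive, a call for sequence 1 scans all three and reports one forced scan; a second call with the same sequence number scans nothing, because every VMA is already marked as covered.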
autogroup.c	sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()	2022-08-12 11:25:10 +02:00
autogroup.h	sched/headers: Add header guard to kernel/sched/stats.h and kernel/sched/autogroup.h	2022-02-23 08:22:00 +01:00
build_policy.c	sched: Fix missing prototype warnings	2022-05-01 10:03:43 +02:00
build_utility.c	sched/headers: Remove duplicate header inclusions	2023-10-03 21:27:55 +02:00
clock.c	Locking changes for v6.5:	2023-06-27 14:14:30 -07:00
completion.c	sched: add a few helpers to wake up tasks on the current cpu	2023-07-17 16:08:08 -07:00
core_sched.c	sched: Rename task_running() to task_on_cpu()	2022-09-07 21:53:47 +02:00
core.c	sched/topology: Consolidate and clean up access to a CPU's max compute capacity	2023-10-09 12:59:48 +02:00
cpuacct.c	Merge branch 'sched/fast-headers' into sched/core	2022-03-15 09:05:05 +01:00
cpudeadline.c	sched/topology: Consolidate and clean up access to a CPU's max compute capacity	2023-10-09 12:59:48 +02:00
cpudeadline.h
cpufreq_schedutil.c	cpufreq: schedutil: Update next_freq when cpufreq_limits change	2023-10-05 22:09:50 +02:00
cpufreq.c	sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there	2022-02-23 10:58:33 +01:00
cpupri.c	sched/rt: Fix live lock between select_fallback_rq() and RT push	2023-09-28 22:58:13 +02:00
cpupri.h
cputime.c	cputime: remove cputime_to_nsecs fallback	2022-12-27 12:52:17 +01:00
deadline.c	sched/topology: Consolidate and clean up access to a CPU's max compute capacity	2023-10-09 12:59:48 +02:00
debug.c	sched/deadline: Make dl_rq->pushable_dl_tasks update drive dl_rq->overloaded	2023-09-29 10:20:21 +02:00
fair.c	sched/numa: Complete scanning of inactive VMAs when there is no alternative	2023-10-10 23:42:15 +02:00
features.h	sched/eevdf: Curb wakeup-preemption	2023-08-17 17:07:07 +02:00
idle.c	Merge branch 'sched/urgent' into sched/core, to pick up fixes and refresh the branch	2023-10-07 11:32:24 +02:00
isolation.c	sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there	2022-02-23 10:58:33 +01:00
loadavg.c	sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there	2022-02-23 10:58:33 +01:00
Makefile	sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c files there	2022-02-23 10:58:33 +01:00
membarrier.c	sched/membarrier: Introduce MEMBARRIER_CMD_GET_REGISTRATIONS	2023-01-07 11:29:29 +01:00
pelt.c	sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c files there	2022-02-23 10:58:33 +01:00
pelt.h	sched/fair: Decay task PELT values during wakeup migration	2022-06-28 09:17:46 +02:00
psi.c	sched/psi: Change update_triggers() to a 'void' function	2023-10-09 14:54:50 +02:00
rt.c	sched/topology: Consolidate and clean up access to a CPU's max compute capacity	2023-10-09 12:59:48 +02:00
sched-pelt.h
sched.h	sched/topology: Move the declaration of 'schedutil_gov' to kernel/sched/sched.h	2023-10-09 17:33:10 +02:00
smp.h	sched, smp: Trace smp callback causing an IPI	2023-03-24 11:01:29 +01:00
stats.c	sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there	2022-02-23 10:58:33 +01:00
stats.h	sched/psi: Use task->psi_flags to clear in CPU migration	2022-10-30 10:12:15 +01:00
stop_task.c	sched/fair: Rename check_preempt_curr() to wakeup_preempt()	2023-09-19 10:40:10 +02:00
swait.c	sched: add a few helpers to wake up tasks on the current cpu	2023-07-17 16:08:08 -07:00
topology.c	sched/topology: Move the declaration of 'schedutil_gov' to kernel/sched/sched.h	2023-10-09 17:33:10 +02:00
wait_bit.c	wait_on_bit: add an acquire memory barrier	2022-08-26 09:30:25 -07:00
wait.c	sched: add a few helpers to wake up tasks on the current cpu	2023-07-17 16:08:08 -07:00