linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-13 22:53:20 +00:00

Rik van Riel 6f9aad0bc3 sched/numa: Only consider less busy nodes as numa balancing destinations

Changeset `a43455a1d5` ("sched/numa: Ensure task_numa_migrate() checks the preferred node") fixes an issue where workloads would never converge on a fully loaded (or overloaded) system.

However, it introduces a regression on less than fully loaded systems, where workloads converge on a few NUMA nodes, instead of properly staying spread out across the whole system. This leads to a reduction in available memory bandwidth, and usable CPU cache, with predictable performance problems.

The root cause appears to be an interaction between the load balancer and NUMA balancing, where the short term load represented by the load balancer differs from the long term load the NUMA balancing code would like to base its decisions on.

Simply reverting `a43455a1d5` would re-introduce the non-convergence of workloads on fully loaded systems, so that is not a good option. As an aside, the check done before `a43455a1d5` only applied to a task's preferred node, not to other candidate nodes in the system, so the converge-on-too-few-nodes problem still happens, just to a lesser degree.

Instead, try to compensate for the impedance mismatch between the load balancer and NUMA balancing by only ever considering a lesser loaded node as a destination for NUMA balancing, regardless of whether the task is trying to move to the preferred node, or to another node.

This patch also addresses the issue that a system with a single runnable thread would never migrate that thread to near its memory, introduced by `095bebf61a` ("sched/numa: Do not move past the balance point if unbalanced").

A test where the main thread creates a large memory area, and spawns a worker thread to iterate over the memory (placed on another node by select_task_rq_fair), after which the main thread goes to sleep and waits for the worker thread to loop over all the memory now sees the worker thread migrated to where the memory is, instead of having all the memory migrated over like before.

Jirka has run a number of performance tests on several systems: single instance SpecJBB 2005 performance is 7-15% higher on a 4 node system, with higher gains on systems with more cores per socket. Multi-instance SpecJBB 2005 (one per node), linpack, and stream see little or no changes with the revert of `095bebf61a` and this patch.

Reported-by: Artem Bityutski <dedekind1@gmail.com>
Reported-by: Jirka Hladky <jhladky@redhat.com>
Tested-by: Jirka Hladky <jhladky@redhat.com>
Tested-by: Artem Bityutskiy <dedekind1@gmail.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150528095249.3083ade0@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

2015-06-07 15:57:45 +02:00
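
The rule described in the commit message, "only ever considering a lesser loaded node as a destination", can be illustrated with a small standalone sketch. This is not the code from the patch: `struct node_stats`, its fields, and `dst_is_less_busy()` are hypothetical names chosen for illustration, and the comparison assumes that load is normalised by each node's compute capacity.

```c
#include <stdbool.h>

/* Hypothetical per-node summary; the field names are illustrative only. */
struct node_stats {
	unsigned long load;             /* runnable load on the node */
	unsigned long compute_capacity; /* total CPU capacity of the node */
};

/*
 * Return true only if the source node is busier than the destination
 * after correcting for capacity, i.e. when
 *
 *     src->load / src->compute_capacity > dst->load / dst->compute_capacity
 *
 * The test is cross-multiplied to avoid integer division.
 */
static bool dst_is_less_busy(const struct node_stats *src,
			     const struct node_stats *dst)
{
	return src->load * dst->compute_capacity >
	       dst->load * src->compute_capacity;
}
```

Under these assumptions, a node running a single thread (non-zero load) always compares as busier than an idle node, so that thread stays eligible to move towards its memory, while two equally loaded nodes never qualify as destinations for each other.
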
..
bpf	bpf: fix 64-bit divide	2015-04-27 23:11:49 -04:00
configs	x86: Add "make tinyconfig" to configure the tiniest possible kernel	2014-08-08 16:30:24 -07:00
debug	debug: prevent entering debug mode on panic/exception.	2015-02-19 12:39:03 -06:00
events	perf: Annotate inherited event ctx->mutex recursion	2015-05-08 11:59:40 +02:00
gcov	gcov: fix softlockups	2015-04-17 09:04:08 -04:00
irq	genirq: Set IRQCHIP_SKIP_SET_WAKE flag for dummy_irq_chip	2015-04-24 20:57:06 +02:00
livepatch	Merge branch 'for-4.1/core-noarch' into for-linus	2015-04-13 23:57:20 +02:00
locking	sched: Handle priority boosted tasks proper in setscheduler()	2015-05-08 11:53:55 +02:00
power	Merge back earlier suspend/hibernate material for v4.1.	2015-04-10 12:01:59 +02:00
printk	TTY/Serial patches for 4.1-rc1	2015-04-21 09:33:10 -07:00
rcu	rcu: Control grace-period delays directly from value	2015-04-14 19:33:59 -07:00
sched	sched/numa: Only consider less busy nodes as numa balancing destinations	2015-06-07 15:57:45 +02:00
time	Merge branch 'linus' into sched/core, to resolve conflict	2015-06-02 08:05:42 +02:00
trace	tracing: Make ftrace_print_array_seq compute buf_len	2015-05-06 23:03:23 -04:00
.gitignore
acct.c	acct: check FMODE_CAN_WRITE	2015-04-11 22:27:55 -04:00
async.c	kernel/async.c: switch to pr_foo()	2014-10-09 22:26:04 -04:00
audit_tree.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2015-04-26 17:22:07 -07:00
audit_watch.c	VFS: audit: d_backing_inode() annotations	2015-04-15 15:06:55 -04:00
audit.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2015-04-26 17:22:07 -07:00
audit.h	Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit	2015-04-22 14:49:23 -07:00
auditfilter.c	Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit	2015-02-11 20:07:47 -08:00
auditsc.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2015-04-26 17:22:07 -07:00
backtracetest.c
bounds.c	page-cgroup: get rid of NR_PCG_FLAGS	2014-08-08 15:57:18 -07:00
capability.c	kernel: conditionally support non-root users, groups and capabilities	2015-04-15 16:35:22 -07:00
cgroup_freezer.c
cgroup.c	cgroup: remove use of seq_printf return value	2015-04-15 16:35:25 -07:00
compat.c	all arches, signal: move restart_block to struct task_struct	2015-02-12 18:54:12 -08:00
configs.c
context_tracking.c	context_tracking: Export context_tracking_user_enter/exit	2015-03-09 15:43:00 +01:00
cpu_pm.c
cpu.c	Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2015-04-14 13:36:04 -07:00
cpuset.c	kernel, cpuset: remove exception for __GFP_THISNODE	2015-04-14 16:49:03 -07:00
crash_dump.c	crash_dump: Make is_kdump_kernel() accessible from modules	2014-08-25 15:42:19 -07:00
cred.c	kernel: conditionally support non-root users, groups and capabilities	2015-04-15 16:35:22 -07:00
delayacct.c
dma.c
elfcore.c
exec_domain.c	Remove rest of exec domains.	2015-04-12 21:03:31 +02:00
exit.c	Remove execution domain support	2015-04-12 20:58:24 +02:00
extable.c	ftrace/x86/extable: Add is_ftrace_trampoline() function	2014-11-19 15:25:26 -05:00
fork.c	sched/preempt, mm/fault: Count pagefault_disable() levels in pagefault_disabled	2015-05-19 08:39:13 +02:00
freezer.c	freezer: remove obsolete comments in __thaw_task()	2014-10-21 23:44:20 +02:00
futex_compat.c
futex.c	futex: Implement lockless wakeups	2015-05-08 12:21:40 +02:00
groups.c	kernel: conditionally support non-root users, groups and capabilities	2015-04-15 16:35:22 -07:00
hung_task.c	kernel/hung_task.c: change hung_task.c to use for_each_process_thread()	2015-04-15 16:35:22 -07:00
irq_work.c	percpu: Convert remaining __get_cpu_var uses in 3.18-rcX	2014-10-29 11:18:18 -04:00
jump_label.c
kallsyms.c	kernel/kallsyms.c: use __seq_open_private()	2014-10-14 02:18:16 +02:00
kcmp.c	kcmp: fix standard comparison bug	2014-09-10 15:42:12 -07:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks	locking/mcs: Better differentiate between MCS variants	2015-01-14 15:07:32 +01:00
Kconfig.preempt
kexec.c	kexec: allocate the kexec control page with KEXEC_CONTROL_MEMORY_GFP	2015-04-23 16:52:01 +02:00
kmod.c	usermodehelper: kill the kmod_thread_locker logic	2014-12-10 17:41:17 -08:00
kprobes.c	kprobes: makes kprobes/enabled works correctly for optimized kprobes.	2015-02-13 21:21:42 -08:00
ksysfs.c
kthread.c	kernel/kthread.c: partial revert of `81c98869fa` ("kthread: ensure locality of task_struct allocations")	2014-10-09 22:25:51 -04:00
latencytop.c
Makefile	modsign: change default key details	2015-04-30 09:35:41 -07:00
module_signing.c
module-internal.h
module.c	module: Call module notifier on failure after complete_formation()	2015-05-09 03:29:24 +09:30
notifier.c	rcu: Make SRCU optional by using CONFIG_SRCU	2015-01-06 11:04:29 -08:00
nsproxy.c	bury struct proc_ns in fs/proc	2014-12-04 14:34:54 -05:00
padata.c	padata: use %*pb[l] to print bitmaps including cpumasks and nodemasks	2015-02-13 21:21:38 -08:00
panic.c	livepatch: kernel: add TAINT_LIVEPATCH	2014-12-22 15:40:48 +01:00
params.c	params: handle quotes properly for values not of form foo="bar".	2015-04-15 13:31:23 +09:30
pid_namespace.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2014-12-16 15:53:03 -08:00
pid.c	fork: report pid reservation failure properly	2015-04-17 09:04:06 -04:00
profile.c	profile: use %*pb[l] to print bitmaps including cpumasks and nodemasks	2015-02-13 21:21:38 -08:00
ptrace.c	ptrace: ptrace_detach() can no longer race with SIGKILL	2015-04-17 09:04:06 -04:00
range.c	kernel: avoid overflow in cmp_range	2015-01-17 10:02:23 +13:00
reboot.c	kernel/reboot.c: add orderly_reboot for graceful reboot	2015-04-15 16:35:23 -07:00
relay.c	VFS: kernel/: d_inode() annotations	2015-04-15 15:06:55 -04:00
resource.c	kernel/resource.c: remove deprecated __check_region() and friends	2015-04-15 16:35:22 -07:00
seccomp.c	seccomp: cap SECCOMP_RET_ERRNO data to MAX_ERRNO	2015-02-17 14:34:55 -08:00
signal.c	signals, sched: Change all uses of JOBCTL_* from 'int' to 'long'	2015-05-08 12:04:36 +02:00
smp.c	smp: Fix error case handling in smp_call_function_*()	2015-04-19 13:19:23 -07:00
smpboot.c	smpboot: Add common code for notification from dying CPU	2015-03-11 13:20:25 -07:00
smpboot.h
softirq.c	Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2015-02-09 15:24:03 -08:00
stacktrace.c	stacktrace: introduce snprint_stack_trace for buffer output	2014-12-13 12:42:48 -08:00
stop_machine.c
sys_ni.c	kernel: conditionally support non-root users, groups and capabilities	2015-04-15 16:35:22 -07:00
sys.c	prctl: avoid using mmap_sem for exe_file serialization	2015-04-17 09:04:07 -04:00
sysctl_binary.c	kernel: add panic_on_warn	2014-12-10 17:41:10 -08:00
sysctl.c	kernel/sysctl.c: detect overflows when converting to int	2015-04-17 09:04:08 -04:00
system_certificates.S
system_keyring.c
task_work.c
taskstats.c	netlink: make nlmsg_end() and genlmsg_end() void	2015-01-18 01:03:45 -05:00
test_kprobes.c	kernel/test_kprobes.c: use current logging functions	2014-08-08 15:57:18 -07:00
torture.c	torture: Address race in module cleanup	2014-09-16 13:41:06 -07:00
tracepoint.c
tsacct.c
uid16.c	groups: Consolidate the setgroups permission checks	2014-12-05 17:19:27 -06:00
up.c
user_namespace.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2014-12-17 12:31:40 -08:00
user-return-notifier.c	scheduler: Replace __get_cpu_var with this_cpu_ptr	2014-08-26 13:45:45 -04:00
user.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2014-12-17 12:31:40 -08:00
utsname_sysctl.c
utsname.c	copy address of proc_ns_ops into ns_common	2014-12-04 14:34:47 -05:00
watchdog.c	watchdog: fix double lock in watchdog_nmi_enable_all	2015-05-19 10:57:03 -07:00
workqueue_internal.h
workqueue.c	workqueue: Reorder sysfs code	2015-04-06 11:16:04 -04:00