linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-24 13:11:40 +00:00

Giovanni Gherdovich 6075620b05 sched/cputime: Mitigate performance regression in times()/clock_gettime()

Commit:

    `6e998916df` ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")

fixed a problem whereby clock_nanosleep() followed by clock_gettime() could allow a task to wake early. It addressed the problem by calling the scheduling class's update_curr() when the cputimer starts.

That change induced a considerable performance regression in the syscalls times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID). Some debuggers and applications that monitor their own performance accidentally depend on the performance of these specific calls.

This patch mitigates the performance loss by prefetching data into the CPU cache, as stalls due to cache misses appear to be where most time is spent in our benchmarks.

Here is the performance gain of this patch over v4.7-rc7 on a Sandy Bridge box with 32 logical cores and 2 NUMA nodes. The test is repeated with a variable number of threads, from 2 to 4*num_cpus; the results are in seconds and correspond to the average of 10 runs; the percentage gain is computed as (before-after)/before, so a positive value is an improvement (it's faster). The improvement varies between a few percent for 5-20 threads and more than 10% for 2 or >20 threads.
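The gain formula quoted above, (before-after)/before, can be sketched in C as follows. The function name `gain_percent` is illustrative only and is not part of the patch or of MMTests. The published percentages were presumably computed from unrounded means, so recomputing them from the two-decimal figures in the tables may give slightly different values.

```c
#include <assert.h>

/*
 * Percentage gain as described in the changelog: (before - after) / before.
 * gain_percent() is a hypothetical helper name used for illustration.
 * A positive result means "after" took less time, i.e. it is faster.
 */
double gain_percent(double before, double after)
{
    assert(before > 0.0);   /* a zero or negative baseline is meaningless here */
    return (before - after) / before * 100.0;
}
```

For instance, `gain_percent(4.0, 3.0)` evaluates to 25.0, while a slowdown such as `gain_percent(2.0, 3.0)` yields -50.0.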
pound_clock_gettime:

    threads    4.7-rc7    patched 4.7-rc7
    [num]      [secs]     [secs (percent)]
        2       3.48       3.06 ( 11.83%)
        5       3.33       3.25 (  2.40%)
        8       3.37       3.26 (  3.30%)
       12       3.32       3.37 ( -1.60%)
       21       4.01       3.90 (  2.74%)
       30       3.63       3.36 (  7.41%)
       48       3.71       3.11 ( 16.27%)
       79       3.75       3.16 ( 15.74%)
      110       3.81       3.25 ( 14.80%)
      128       3.88       3.31 ( 14.76%)

pound_times:

    threads    4.7-rc7    patched 4.7-rc7
    [num]      [secs]     [secs (percent)]
        2       3.65       3.25 ( 11.03%)
        5       3.45       3.17 (  7.92%)
        8       3.52       3.22 (  8.69%)
       12       3.29       3.36 ( -2.04%)
       21       4.07       3.92 (  3.78%)
       30       3.87       3.40 ( 12.17%)
       48       3.79       3.16 ( 16.61%)
       79       3.88       3.28 ( 15.42%)
      110       3.90       3.38 ( 13.35%)
      128       4.00       3.38 ( 15.45%)

pound_times and pound_clock_gettime are two benchmarks included in the MMTests framework. They launch a given number of threads which repeatedly call times() or clock_gettime(). The results above can be reproduced by cloning MMTests from github.com and running the "poundtime" workload:

    $ git clone https://github.com/gormanm/mmtests.git
    $ cd mmtests
    $ cp configs/config-global-dhp__workload_poundtime config
    $ ./run-mmtests.sh --run-monitor $(uname -r)

The above will run "poundtime", measuring the kernel currently running on the machine. Once a new kernel is installed and the machine rebooted, running again

    $ cd mmtests
    $ ./run-mmtests.sh --run-monitor $(uname -r)

will produce results to compare with. A comparison table will be output with:

    $ cd mmtests/work/log
    $ ../../compare-kernels.sh

The table will contain a lot of entries; grepping for "Amean" (as in "arithmetic mean") will give the tables presented above. The source code for the two benchmarks is reported at the end of this changelog for clarity.

The cache misses addressed by this patch were found using a combination of `perf top`, `perf record` and `perf annotate`. The offending lines were found to be

    struct sched_entity *curr = cfs_rq->curr;

and

    delta_exec = now - curr->exec_start;

in the function update_curr() from kernel/sched/fair.c.
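As a rough userspace illustration of the prefetching idea this changelog describes, the sketch below issues prefetch hints for the two memory locations that the offending lines dereference, just before the function that reads them. It is not the kernel patch: `struct entity`, `struct runqueue`, `prefetch_exec_start()`, `delta_exec()` and `sample_runtime()` are hypothetical stand-ins, and `__builtin_prefetch` is the GCC/Clang builtin rather than the kernel's own helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the scheduler structures; not the real layouts. */
struct entity {
    uint64_t exec_start;    /* timestamp at which the entity started running */
};

struct runqueue {
    struct entity *curr;    /* entity currently running on this queue */
};

/* Hint the CPU to start loading the data that delta_exec() will read. */
static inline void prefetch_exec_start(const struct runqueue *rq)
{
    const struct entity *curr = rq->curr;

    __builtin_prefetch(curr);
    __builtin_prefetch(&curr->exec_start);
}

/* Mirrors the two loads named in the changelog: rq->curr, then exec_start. */
static uint64_t delta_exec(const struct runqueue *rq, uint64_t now)
{
    const struct entity *curr = rq->curr;

    return now - curr->exec_start;
}

/* Prefetch first, then compute; other work could sit between the two calls. */
uint64_t sample_runtime(const struct runqueue *rq, uint64_t now)
{
    assert(rq != NULL && rq->curr != NULL);
    prefetch_exec_start(rq);
    return delta_exec(rq, now);
}
```

A prefetch is only a hint and never changes the computed result; whether it helps depends on how much independent work separates the hint from the actual load, which is why the changelog backs the change with measurements.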
This patch prefetches the data from memory just before update_curr() is called in the affected execution path.

A comparison of the total number of cycles before and after the patch follows; the data is obtained using `perf stat -r 10 -ddd <program>` running over the same sequence of thread counts used above (a positive gain is an improvement):

    threads   cycles before               cycles after                gain
        2      19,699,563,964 +-1.19%      17,358,917,517 +-1.85%     11.88%
        5      47,401,089,566 +-2.96%      45,103,730,829 +-0.97%      4.85%
        8      80,923,501,004 +-3.01%      71,419,385,977 +-0.77%     11.74%
       12     112,326,485,473 +-0.47%     110,371,524,403 +-0.47%      1.74%
       21     193,455,574,299 +-0.72%     180,120,667,904 +-0.36%      6.89%
       30     315,073,519,013 +-1.64%     271,222,225,950 +-1.29%     13.92%
       48     321,969,515,332 +-1.48%     273,353,977,321 +-1.16%     15.10%
       79     337,866,003,422 +-0.97%     289,462,481,538 +-1.05%     14.33%
      110     338,712,691,920 +-0.78%     290,574,233,170 +-0.77%     14.21%
      128     348,384,794,006 +-0.50%     292,691,648,206 +-0.66%     15.99%

A comparison of the ratio of cache misses to total cache loads, before and after the patch (again from the `perf stat -r 10 -ddd <program>` tables):

    threads   L1 misses/total*100   L1 misses/total*100   gain
              before                after
        2      7.43 +-4.90%          7.36 +-4.70%          0.94%
        5     13.09 +-4.74%         13.52 +-3.73%         -3.28%
        8     13.79 +-5.61%         12.90 +-3.27%          6.45%
       12     11.57 +-2.44%          8.71 +-1.40%         24.72%
       21     12.39 +-3.92%          9.97 +-1.84%         19.53%
       30     13.91 +-2.53%         11.73 +-2.28%         15.67%
       48     13.71 +-1.59%         12.32 +-1.97%         10.14%
       79     14.44 +-0.66%         13.40 +-1.06%          7.20%
      110     15.86 +-0.50%         14.46 +-0.59%          8.83%
      128     16.51 +-0.32%         15.06 +-0.78%          8.78%

As a final note, the following shows the evolution of performance figures in the "poundtime" benchmark and pinpoints commit `6e998916df` ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a major source of degradation, mostly unaddressed to this day (figures expressed in seconds).
pound_clock_gettime:

    threads   parent of       `6e998916df`        4.7-rc7
              `6e998916df`    itself
        2      2.23            3.68 ( -64.56%)     3.48 (-55.48%)
        5      2.83            3.78 ( -33.42%)     3.33 (-17.43%)
        8      2.84            4.31 ( -52.12%)     3.37 (-18.76%)
       12      3.09            3.61 ( -16.74%)     3.32 ( -7.17%)
       21      3.14            4.63 ( -47.36%)     4.01 (-27.71%)
       30      3.28            5.75 ( -75.37%)     3.63 (-10.80%)
       48      3.02            6.05 (-100.56%)     3.71 (-22.99%)
       79      2.88            6.30 (-118.90%)     3.75 (-30.26%)
      110      2.95            6.46 (-119.00%)     3.81 (-29.24%)
      128      3.05            6.42 (-110.08%)     3.88 (-27.04%)

pound_times:

    threads   parent of       `6e998916df`        4.7-rc7
              `6e998916df`    itself
        2      2.27            3.73 ( -64.71%)     3.65 (-61.14%)
        5      2.78            3.77 ( -35.56%)     3.45 (-23.98%)
        8      2.79            4.41 ( -57.71%)     3.52 (-26.05%)
       12      3.02            3.56 ( -17.94%)     3.29 ( -9.08%)
       21      3.10            4.61 ( -48.74%)     4.07 (-31.34%)
       30      3.33            5.75 ( -72.53%)     3.87 (-16.01%)
       48      2.96            6.06 (-105.04%)     3.79 (-28.10%)
       79      2.88            6.24 (-116.83%)     3.88 (-34.81%)
      110      2.98            6.37 (-114.08%)     3.90 (-31.12%)
      128      3.10            6.35 (-104.61%)     4.00 (-28.87%)

The source code of the two benchmarks follows. To compile the two:

    NR_THREADS=42
    for FILE in pound_times pound_clock_gettime; do
        gcc -lrt -O2 -lpthread -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE
    done

==== BEGIN pound_times.c ====

    #include <stdio.h>
    #include <pthread.h>
    #include <sys/times.h>

    struct tms start;

    /* Each thread calls times() in a loop and checks that utime never decreases. */
    void *pound(void *threadid)
    {
        struct tms end;
        int oldutime = 0;
        int utime;
        int i;

        for (i = 0; i < 5000000 / NUM_THREADS; i++) {
            times(&end);
            utime = ((int)end.tms_utime - (int)start.tms_utime);
            if (oldutime > utime) {
                printf("utime decreased, was %d, now %d!\n", oldutime, utime);
            }
            oldutime = utime;
        }
        pthread_exit(NULL);
    }

    int main()
    {
        pthread_t th[NUM_THREADS];
        long i;

        times(&start);
        for (i = 0; i < NUM_THREADS; i++) {
            pthread_create(&th[i], NULL, pound, (void *)i);
        }
        /* Exit only the main thread so the workers keep running. */
        pthread_exit(NULL);
        return 0;
    }

==== END pound_times.c ====

==== BEGIN pound_clock_gettime.c ====

    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <pthread.h>
    #include <sys/types.h>

    /* Each thread reads the process CPU-time clock in a loop. */
    void *pound(void *threadid)
    {
        struct timespec ts;
        int rc, i;
        unsigned long prev = 0, this = 0;

        for (i = 0; i < 5000000 / NUM_THREADS; i++) {
            rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
            if (rc < 0)
                perror("clock_gettime");
            this = (ts.tv_sec * 1000000000) + ts.tv_nsec;
            /* The timewarp check is deliberately disabled by the "0 &&". */
            if (0 && this < prev)
                printf("%lu ns timewarp at iteration %d\n", prev - this, i);
            prev = this;
        }
        pthread_exit(NULL);
    }

    int main()
    {
        pthread_t th[NUM_THREADS];
        long rc, i;
        pid_t pgid;

        for (i = 0; i < NUM_THREADS; i++) {
            rc = pthread_create(&th[i], NULL, pound, (void *)i);
            /* pthread_create() returns an error number and does not set errno. */
            if (rc != 0)
                fprintf(stderr, "pthread_create: %s\n", strerror(rc));
        }
        pthread_exit(NULL);
        return 0;
    }

==== END pound_clock_gettime.c ====

Suggested-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1470385316-15027-2-git-send-email-ggherdovich@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>

2016-08-10 13:32:56 +02:00
..
bpf	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next	2016-07-27 12:03:20 -07:00
configs	config: add android config fragments	2016-08-02 19:35:42 -04:00
debug	mm/init: Add 'rodata=off' boot cmdline parameter to disable read-only kernel mappings	2016-02-22 08:51:37 +01:00
events	perf/core: Change log level for duration warning to KERN_INFO	2016-08-02 10:23:57 +02:00
gcov	gcov: add support for gcc version >= 6	2016-07-15 14:54:27 +09:00
irq	genirq: Fix missing irq allocation affinity hint	2016-07-19 10:49:47 +02:00
livepatch	modules: add ro_after_init support	2016-08-04 10:16:55 +09:30
locking	Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2016-07-25 12:41:29 -07:00
power	mm, vmscan: move LRU lists to node	2016-07-28 16:07:41 -07:00
printk	Revert "printk: create pr_<level> functions"	2016-08-09 10:48:18 -07:00
rcu	Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2016-07-29 13:55:30 -07:00
sched	sched/cputime: Mitigate performance regression in times()/clock_gettime()	2016-08-10 13:32:56 +02:00
time	Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2016-07-29 13:55:30 -07:00
trace	block: rename bio bi_rw to bi_opf	2016-08-07 14:41:02 -06:00
.gitignore
acct.c
async.c
audit_fsnotify.c	wrappers for ->i_mutex access	2016-01-22 18:04:28 -05:00
audit_tree.c	audit: cleanup prune_tree_thread	2016-04-04 09:46:47 -04:00
audit_watch.c	don't bother with ->d_inode->i_sb - it's always equal to ->d_sb	2016-04-10 17:11:51 -04:00
audit.c	Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit	2016-07-29 17:54:17 -07:00
audit.h	Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit	2016-07-29 17:54:17 -07:00
auditfilter.c	audit: add fields to exclude filter by reusing user filter	2016-06-27 11:01:00 -04:00
auditsc.c	Merge branch 'stable-4.8' of git://git.infradead.org/users/pcmoore/audit	2016-07-29 17:54:17 -07:00
backtracetest.c
bounds.c
capability.c	kernel: Add noaudit variant of ns_capable()	2016-06-06 20:16:18 +10:00
cgroup_freezer.c
cgroup_pids.c	cgroup: Use lld instead of ld when printing pids controller events_limit	2016-06-21 15:03:36 -04:00
cgroup.c	Merge branch 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup	2016-07-29 14:29:04 -07:00
compat.c
configs.c
context_tracking.c
cpu_pm.c
cpu.c	timers/core: Correct callback order during CPU hot plug	2016-07-28 18:56:22 +02:00
cpuset.c	cpuset, mm: fix TIF_MEMDIE check in cpuset_change_task_nodemask	2016-07-28 16:07:41 -07:00
crash_dump.c
cred.c	cred: Reject inodes with invalid ids in set_create_file_as()	2016-06-30 18:05:09 -05:00
delayacct.c	kmemcg: account certain kmem allocations to memcg	2016-01-14 16:00:49 -08:00
dma.c
elfcore.c
exec_domain.c
exit.c	kernel/exit.c: quieten greatest stack depth printk	2016-08-02 19:35:23 -04:00
extable.c
fork.c	mm: fix memcg stack accounting for sub-page stacks	2016-07-28 16:07:41 -07:00
freezer.c	freezer, oom: check TIF_MEMDIE on the correct task	2016-07-28 16:07:41 -07:00
futex_compat.c	ptrace: use fsuid, fsgid, effective creds for fs access checks	2016-01-20 17:09:18 -08:00
futex.c	futex: Calculate the futex key based on a tail page for file-based futexes	2016-06-08 19:23:54 +02:00
groups.c
hung_task.c	kernel/hung_task.c: use timeout diff when timeout is updated	2016-03-22 15:36:02 -07:00
irq_work.c
jump_label.c	powerpc updates for 4.8 #2	2016-08-05 09:00:54 -04:00
kallsyms.c	kallsyms: add support for relative offsets in kallsyms address table	2016-03-15 16:55:16 -07:00
kcmp.c	ptrace: use fsuid, fsgid, effective creds for fs access checks	2016-01-20 17:09:18 -08:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt
kcov.c	kernel/kcov: unproxify debugfs file's fops	2016-06-15 04:56:35 -07:00
kexec_core.c	kexec: add restriction on kexec_load() segment sizes	2016-08-02 19:35:31 -04:00
kexec_file.c	kexec: introduce a protection mechanism for the crashkernel reserved memory	2016-05-23 17:04:14 -07:00
kexec_internal.h	kexec: move some memembers and definitions within the scope of CONFIG_KEXEC_FILE	2016-01-20 17:09:18 -08:00
kexec.c	kexec: allow architectures to override boot mapping	2016-08-02 19:35:27 -04:00
kmod.c
kprobes.c
ksysfs.c	kexec: add a kexec_crash_loaded() function	2016-08-02 19:35:30 -04:00
kthread.c
latencytop.c	sched/debug: Make schedstats a runtime tunable that is disabled by default	2016-02-09 11:54:23 +01:00
Makefile	ELF/MIPS build fix	2016-05-23 17:04:14 -07:00
membarrier.c
memremap.c	libnvdimm for 4.8	2016-07-28 17:38:16 -07:00
module_signing.c	KEYS: Move the point of trust determination to __key_link()	2016-04-11 22:43:43 +01:00
module-internal.h
module.c	Removed the MODULE_SIG_FORCE-means-no-MODULE_FORCE_LOAD patch.	2016-08-04 09:14:38 -04:00
notifier.c
nsproxy.c	cgroup: introduce cgroup namespaces	2016-02-16 13:04:58 -05:00
padata.c	kernel/padata.c: hide unused functions	2016-05-19 19:12:14 -07:00
panic.c	kexec: use core_param for crash_kexec_post_notifiers boot option	2016-08-02 19:35:29 -04:00
params.c
pid_namespace.c
pid.c	remove lots of IS_ERR_VALUE abuses	2016-05-27 15:26:11 -07:00
profile.c	profile: Convert to hotplug state machine	2016-07-15 10:41:42 +02:00
ptrace.c	tree-wide: replace config_enabled() with IS_ENABLED()	2016-08-04 08:50:07 -04:00
range.c
reboot.c
relay.c	relay: add global mode support for buffer-only channels	2016-08-02 19:35:41 -04:00
resource.c	/proc/iomem: only expose physical resource addresses to privileged users	2016-04-14 12:56:09 -07:00
seccomp.c	tree-wide: replace config_enabled() with IS_ENABLED()	2016-08-04 08:50:07 -04:00
signal.c	signals: Use hrtimer for sigtimedwait()	2016-07-07 10:35:07 +02:00
smp.c	Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2016-07-29 13:55:30 -07:00
smpboot.c	cpu/hotplug: Unpark smpboot threads from the state machine	2016-03-01 20:36:56 +01:00
smpboot.h	cpu/hotplug: Create hotplug threads	2016-03-01 20:36:56 +01:00
softirq.c	arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections	2016-03-25 16:37:42 -07:00
stacktrace.c
stop_machine.c	stop_machine: Touch_nmi_watchdog() after MULTI_STOP_PREPARE	2016-07-27 11:12:11 +02:00
sys_ni.c
sys.c	prctl: make PR_SET_THP_DISABLE wait for mmap_sem killable	2016-05-23 17:04:14 -07:00
sysctl_binary.c	kernel/sysctl_binary.c: use generic UUID library	2016-05-20 17:58:30 -07:00
sysctl.c	printk: add kernel parameter to control writes to /dev/kmsg	2016-08-02 19:35:06 -04:00
task_work.c	task_work: use READ_ONCE/lockless_dereference, avoid pi_lock if !task_works	2016-08-02 19:35:02 -04:00
taskstats.c	taskstats: use the libnl API to align nlattr on 64-bit	2016-04-23 20:13:25 -04:00
test_kprobes.c
torture.c	torture: Stop onoff task if there is only one cpu	2016-06-14 16:03:28 -07:00
tracepoint.c	kernel/...: convert pr_warning to pr_warn	2016-03-22 15:36:02 -07:00
tsacct.c	time, acct: Drop irq save & restore from __acct_update_integrals()	2016-02-29 09:53:09 +01:00
uid16.c
up.c
user_namespace.c	fs: Limit file caps to the user namespace of the super block	2016-06-24 10:40:31 -05:00
user-return-notifier.c
user.c
utsname_sysctl.c
utsname.c
watchdog.c	Revert "perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86"	2016-07-10 20:58:36 +02:00
workqueue_internal.h	sched/core: Get rid of 'cpu' argument in wq_worker_sleeping()	2016-03-02 10:28:47 -05:00
workqueue.c	Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2016-07-29 13:55:30 -07:00