linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-23 20:51:44 +00:00


Linus Torvalds e7989789c6 Timers and timekeeping updates:

  - Improve the VDSO build time checks to cover all dynamic relocations

    VDSO does not allow dynamic relocations, but the build time check is incomplete and fragile. It relies on architectures specifying the relocation types to search for and does not handle R_*_NONE relocation entries correctly.

    R_*_NONE relocations are injected by some GNU ld variants if they fail to determine the exact .rel[a]/dyn_size to cover trailing zeros. R_*_NONE relocations must be ignored by dynamic loaders, so they should be ignored in the build time check too.

    Remove the architecture specific relocation types to check for and validate strictly that no relocations other than R_*_NONE end up in the VDSO .so file.

  - Prefer signal delivery to the current thread for CLOCK_PROCESS_CPUTIME_ID based posix-timers

    Such timers prefer to deliver the signal to the main thread of a process even if the context in which the timer expires is the current task. This has the downside that it might wake up an idle thread.

    As there is no requirement or guarantee that the signal has to be delivered to the main thread, avoid this by preferring the current task if it is part of the thread group which shares sighand.

    This not only avoids waking idle threads, it also spreads signal delivery more evenly when multiple timers fire close to each other in the context of different threads.

  - Align the tick period properly (again)

    For a long time the tick started at CLOCK_MONOTONIC zero, which allowed user space applications to either align with the tick or to place a periodic computation so that it does not interfere with the tick. The alignment of the tick period was more by chance than by intention, as the tick is set up before a high resolution clocksource is installed, i.e. timekeeping is still tick based and the tick period advances from there.

    The early enablement of sched_clock() broke this alignment, as the time accumulated by sched_clock() is taken into account when timekeeping is initialized. So the base value now(CLOCK_MONOTONIC) is no longer a multiple of tick periods, which breaks applications that relied on that behaviour.

    Cure this by aligning the tick starting point to the next multiple of tick periods, i.e. 1000ms/CONFIG_HZ.

  - A set of NOHZ fixes and enhancements:

    * Cure the concurrent writer race for idle and IO sleeptime statistics

      The statistic values which are exposed via /proc/stat are updated from the CPU local idle exit and remotely by cpufreq, but that happens without any form of serialization. As a consequence sleeptimes can be accounted twice or worse.

      Prevent this by restricting the accumulation writeback to the CPU local idle exit and let the remote access compute the accumulated value.

    * Protect idle/iowait sleep time with a sequence count

      Reading idle/iowait sleep time, e.g. from /proc/stat, can race with idle exit updates. As a consequence the readout may return random values, which may even go backwards.

      Protect this by a sequence count, which fixes the idle time statistics issue, but cannot fix the iowait time problem because iowait time accounting races with remote wake ups decrementing the remote runqueue's nr_iowait counter. The latter is impossible to fix, so the only way to deal with that is to document it properly and to remove the assertion in the selftest which triggers occasionally due to that.

    * Restructure struct tick_sched for better cache layout

    * Some small cleanups and a better cache layout for struct tick_sched

  - Implement the missing timer_wait_running() callback for POSIX CPU timers

    For unknown reasons the introduction of the timer_wait_running() callback failed to fix up POSIX CPU timers, which went unnoticed for almost four years.

    While initially only targeted at preventing livelocks between a timer deletion and the timer expiry function on PREEMPT_RT enabled kernels, it turned out that fixing this for mainline is not as trivial as just implementing a stub similar to the hrtimer/timer callbacks.

    The reason is that for CONFIG_POSIX_CPU_TIMERS_TASK_WORK enabled systems there is a livelock issue independent of RT.

    CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y moves the expiry of POSIX CPU timers out of hard interrupt context to task work, which is handled before returning to user space or to a VM. The expiry mechanism moves the expired timers to a stack local list head with sighand lock held. Once sighand is dropped the task can be preempted, and a task which wants to delete a timer will spin-wait until the expiry task is scheduled back in. In the worst case this ends up in a livelock when the preempting task and the expiry task are pinned on the same CPU.

    The timer wheel has a timer_wait_running() mechanism for RT, which uses a per CPU timer-base expiry lock. The expiry code holds that lock, and the task waiting for the timer function to complete blocks on it. This does not work in the same way for POSIX CPU timers, as there is no timer base and expiry for process wide timers can run on any task belonging to that process, but the concept of waiting on an expiry lock can be used in a slightly different way.

    Add a per task mutex to struct posix_cputimers_work, let the expiry task hold it across the expiry function, and let the deleting task which waits for the expiry to complete block on the mutex. In the non-contended case this results in an extra mutex_lock()/unlock() pair on both sides.

    This avoids spin-waiting on a task which is scheduled out, prevents the livelock and cures the problem for RT and !RT systems.
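The tick alignment item above comes down to a round-up to the next multiple of the tick period. The following is a minimal user-space sketch of that arithmetic, not the kernel's implementation. The names `tick_period_ns` and `align_tick_start` are hypothetical, and it assumes that a start time which is already aligned is left unchanged.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Tick period in nanoseconds for a given HZ value (hypothetical helper). */
static uint64_t tick_period_ns(unsigned int hz)
{
    return NSEC_PER_SEC / hz;
}

/*
 * Round now_ns up to the next multiple of the tick period.
 * Assumption: an already aligned value is returned unchanged.
 */
static uint64_t align_tick_start(uint64_t now_ns, unsigned int hz)
{
    uint64_t period = tick_period_ns(hz);
    uint64_t rem = now_ns % period;

    return rem ? now_ns + (period - rem) : now_ns;
}
```

With HZ=250 the period is 4,000,000 ns, so a start time of 4,000,001 ns would be moved to 8,000,000 ns, while 8,000,000 ns stays where it is.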
Merge tag 'timers-core-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timers and timekeeping updates from Thomas Gleixner.

* tag 'timers-core-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  posix-cpu-timers: Implement the missing timer_wait_running callback
  selftests/proc: Assert clock_gettime(CLOCK_BOOTTIME) VS /proc/uptime monotonicity
  selftests/proc: Remove idle time monotonicity assertions
  MAINTAINERS: Remove stale email address
  timers/nohz: Remove middle-function __tick_nohz_idle_stop_tick()
  timers/nohz: Add a comment about broken iowait counter update race
  timers/nohz: Protect idle/iowait sleep time under seqcount
  timers/nohz: Only ever update sleeptime from idle exit
  timers/nohz: Restructure and reshuffle struct tick_sched
  tick/common: Align tick period with the HZ tick.
  selftests/timers/posix_timers: Test delivery of signals across threads
  posix-timers: Prefer delivery of signals to the current thread
  vdso: Improve cmd_vdso_check to check all dynamic relocations

2023-04-25 11:22:46 -07:00
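The sequence-count protection mentioned in the pull message (a single writer at idle exit, readers such as /proc/stat retrying on a torn read) can be illustrated with a small user-space sketch in C11 atomics. This is not the kernel's seqcount API. `struct sleep_stats`, `stats_update` and `stats_read` are hypothetical names, and the sketch assumes exactly one writer and default sequentially consistent atomics.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for per-CPU sleep time statistics. */
struct sleep_stats {
    atomic_uint seq;              /* odd while an update is in progress */
    _Atomic uint64_t idle_ns;
    _Atomic uint64_t iowait_ns;
};

/* Writer side: assumed to be called by a single writer only. */
static void stats_update(struct sleep_stats *s,
                         uint64_t idle_delta, uint64_t iowait_delta)
{
    atomic_fetch_add(&s->seq, 1);                 /* becomes odd */
    atomic_fetch_add(&s->idle_ns, idle_delta);
    atomic_fetch_add(&s->iowait_ns, iowait_delta);
    atomic_fetch_add(&s->seq, 1);                 /* becomes even again */
}

/* Reader side: retry until both values come from one consistent update. */
static void stats_read(struct sleep_stats *s,
                       uint64_t *idle, uint64_t *iowait)
{
    unsigned int start;

    do {
        start = atomic_load(&s->seq);
        *idle = atomic_load(&s->idle_ns);
        *iowait = atomic_load(&s->iowait_ns);
    } while ((start & 1) || start != atomic_load(&s->seq));
}
```

The reader never blocks the writer. It only repeats the read when the sequence number was odd or changed during the readout, which is why this pattern suits statistics that are written from one place and read from many.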
..
bpf	fget() to fdget() conversions	2023-04-24 19:14:20 -07:00
cgroup	fget() to fdget() conversions	2023-04-24 19:14:20 -07:00
configs	mm, slob: rename CONFIG_SLOB to CONFIG_SLOB_DEPRECATED	2022-12-01 00:09:20 +01:00
debug	kdb: use srcu console list iterator	2022-12-02 11:25:00 +01:00
dma	swiotlb: fix a braino in the alignment check fix	2023-04-06 16:45:12 +02:00
entry	ptrace: Provide set/get interface for syscall user dispatch	2023-04-16 14:23:07 +02:00
events	perf/core: Fix the same task check in perf_event_set_output	2023-04-05 09:58:46 +02:00
futex	- Prevent the leaking of a debug timer in futex_waitv()	2023-01-01 11:15:05 -08:00
gcov	gcov: add support for checksum field	2022-12-21 14:31:52 -08:00
irq	genirq: Update affinity of secondary threads	2023-04-15 10:17:16 +02:00
kcsan	Kernel concurrency sanitizer (KCSAN) updates for v6.4	2023-04-24 11:46:53 -07:00
livepatch	Livepatching changes for 6.3	2023-02-23 14:00:10 -08:00
locking	RCU Changes for 6.4:	2023-04-24 12:16:14 -07:00
module	modules-6.3-rc1	2023-02-23 14:05:08 -08:00
power	Merge branches 'powercap', 'pm-domains', 'pm-em' and 'pm-opp'	2023-02-15 20:06:26 +01:00
printk	printk changes for 6.3	2023-02-23 13:49:45 -08:00
rcu	RCU Changes for 6.4:	2023-04-24 12:16:14 -07:00
sched	sched/fair: Fix imbalance overflow	2023-04-12 16:46:30 +02:00
time	Timers and timekeeping updates:	2023-04-25 11:22:46 -07:00
trace	RCU Changes for 6.4:	2023-04-24 12:16:14 -07:00
.gitignore
acct.c	acct: fix potential integer overflow in encode_comp_t()	2022-11-30 16:13:18 -08:00
async.c	Revert "module, async: async_synchronize_full() on module init iff async is used"	2022-02-03 11:20:34 -08:00
audit_fsnotify.c	audit: fix potential double free on error path from fsnotify_add_inode_mark	2022-08-22 18:50:06 -04:00
audit_tree.c	audit: use fsnotify group lock helpers	2022-04-25 14:37:28 +02:00
audit_watch.c	audit_init_parent(): constify path	2022-09-01 17:39:30 -04:00
audit.c	audit: use time_after to compare time	2022-08-29 19:47:03 -04:00
audit.h	audit: remove selinux_audit_rule_update() declaration	2022-09-07 11:30:15 -04:00
auditfilter.c
auditsc.c	capability: just use a 'u64' instead of a 'u32[2]' array	2023-03-01 10:01:22 -08:00
backtracetest.c
bounds.c	mm: multi-gen LRU: minimal implementation	2022-09-26 19:46:09 -07:00
capability.c	capability: just use a 'u64' instead of a 'u32[2]' array	2023-03-01 10:01:22 -08:00
cfi.c	cfi: Switch to -fsanitize=kcfi	2022-09-26 10:13:13 -07:00
compat.c	sched_getaffinity: don't assume 'cpumask_size()' is fully initialized	2023-03-14 19:32:38 -07:00
configs.c
context_tracking.c	context_tracking: Fix noinstr vs KASAN	2023-01-13 11:48:18 +01:00
cpu_pm.c	cpuidle, cpu_pm: Remove RCU fiddling from cpu_pm_{enter,exit}()	2023-01-13 11:48:15 +01:00
cpu.c	cpu/hotplug: Do not bail-out in DYING/STARTING sections	2022-12-02 12:43:02 +01:00
crash_core.c	mm: remove 'First tail page' members from struct page	2023-02-02 22:32:59 -08:00
crash_dump.c
cred.c	cred: Do not default to init_cred in prepare_kernel_cred()	2022-11-01 10:04:52 -07:00
delayacct.c	delayacct: support re-entrance detection of thrashing accounting	2022-09-26 19:46:07 -07:00
dma.c
exec_domain.c
exit.c	arm64 updates for 6.3:	2023-02-21 15:27:48 -08:00
extable.c	context_tracking: Take NMI eqs entrypoints over RCU	2022-07-05 13:32:59 -07:00
fail_function.c	kernel/fail_function: fix memory leak with using debugfs_lookup()	2023-02-08 13:36:22 +01:00
fork.c	v6.4/pidfd.file	2023-04-24 13:03:42 -07:00
freezer.c	freezer,sched: Rewrite core freezer logic	2022-09-07 21:53:50 +02:00
gen_kheaders.sh	kheaders: use standard naming for the temporary directory	2023-01-22 23:43:34 +09:00
groups.c	security: Add LSM hook to setgroups() syscall	2022-07-15 18:21:49 +00:00
hung_task.c	hung_task: print message when hung_task_warnings gets down to zero.	2023-02-09 17:03:20 -08:00
iomem.c
irq_work.c	irq_work: use kasan_record_aux_stack_noalloc() record callstack	2022-04-15 14:49:55 -07:00
jump_label.c	jump_label: Prevent key->enabled int overflow	2022-12-01 15:53:05 -08:00
kallsyms_internal.h	kallsyms: Reduce the memory occupied by kallsyms_seqs_of_names[]	2022-11-12 18:47:36 -08:00
kallsyms_selftest.c	kallsyms: Fix scheduling with interrupts disabled in self-test	2023-01-13 15:09:08 -08:00
kallsyms_selftest.h	kallsyms: Add self-test facility	2022-11-15 00:42:02 -08:00
kallsyms.c	kallsyms: Add self-test facility	2022-11-15 00:42:02 -08:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt	Revert "signal, x86: Delay calling signals in atomic on RT enabled kernels"	2022-03-31 10:36:55 +02:00
kcov.c	mm: replace vma->vm_flags direct modifications with modifier calls	2023-02-09 16:51:39 -08:00
kexec_core.c	There is no particular theme here - mainly quick hits all over the tree.	2023-02-23 17:55:40 -08:00
kexec_elf.c
kexec_file.c	kexec: introduce sysctl parameters kexec_load_limit_*	2023-02-02 22:50:05 -08:00
kexec_internal.h	panic, kexec: make __crash_kexec() NMI safe	2022-09-11 21:55:06 -07:00
kexec.c	kexec: introduce sysctl parameters kexec_load_limit_*	2023-02-02 22:50:05 -08:00
kheaders.c
kmod.c
kprobes.c	x86/kprobes: Fix arch_check_optimized_kprobe check within optimized_kprobe range	2023-02-21 08:49:16 +09:00
ksysfs.c	kernels/ksysfs.c: export kernel address bits	2023-01-20 14:30:45 +01:00
kthread.c	kthread: Pass in the thread's name during creation	2023-03-12 10:54:36 +01:00
latencytop.c	latencytop: use the last element of latency_record of system	2022-09-11 21:55:12 -07:00
Makefile	vhost_task: Allow vhost layer to use copy_process	2023-03-23 12:45:36 +01:00
module_signature.c
notifier.c	kernel/notifier: Remove CONFIG_SRCU	2023-02-02 16:26:06 -08:00
nsproxy.c	convert setns(2) to fdget()/fdput()	2023-04-20 22:55:35 -04:00
padata.c	Kbuild updates for v6.2	2022-12-19 12:33:32 -06:00
panic.c	panic: fix the panic_print NMI backtrace setting	2023-03-02 21:54:23 -08:00
params.c	kernel/params.c: Use kstrtobool() instead of strtobool()	2023-01-25 14:07:21 -08:00
pid_namespace.c	- Daniel Verkamp has contributed a memfd series ("mm/memfd: add	2023-02-23 17:09:35 -08:00
pid_sysctl.h	mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC	2023-01-18 17:12:37 -08:00
pid.c	pid: add pidfd_prepare()	2023-04-03 11:16:56 +02:00
profile.c	kernel/profile.c: simplify duplicated code in profile_setup()	2022-09-11 21:55:12 -07:00
ptrace.c	ptrace: Provide set/get interface for syscall user dispatch	2023-04-16 14:23:07 +02:00
range.c
reboot.c	kernel/reboot: Add SYS_OFF_MODE_RESTART_PREPARE mode	2022-10-04 15:59:36 +02:00
regset.c
relay.c	mm: replace vma->vm_flags direct modifications with modifier calls	2023-02-09 16:51:39 -08:00
resource_kunit.c
resource.c	dax/kmem: Fix leak of memory-hotplug resources	2023-02-17 14:58:01 -08:00
rseq.c	rseq: Extend struct rseq with per-memory-map concurrency ID	2022-12-27 12:52:12 +01:00
scftorture.c	scftorture: Fix distribution of short handler delays	2022-04-11 17:07:29 -07:00
scs.c	scs: add support for dynamic shadow call stacks	2022-11-09 18:06:35 +00:00
seccomp.c	seccomp: fix kernel-doc function name warning	2023-01-13 17:01:06 -08:00
signal.c	posix-timers: Prefer delivery of signals to the current thread	2023-04-16 09:00:18 +02:00
smp.c	bitmap patches for v6.1-rc1	2022-10-10 12:49:34 -07:00
smpboot.c	smpboot: use atomic_try_cmpxchg in cpu_wait_death and cpu_report_death	2022-09-11 21:55:10 -07:00
smpboot.h
softirq.c	softirq: Add trace points for tasklet entry/exit	2023-04-15 10:17:16 +02:00
stackleak.c	stackleak: add on/off stack variants	2022-05-08 01:33:09 -07:00
stacktrace.c	uaccess: remove CONFIG_SET_FS	2022-02-25 09:36:06 +01:00
static_call_inline.c	static_call: Add call depth tracking support	2022-10-17 16:41:16 +02:00
static_call.c	static_call: Don't make __static_call_return0 static	2022-04-05 09:59:38 +02:00
stop_machine.c	Scheduler changes in this cycle were:	2022-05-24 11:11:13 -07:00
sys_ni.c	kernel/sys_ni: add compat entry for fadvise64_64	2022-08-20 15:17:45 -07:00
sys.c	kernel/sys.c: fix and improve control flow in __sys_setres[ug]id()	2023-04-18 14:22:12 -07:00
sysctl-test.c	kernel/sysctl-test: use SYSCTL_{ZERO/ONE_HUNDRED} instead of i_{zero/one_hundred}	2022-09-08 16:56:45 -07:00
sysctl.c	sysctl: fix proc_dobool() usability	2023-02-21 13:34:07 -08:00
task_work.c	task_work: use try_cmpxchg in task_work_add, task_work_cancel_match and task_work_run	2022-09-11 21:55:10 -07:00
taskstats.c	genetlink: start to validate reserved header bytes	2022-08-29 12:47:15 +01:00
torture.c	torture: Fix hang during kthread shutdown phase	2023-01-05 12:10:35 -08:00
tracepoint.c	tracepoint: Allow livepatch module add trace event	2023-02-18 14:34:36 -05:00
tsacct.c	taskstats: version 12 with thread group and exe info	2022-04-29 14:38:03 -07:00
ucount.c	ucounts: Split rlimit and ucount values and max values	2022-05-18 18:24:57 -05:00
uid16.c
uid16.h
umh.c	umh: simplify the capability pointer logic	2023-03-03 16:18:19 -08:00
up.c
user_namespace.c	userns: fix a struct's kernel-doc notation	2023-02-02 22:50:04 -08:00
user-return-notifier.c
user.c	kernel/user: Allow user_struct::locked_vm to be usable for iommufd	2022-11-30 20:16:49 -04:00
usermode_driver.c	blob_to_mnt(): kern_unmount() is needed to undo kern_mount()	2022-05-19 23:25:47 -04:00
utsname_sysctl.c	kernel/utsname_sysctl.c: Fix hostname polling	2022-10-23 12:01:01 -07:00
utsname.c
vhost_task.c	vhost_task: Allow vhost layer to use copy_process	2023-03-23 12:45:36 +01:00
watch_queue.c	watch_queue: fix IOC_WATCH_QUEUE_SET_SIZE alloc error paths	2023-03-08 11:44:45 +01:00
watchdog_hld.c	Revert "printk: add functions to prefer direct printing"	2022-06-23 18:41:40 +02:00
watchdog.c	powerpc updates for 6.0	2022-08-06 16:38:17 -07:00
workqueue_internal.h
workqueue.c	workqueue: Fold rebind_worker() within rebind_workers()	2023-01-13 07:50:40 -10:00