From afc256e131bb0e1ecb5e2b1df310b20fa7bd714d Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Wed, 2 Oct 2024 17:03:55 +0200 Subject: [PATCH 01/29] locking/spinlocks: Make __raw_* lock ops static If CONFIG_GENERIC_LOCKBREAK=y and CONFIG_DEBUG_LOCK_ALLOC=n (e.g. sh/sdk7786_defconfig): kernel/locking/spinlock.c:68:17: warning: no previous prototype for '__raw_spin_lock' [-Wmissing-prototypes] kernel/locking/spinlock.c:80:26: warning: no previous prototype for '__raw_spin_lock_irqsave' [-Wmissing-prototypes] kernel/locking/spinlock.c:98:17: warning: no previous prototype for '__raw_spin_lock_irq' [-Wmissing-prototypes] kernel/locking/spinlock.c:103:17: warning: no previous prototype for '__raw_spin_lock_bh' [-Wmissing-prototypes] kernel/locking/spinlock.c:68:17: warning: no previous prototype for '__raw_read_lock' [-Wmissing-prototypes] kernel/locking/spinlock.c:80:26: warning: no previous prototype for '__raw_read_lock_irqsave' [-Wmissing-prototypes] kernel/locking/spinlock.c:98:17: warning: no previous prototype for '__raw_read_lock_irq' [-Wmissing-prototypes] kernel/locking/spinlock.c:103:17: warning: no previous prototype for '__raw_read_lock_bh' [-Wmissing-prototypes] kernel/locking/spinlock.c:68:17: warning: no previous prototype for '__raw_write_lock' [-Wmissing-prototypes] kernel/locking/spinlock.c:80:26: warning: no previous prototype for '__raw_write_lock_irqsave' [-Wmissing-prototypes] kernel/locking/spinlock.c:98:17: warning: no previous prototype for '__raw_write_lock_irq' [-Wmissing-prototypes] kernel/locking/spinlock.c:103:17: warning: no previous prototype for '__raw_write_lock_bh' [-Wmissing-prototypes] All __raw_* lock ops are internal functions without external callers. Hence fix this by making them static. Note that if CONFIG_GENERIC_LOCKBREAK=y, no lock ops are inlined, as all of CONFIG_INLINE_*_LOCK* depend on !GENERIC_LOCKBREAK. Signed-off-by: Geert Uytterhoeven Signed-off-by: Peter Zijlstra (Intel) Acked-by: Waiman Long Link: https://lkml.kernel.org/r/7201d7fb408375c6c4df541270d787b1b4a32354.1727879348.git.geert+renesas@glider.be --- kernel/locking/spinlock.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/locking/spinlock.c b/kernel/locking/spinlock.c index 438c6086d540..7685defd7c52 100644 --- a/kernel/locking/spinlock.c +++ b/kernel/locking/spinlock.c @@ -65,7 +65,7 @@ EXPORT_PER_CPU_SYMBOL(__mmiowb_state); * towards that other CPU that it should break the lock ASAP. 
*/ #define BUILD_LOCK_OPS(op, locktype) \ -void __lockfunc __raw_##op##_lock(locktype##_t *lock) \ +static void __lockfunc __raw_##op##_lock(locktype##_t *lock) \ { \ for (;;) { \ preempt_disable(); \ @@ -77,7 +77,7 @@ void __lockfunc __raw_##op##_lock(locktype##_t *lock) \ } \ } \ \ -unsigned long __lockfunc __raw_##op##_lock_irqsave(locktype##_t *lock) \ +static unsigned long __lockfunc __raw_##op##_lock_irqsave(locktype##_t *lock) \ { \ unsigned long flags; \ \ @@ -95,12 +95,12 @@ unsigned long __lockfunc __raw_##op##_lock_irqsave(locktype##_t *lock) \ return flags; \ } \ \ -void __lockfunc __raw_##op##_lock_irq(locktype##_t *lock) \ +static void __lockfunc __raw_##op##_lock_irq(locktype##_t *lock) \ { \ _raw_##op##_lock_irqsave(lock); \ } \ \ -void __lockfunc __raw_##op##_lock_bh(locktype##_t *lock) \ +static void __lockfunc __raw_##op##_lock_bh(locktype##_t *lock) \ { \ unsigned long flags; \ \ From 823a566221a5639f6c69424897218e5d6431a970 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Thomas=20Hellstr=C3=B6m?= Date: Wed, 9 Oct 2024 11:20:31 +0200 Subject: [PATCH 02/29] locking/ww_mutex: Adjust to lockdep nest_lock requirements MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit When using mutex_acquire_nest() with a nest_lock, lockdep refcounts the number of acquired lockdep_maps of mutexes of the same class, and also keeps a pointer to the first acquired lockdep_map of a class. That pointer is then used for various comparison-, printing- and checking purposes, but there is no mechanism to actively ensure that lockdep_map stays in memory. Instead, a warning is printed if the lockdep_map is freed and there are still held locks of the same lock class, even if the lockdep_map itself has been released. In the context of WW/WD transactions that means that if a user unlocks and frees a ww_mutex from within an ongoing ww transaction, and that mutex happens to be the first ww_mutex grabbed in the transaction, such a warning is printed and there might be a risk of a UAF. Note that this is only problem when lockdep is enabled and affects only dereferences of struct lockdep_map. Adjust to this by adding a fake lockdep_map to the acquired context and make sure it is the first acquired lockdep map of the associated ww_mutex class. Then hold it for the duration of the WW/WD transaction. This has the side effect that trying to lock a ww mutex *without* a ww_acquire_context but where a such context has been acquire, we'd see a lockdep splat. The test-ww_mutex.c selftest attempts to do that, so modify that particular test to not acquire a ww_acquire_context if it is not going to be used. Signed-off-by: Thomas Hellström Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20241009092031.6356-1-thomas.hellstrom@linux.intel.com --- include/linux/ww_mutex.h | 14 ++++++++++++++ kernel/locking/test-ww_mutex.c | 8 +++++--- 2 files changed, 19 insertions(+), 3 deletions(-) diff --git a/include/linux/ww_mutex.h b/include/linux/ww_mutex.h index bb763085479a..a401a2f31a77 100644 --- a/include/linux/ww_mutex.h +++ b/include/linux/ww_mutex.h @@ -65,6 +65,16 @@ struct ww_acquire_ctx { #endif #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lockdep_map dep_map; + /** + * @first_lock_dep_map: fake lockdep_map for first locked ww_mutex. + * + * lockdep requires the lockdep_map for the first locked ww_mutex + * in a ww transaction to remain in memory until all ww_mutexes of + * the transaction have been unlocked. 
Ensure this by keeping a + * fake locked ww_mutex lockdep map between ww_acquire_init() and + * ww_acquire_fini(). + */ + struct lockdep_map first_lock_dep_map; #endif #ifdef CONFIG_DEBUG_WW_MUTEX_SLOWPATH unsigned int deadlock_inject_interval; @@ -146,7 +156,10 @@ static inline void ww_acquire_init(struct ww_acquire_ctx *ctx, debug_check_no_locks_freed((void *)ctx, sizeof(*ctx)); lockdep_init_map(&ctx->dep_map, ww_class->acquire_name, &ww_class->acquire_key, 0); + lockdep_init_map(&ctx->first_lock_dep_map, ww_class->mutex_name, + &ww_class->mutex_key, 0); mutex_acquire(&ctx->dep_map, 0, 0, _RET_IP_); + mutex_acquire_nest(&ctx->first_lock_dep_map, 0, 0, &ctx->dep_map, _RET_IP_); #endif #ifdef CONFIG_DEBUG_WW_MUTEX_SLOWPATH ctx->deadlock_inject_interval = 1; @@ -185,6 +198,7 @@ static inline void ww_acquire_done(struct ww_acquire_ctx *ctx) static inline void ww_acquire_fini(struct ww_acquire_ctx *ctx) { #ifdef CONFIG_DEBUG_LOCK_ALLOC + mutex_release(&ctx->first_lock_dep_map, _THIS_IP_); mutex_release(&ctx->dep_map, _THIS_IP_); #endif #ifdef DEBUG_WW_MUTEXES diff --git a/kernel/locking/test-ww_mutex.c b/kernel/locking/test-ww_mutex.c index 10a5736a21c2..5d58b2c0ef98 100644 --- a/kernel/locking/test-ww_mutex.c +++ b/kernel/locking/test-ww_mutex.c @@ -62,7 +62,8 @@ static int __test_mutex(unsigned int flags) int ret; ww_mutex_init(&mtx.mutex, &ww_class); - ww_acquire_init(&ctx, &ww_class); + if (flags & TEST_MTX_CTX) + ww_acquire_init(&ctx, &ww_class); INIT_WORK_ONSTACK(&mtx.work, test_mutex_work); init_completion(&mtx.ready); @@ -90,7 +91,8 @@ static int __test_mutex(unsigned int flags) ret = wait_for_completion_timeout(&mtx.done, TIMEOUT); } ww_mutex_unlock(&mtx.mutex); - ww_acquire_fini(&ctx); + if (flags & TEST_MTX_CTX) + ww_acquire_fini(&ctx); if (ret) { pr_err("%s(flags=%x): mutual exclusion failure\n", @@ -679,7 +681,7 @@ static int __init test_ww_mutex_init(void) if (ret) return ret; - ret = stress(2047, hweight32(STRESS_ALL)*ncpus, STRESS_ALL); + ret = stress(2046, hweight32(STRESS_ALL)*ncpus, STRESS_ALL); if (ret) return ret; From 19298f48694987fac843261c84e24834c255b451 Mon Sep 17 00:00:00 2001 From: Uros Bizjak Date: Thu, 10 Oct 2024 09:10:04 +0200 Subject: [PATCH 03/29] futex: Use atomic64_inc_return() in get_inode_sequence_number() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Use atomic64_inc_return(&ref) instead of atomic64_add_return(1, &ref) to use optimized implementation and ease register pressure around the primitive for targets that implement optimized variant. 
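As a hedged illustration of the idiom (hypothetical counter and helper names, not the futex code itself):

  #include <linux/atomic.h>
  #include <linux/types.h>

  static atomic64_t example_counter = ATOMIC64_INIT(0);

  static u64 example_next(void)
  {
          /* was: atomic64_add_return(1, &example_counter); */
          return atomic64_inc_return(&example_counter);
  }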
Signed-off-by: Uros Bizjak Signed-off-by: Thomas Gleixner Reviewed-by: André Almeida Link: https://lore.kernel.org/all/20241010071023.21913-1-ubizjak@gmail.com --- kernel/futex/core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/futex/core.c b/kernel/futex/core.c index 136768ae2637..3146730e55f7 100644 --- a/kernel/futex/core.c +++ b/kernel/futex/core.c @@ -181,7 +181,7 @@ static u64 get_inode_sequence_number(struct inode *inode) return old; for (;;) { - u64 new = atomic64_add_return(1, &i_seq); + u64 new = atomic64_inc_return(&i_seq); if (WARN_ON_ONCE(!new)) continue; From 87347f148061b48c3495fb61dcbad384760da9cf Mon Sep 17 00:00:00 2001 From: Uros Bizjak Date: Thu, 10 Oct 2024 09:10:05 +0200 Subject: [PATCH 04/29] futex: Use atomic64_try_cmpxchg_relaxed() in get_inode_sequence_number() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Optimize get_inode_sequence_number() to use simpler and faster: !atomic64_try_cmpxchg_relaxed(*ptr, &old, new) instead of: atomic64_cmpxchg relaxed(*ptr, old, new) != old The x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg. The generated code improves from: 3da: 31 c0 xor %eax,%eax 3dc: f0 48 0f b1 8a 38 01 lock cmpxchg %rcx,0x138(%rdx) 3e3: 00 00 3e5: 48 85 c0 test %rax,%rax 3e8: 48 0f 44 c1 cmove %rcx,%rax to: 3da: 31 c0 xor %eax,%eax 3dc: f0 48 0f b1 8a 38 01 lock cmpxchg %rcx,0x138(%rdx) 3e3: 00 00 3e5: 48 0f 44 c1 cmove %rcx,%rax Signed-off-by: Uros Bizjak Signed-off-by: Thomas Gleixner Reviewed-by: André Almeida Link: https://lore.kernel.org/all/20241010071023.21913-2-ubizjak@gmail.com --- kernel/futex/core.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/futex/core.c b/kernel/futex/core.c index 3146730e55f7..692912bf1252 100644 --- a/kernel/futex/core.c +++ b/kernel/futex/core.c @@ -185,8 +185,8 @@ static u64 get_inode_sequence_number(struct inode *inode) if (WARN_ON_ONCE(!new)) continue; - old = atomic64_cmpxchg_relaxed(&inode->i_sequence, 0, new); - if (old) + old = 0; + if (!atomic64_try_cmpxchg_relaxed(&inode->i_sequence, &old, new)) return old; return new; } From 0784181b44af831a3fa52e1e5ff77c388d699dba Mon Sep 17 00:00:00 2001 From: David Woodhouse Date: Thu, 26 Sep 2024 16:17:37 +0100 Subject: [PATCH 05/29] lockdep: Add lockdep_cleanup_dead_cpu() Add a function to check that an offline CPU has left the tracing infrastructure in a sane state. Commit 9bb69ba4c177 ("ACPI: processor_idle: use raw_safe_halt() in acpi_idle_play_dead()") fixed an issue where the acpi_idle_play_dead() function called safe_halt() instead of raw_safe_halt(), which had the side-effect of setting the hardirqs_enabled flag for the offline CPU. On x86 this triggered warnings from lockdep_assert_irqs_disabled() when the CPU was brought back online again later. These warnings were too early for the exception to be handled correctly, leading to a triple-fault. Add lockdep_cleanup_dead_cpu() to check for this kind of failure mode, print the events leading up to it, and correct it so that the CPU can come online again correctly. Re-introducing the original bug now merely results in this warning instead: [ 61.556652] smpboot: CPU 1 is now offline [ 61.556769] CPU 1 left hardirqs enabled! 
[ 61.556915] irq event stamp: 128149 [ 61.556965] hardirqs last enabled at (128149): [] acpi_idle_play_dead+0x46/0x70 [ 61.557055] hardirqs last disabled at (128148): [] do_idle+0x90/0xe0 [ 61.557117] softirqs last enabled at (128078): [] __do_softirq+0x31c/0x423 [ 61.557199] softirqs last disabled at (128065): [] __irq_exit_rcu+0x91/0x100 [boqun: Capitalize the title and reword the message a bit] Signed-off-by: David Woodhouse Reviewed-by: Thomas Gleixner Signed-off-by: Boqun Feng Link: https://lore.kernel.org/r/f7bd2b3b999051bb3ef4be34526a9262008285f5.camel@infradead.org --- include/linux/irqflags.h | 6 ++++++ kernel/cpu.c | 1 + kernel/locking/lockdep.c | 24 ++++++++++++++++++++++++ 3 files changed, 31 insertions(+) diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h index 3f003d5fde53..57b074e0cfbb 100644 --- a/include/linux/irqflags.h +++ b/include/linux/irqflags.h @@ -18,6 +18,8 @@ #include #include +struct task_struct; + /* Currently lockdep_softirqs_on/off is used only by lockdep */ #ifdef CONFIG_PROVE_LOCKING extern void lockdep_softirqs_on(unsigned long ip); @@ -25,12 +27,16 @@ extern void lockdep_hardirqs_on_prepare(void); extern void lockdep_hardirqs_on(unsigned long ip); extern void lockdep_hardirqs_off(unsigned long ip); + extern void lockdep_cleanup_dead_cpu(unsigned int cpu, + struct task_struct *idle); #else static inline void lockdep_softirqs_on(unsigned long ip) { } static inline void lockdep_softirqs_off(unsigned long ip) { } static inline void lockdep_hardirqs_on_prepare(void) { } static inline void lockdep_hardirqs_on(unsigned long ip) { } static inline void lockdep_hardirqs_off(unsigned long ip) { } + static inline void lockdep_cleanup_dead_cpu(unsigned int cpu, + struct task_struct *idle) {} #endif #ifdef CONFIG_TRACE_IRQFLAGS diff --git a/kernel/cpu.c b/kernel/cpu.c index d293d52a3e00..c4aaf73dec9e 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -1338,6 +1338,7 @@ static int takedown_cpu(unsigned int cpu) cpuhp_bp_sync_dead(cpu); + lockdep_cleanup_dead_cpu(cpu, idle_thread_get(cpu)); tick_cleanup_dead_cpu(cpu); /* diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 536bd471557f..6fd4af217e71 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -4586,6 +4586,30 @@ void lockdep_softirqs_off(unsigned long ip) debug_atomic_inc(redundant_softirqs_off); } +/** + * lockdep_cleanup_dead_cpu - Ensure CPU lockdep state is cleanly stopped + * + * @cpu: index of offlined CPU + * @idle: task pointer for offlined CPU's idle thread + * + * Invoked after the CPU is dead. Ensures that the tracing infrastructure + * is left in a suitable state for the CPU to be subsequently brought + * online again. + */ +void lockdep_cleanup_dead_cpu(unsigned int cpu, struct task_struct *idle) +{ + if (unlikely(!debug_locks)) + return; + + if (unlikely(per_cpu(hardirqs_enabled, cpu))) { + pr_warn("CPU %u left hardirqs enabled!", cpu); + if (idle) + print_irqtrace_events(idle); + /* Clean it up for when the CPU comes online again. */ + per_cpu(hardirqs_enabled, cpu) = 0; + } +} + static int mark_usage(struct task_struct *curr, struct held_lock *hlock, int check) { From d7fe143cb115076fed0126ad8cf5ba6c3e575e43 Mon Sep 17 00:00:00 2001 From: Ahmed Ehab Date: Sun, 25 Aug 2024 01:10:30 +0300 Subject: [PATCH 06/29] locking/lockdep: Avoid creating new name string literals in lockdep_set_subclass() Syzbot reports a problem that a warning will be triggered while searching a lock class in look_up_lock_class(). 
The cause of the issue is that a new name is created and used by lockdep_set_subclass() instead of using the existing one. This results in a lock instance has a different name pointer than previous registered one stored in lock class, and WARN_ONCE() is triggered because of that in look_up_lock_class(). To fix this, change lockdep_set_subclass() to use the existing name instead of a new one. Hence, no new name will be created by lockdep_set_subclass(). Hence, the warning is avoided. [boqun: Reword the commit log to state the correct issue] Reported-by: Fixes: de8f5e4f2dc1f ("lockdep: Introduce wait-type checks") Cc: stable@vger.kernel.org Signed-off-by: Ahmed Ehab Signed-off-by: Boqun Feng Link: https://lore.kernel.org/lkml/20240824221031.7751-1-bottaawesome633@gmail.com/ --- include/linux/lockdep.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h index 217f7abf2cbf..67964dc4db95 100644 --- a/include/linux/lockdep.h +++ b/include/linux/lockdep.h @@ -173,7 +173,7 @@ static inline void lockdep_init_map(struct lockdep_map *lock, const char *name, (lock)->dep_map.lock_type) #define lockdep_set_subclass(lock, sub) \ - lockdep_init_map_type(&(lock)->dep_map, #lock, (lock)->dep_map.key, sub,\ + lockdep_init_map_type(&(lock)->dep_map, (lock)->dep_map.name, (lock)->dep_map.key, sub,\ (lock)->dep_map.wait_type_inner, \ (lock)->dep_map.wait_type_outer, \ (lock)->dep_map.lock_type) From 5eadeb7b3bc206e2ac9494e9499e7c1f1e44eab7 Mon Sep 17 00:00:00 2001 From: Ahmed Ehab Date: Thu, 5 Sep 2024 04:12:20 +0300 Subject: [PATCH 07/29] locking/lockdep: Add a test for lockdep_set_subclass() Add a test case to ensure that no new name string literal will be created in lockdep_set_subclass(), otherwise a warning will be triggered in look_up_lock_class(). Add this to catch the problem in the future. [boqun: Reword the title, replace #if with #ifdef and rename functions and variables] Signed-off-by: Ahmed Ehab Signed-off-by: Boqun Feng Link: https://lore.kernel.org/lkml/20240905011220.356973-1-bottaawesome633@gmail.com/ --- lib/locking-selftest.c | 39 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) diff --git a/lib/locking-selftest.c b/lib/locking-selftest.c index 6f6a5fc85b42..6e0c019f71b6 100644 --- a/lib/locking-selftest.c +++ b/lib/locking-selftest.c @@ -2710,6 +2710,43 @@ static void local_lock_3B(void) } +#ifdef CONFIG_DEBUG_LOCK_ALLOC +static inline const char *rw_semaphore_lockdep_name(struct rw_semaphore *rwsem) +{ + return rwsem->dep_map.name; +} +#else +static inline const char *rw_semaphore_lockdep_name(struct rw_semaphore *rwsem) +{ + return NULL; +} +#endif + +static void test_lockdep_set_subclass_name(void) +{ + const char *name_before = rw_semaphore_lockdep_name(&rwsem_X1); + const char *name_after; + + lockdep_set_subclass(&rwsem_X1, 1); + name_after = rw_semaphore_lockdep_name(&rwsem_X1); + DEBUG_LOCKS_WARN_ON(name_before != name_after); +} + +/* + * lockdep_set_subclass() should reuse the existing lock class name instead + * of creating a new one. 
+ */ +static void lockdep_set_subclass_name_test(void) +{ + printk(" --------------------------------------------------------------------------\n"); + printk(" | lockdep_set_subclass() name test|\n"); + printk(" -----------------------------------\n"); + + print_testname("compare name before and after"); + dotest(test_lockdep_set_subclass_name, SUCCESS, LOCKTYPE_RWSEM); + pr_cont("\n"); +} + static void local_lock_tests(void) { printk(" --------------------------------------------------------------------------\n"); @@ -2920,6 +2957,8 @@ void locking_selftest(void) dotest(hardirq_deadlock_softirq_not_deadlock, FAILURE, LOCKTYPE_SPECIAL); pr_cont("\n"); + lockdep_set_subclass_name_test(); + if (unexpected_testcase_failures) { printk("-----------------------------------------------------------------\n"); debug_locks = 0; From e48bf7ca6056297664eb260fa88cae8e50d9b698 Mon Sep 17 00:00:00 2001 From: "Jiri Slaby (SUSE)" Date: Mon, 7 Oct 2024 08:54:57 +0200 Subject: [PATCH 08/29] lockdep: Use info level for lockdep initial info messages All those: Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar ... MAX_LOCKDEP_SUBCLASSES: 8 ... MAX_LOCK_DEPTH: 48 ... MAX_LOCKDEP_KEYS: 8192 and so on are dumped with the KERN_WARNING level. It is due to missing KERN_* annotation. Use pr_info() instead of bare printk() to dump the info with the info level. Signed-off-by: Jiri Slaby (SUSE) Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Reviewed-by: Waiman Long Signed-off-by: Boqun Feng Link: https://lore.kernel.org/r/20241007065457.20128-1-jirislaby@kernel.org --- kernel/locking/lockdep.c | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 6fd4af217e71..2d8ec0351ef9 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -6600,17 +6600,17 @@ EXPORT_SYMBOL_GPL(lockdep_unregister_key); void __init lockdep_init(void) { - printk("Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar\n"); + pr_info("Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar\n"); - printk("... MAX_LOCKDEP_SUBCLASSES: %lu\n", MAX_LOCKDEP_SUBCLASSES); - printk("... MAX_LOCK_DEPTH: %lu\n", MAX_LOCK_DEPTH); - printk("... MAX_LOCKDEP_KEYS: %lu\n", MAX_LOCKDEP_KEYS); - printk("... CLASSHASH_SIZE: %lu\n", CLASSHASH_SIZE); - printk("... MAX_LOCKDEP_ENTRIES: %lu\n", MAX_LOCKDEP_ENTRIES); - printk("... MAX_LOCKDEP_CHAINS: %lu\n", MAX_LOCKDEP_CHAINS); - printk("... CHAINHASH_SIZE: %lu\n", CHAINHASH_SIZE); + pr_info("... MAX_LOCKDEP_SUBCLASSES: %lu\n", MAX_LOCKDEP_SUBCLASSES); + pr_info("... MAX_LOCK_DEPTH: %lu\n", MAX_LOCK_DEPTH); + pr_info("... MAX_LOCKDEP_KEYS: %lu\n", MAX_LOCKDEP_KEYS); + pr_info("... CLASSHASH_SIZE: %lu\n", CLASSHASH_SIZE); + pr_info("... MAX_LOCKDEP_ENTRIES: %lu\n", MAX_LOCKDEP_ENTRIES); + pr_info("... MAX_LOCKDEP_CHAINS: %lu\n", MAX_LOCKDEP_CHAINS); + pr_info("... 
CHAINHASH_SIZE: %lu\n", CHAINHASH_SIZE); - printk(" memory used by lock dependency info: %zu kB\n", + pr_info(" memory used by lock dependency info: %zu kB\n", (sizeof(lock_classes) + sizeof(lock_classes_in_use) + sizeof(classhash_table) + @@ -6628,12 +6628,12 @@ void __init lockdep_init(void) ); #if defined(CONFIG_TRACE_IRQFLAGS) && defined(CONFIG_PROVE_LOCKING) - printk(" memory used for stack traces: %zu kB\n", + pr_info(" memory used for stack traces: %zu kB\n", (sizeof(stack_trace) + sizeof(stack_trace_hash)) / 1024 ); #endif - printk(" per task-struct memory footprint: %zu bytes\n", + pr_info(" per task-struct memory footprint: %zu bytes\n", sizeof(((struct task_struct *)NULL)->held_locks)); } From 560af5dc839eef08a273908f390cfefefb82aa04 Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Wed, 9 Oct 2024 17:45:03 +0200 Subject: [PATCH 09/29] lockdep: Enable PROVE_RAW_LOCK_NESTING with PROVE_LOCKING. With the printk issues solved, the last known splat created by PROVE_RAW_LOCK_NESTING is gone. Enable PROVE_RAW_LOCK_NESTING by default as part of PROVE_LOCKING. Keep the defines around in case something serious pops up and it needs to be disabled. Signed-off-by: Sebastian Andrzej Siewior Acked-by: Waiman Long Signed-off-by: Boqun Feng Link: https://lore.kernel.org/r/20241009161041.1018375-2-bigeasy@linutronix.de --- lib/Kconfig.debug | 12 ++---------- 1 file changed, 2 insertions(+), 10 deletions(-) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 7315f643817a..5b67816f4a62 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1409,22 +1409,14 @@ config PROVE_LOCKING For more details, see Documentation/locking/lockdep-design.rst. config PROVE_RAW_LOCK_NESTING - bool "Enable raw_spinlock - spinlock nesting checks" + bool depends on PROVE_LOCKING - default n + default y help Enable the raw_spinlock vs. spinlock nesting checks which ensure that the lock nesting rules for PREEMPT_RT enabled kernels are not violated. - NOTE: There are known nesting problems. So if you enable this - option expect lockdep splats until these problems have been fully - addressed which is work in progress. This config switch allows to - identify and analyze these problems. It will be removed and the - check permanently enabled once the main issues have been fixed. - - If unsure, select N. - config LOCK_STAT bool "Lock usage statistics" depends on DEBUG_KERNEL && LOCK_DEBUGGING_SUPPORT From 2628cbd03924b91a360f72117a9b9c78cfd050e7 Mon Sep 17 00:00:00 2001 From: Qiuxu Zhuo Date: Fri, 9 Aug 2024 09:48:02 +0800 Subject: [PATCH 10/29] locking/pvqspinlock: Convert fields of 'enum vcpu_state' to uppercase Convert the fields of 'enum vcpu_state' to uppercase for better readability. No functional changes intended. Acked-by: Waiman Long Signed-off-by: Qiuxu Zhuo Signed-off-by: Boqun Feng Link: https://lore.kernel.org/r/20240809014802.15320-1-qiuxu.zhuo@intel.com --- kernel/locking/qspinlock_paravirt.h | 36 ++++++++++++++--------------- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h index ac2e22502741..dc1cb90e3644 100644 --- a/kernel/locking/qspinlock_paravirt.h +++ b/kernel/locking/qspinlock_paravirt.h @@ -38,13 +38,13 @@ #define PV_PREV_CHECK_MASK 0xff /* - * Queue node uses: vcpu_running & vcpu_halted. - * Queue head uses: vcpu_running & vcpu_hashed. + * Queue node uses: VCPU_RUNNING & VCPU_HALTED. + * Queue head uses: VCPU_RUNNING & VCPU_HASHED. 
*/ enum vcpu_state { - vcpu_running = 0, - vcpu_halted, /* Used only in pv_wait_node */ - vcpu_hashed, /* = pv_hash'ed + vcpu_halted */ + VCPU_RUNNING = 0, + VCPU_HALTED, /* Used only in pv_wait_node */ + VCPU_HASHED, /* = pv_hash'ed + VCPU_HALTED */ }; struct pv_node { @@ -266,7 +266,7 @@ pv_wait_early(struct pv_node *prev, int loop) if ((loop & PV_PREV_CHECK_MASK) != 0) return false; - return READ_ONCE(prev->state) != vcpu_running; + return READ_ONCE(prev->state) != VCPU_RUNNING; } /* @@ -279,7 +279,7 @@ static void pv_init_node(struct mcs_spinlock *node) BUILD_BUG_ON(sizeof(struct pv_node) > sizeof(struct qnode)); pn->cpu = smp_processor_id(); - pn->state = vcpu_running; + pn->state = VCPU_RUNNING; } /* @@ -308,26 +308,26 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev) /* * Order pn->state vs pn->locked thusly: * - * [S] pn->state = vcpu_halted [S] next->locked = 1 + * [S] pn->state = VCPU_HALTED [S] next->locked = 1 * MB MB - * [L] pn->locked [RmW] pn->state = vcpu_hashed + * [L] pn->locked [RmW] pn->state = VCPU_HASHED * * Matches the cmpxchg() from pv_kick_node(). */ - smp_store_mb(pn->state, vcpu_halted); + smp_store_mb(pn->state, VCPU_HALTED); if (!READ_ONCE(node->locked)) { lockevent_inc(pv_wait_node); lockevent_cond_inc(pv_wait_early, wait_early); - pv_wait(&pn->state, vcpu_halted); + pv_wait(&pn->state, VCPU_HALTED); } /* - * If pv_kick_node() changed us to vcpu_hashed, retain that + * If pv_kick_node() changed us to VCPU_HASHED, retain that * value so that pv_wait_head_or_lock() knows to not also try * to hash this lock. */ - cmpxchg(&pn->state, vcpu_halted, vcpu_running); + cmpxchg(&pn->state, VCPU_HALTED, VCPU_RUNNING); /* * If the locked flag is still not set after wakeup, it is a @@ -357,7 +357,7 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev) static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node) { struct pv_node *pn = (struct pv_node *)node; - u8 old = vcpu_halted; + u8 old = VCPU_HALTED; /* * If the vCPU is indeed halted, advance its state to match that of * pv_wait_node(). If OTOH this fails, the vCPU was running and will @@ -374,7 +374,7 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node) * subsequent writes. */ smp_mb__before_atomic(); - if (!try_cmpxchg_relaxed(&pn->state, &old, vcpu_hashed)) + if (!try_cmpxchg_relaxed(&pn->state, &old, VCPU_HASHED)) return; /* @@ -407,7 +407,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node) * If pv_kick_node() already advanced our state, we don't need to * insert ourselves into the hash table anymore. */ - if (READ_ONCE(pn->state) == vcpu_hashed) + if (READ_ONCE(pn->state) == VCPU_HASHED) lp = (struct qspinlock **)1; /* @@ -420,7 +420,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node) * Set correct vCPU state to be used by queue node wait-early * mechanism. 
*/ - WRITE_ONCE(pn->state, vcpu_running); + WRITE_ONCE(pn->state, VCPU_RUNNING); /* * Set the pending bit in the active lock spinning loop to @@ -460,7 +460,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node) goto gotlock; } } - WRITE_ONCE(pn->state, vcpu_hashed); + WRITE_ONCE(pn->state, VCPU_HASHED); lockevent_inc(pv_wait_head); lockevent_cond_inc(pv_wait_again, waitcnt); pv_wait(&lock->locked, _Q_SLOW_VAL); From 52e0874fc16bd26e9ea1871e30ffb2c6dff187cf Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Mon, 12 Aug 2024 12:39:02 +0200 Subject: [PATCH 11/29] locking/rt: Add sparse annotation PREEMPT_RT's sleeping locks. The sleeping locks on PREEMPT_RT (rt_spin_lock() and friends) lack sparse annotation. Therefore a missing spin_unlock() won't be spotted by sparse in a PREEMPT_RT build while it is noticed on a !PREEMPT_RT build. Add the __acquires/__releases macros to the lock/ unlock functions. The trylock functions already use the __cond_lock() wrapper. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/all/20240812104200.2239232-2-bigeasy@linutronix.de --- include/linux/rwlock_rt.h | 10 +++++----- include/linux/spinlock_rt.h | 8 ++++---- 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/include/linux/rwlock_rt.h b/include/linux/rwlock_rt.h index 8544ff05e594..7d81fc6918ee 100644 --- a/include/linux/rwlock_rt.h +++ b/include/linux/rwlock_rt.h @@ -24,13 +24,13 @@ do { \ __rt_rwlock_init(rwl, #rwl, &__key); \ } while (0) -extern void rt_read_lock(rwlock_t *rwlock); +extern void rt_read_lock(rwlock_t *rwlock) __acquires(rwlock); extern int rt_read_trylock(rwlock_t *rwlock); -extern void rt_read_unlock(rwlock_t *rwlock); -extern void rt_write_lock(rwlock_t *rwlock); -extern void rt_write_lock_nested(rwlock_t *rwlock, int subclass); +extern void rt_read_unlock(rwlock_t *rwlock) __releases(rwlock); +extern void rt_write_lock(rwlock_t *rwlock) __acquires(rwlock); +extern void rt_write_lock_nested(rwlock_t *rwlock, int subclass) __acquires(rwlock); extern int rt_write_trylock(rwlock_t *rwlock); -extern void rt_write_unlock(rwlock_t *rwlock); +extern void rt_write_unlock(rwlock_t *rwlock) __releases(rwlock); static __always_inline void read_lock(rwlock_t *rwlock) { diff --git a/include/linux/spinlock_rt.h b/include/linux/spinlock_rt.h index 61c49b16f69a..babc3e028779 100644 --- a/include/linux/spinlock_rt.h +++ b/include/linux/spinlock_rt.h @@ -32,10 +32,10 @@ do { \ __rt_spin_lock_init(slock, #slock, &__key, true); \ } while (0) -extern void rt_spin_lock(spinlock_t *lock); -extern void rt_spin_lock_nested(spinlock_t *lock, int subclass); -extern void rt_spin_lock_nest_lock(spinlock_t *lock, struct lockdep_map *nest_lock); -extern void rt_spin_unlock(spinlock_t *lock); +extern void rt_spin_lock(spinlock_t *lock) __acquires(lock); +extern void rt_spin_lock_nested(spinlock_t *lock, int subclass) __acquires(lock); +extern void rt_spin_lock_nest_lock(spinlock_t *lock, struct lockdep_map *nest_lock) __acquires(lock); +extern void rt_spin_unlock(spinlock_t *lock) __releases(lock); extern void rt_spin_lock_unlock(spinlock_t *lock); extern int rt_spin_trylock_bh(spinlock_t *lock); extern int rt_spin_trylock(spinlock_t *lock); From b1f01f9e54b1aaadb6740f86017e8fabdee77fe2 Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Mon, 12 Aug 2024 12:39:03 +0200 Subject: [PATCH 12/29] locking/rt: Remove one __cond_lock() in RT's spin_trylock_irqsave() spin_trylock_irqsave() has a __cond_lock() wrapper 
which points to __spin_trylock_irqsave(). The function then invokes spin_trylock() which has another __cond_lock() finally pointing to rt_spin_trylock(). The compiler has no problem to parse this but sparse does not recognise that users of spin_trylock_irqsave() acquire a conditional lock and complains. Remove one layer of __cond_lock() so that sparse recognises conditional locking. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/all/20240812104200.2239232-3-bigeasy@linutronix.de --- include/linux/spinlock_rt.h | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/include/linux/spinlock_rt.h b/include/linux/spinlock_rt.h index babc3e028779..f9f14e135be7 100644 --- a/include/linux/spinlock_rt.h +++ b/include/linux/spinlock_rt.h @@ -132,7 +132,7 @@ static __always_inline void spin_unlock_irqrestore(spinlock_t *lock, #define spin_trylock_irq(lock) \ __cond_lock(lock, rt_spin_trylock(lock)) -#define __spin_trylock_irqsave(lock, flags) \ +#define spin_trylock_irqsave(lock, flags) \ ({ \ int __locked; \ \ @@ -142,9 +142,6 @@ static __always_inline void spin_unlock_irqrestore(spinlock_t *lock, __locked; \ }) -#define spin_trylock_irqsave(lock, flags) \ - __cond_lock(lock, __spin_trylock_irqsave(lock, flags)) - #define spin_is_contended(lock) (((void)(lock), 0)) static inline int spin_is_locked(spinlock_t *lock) From 168660b826a77fda28235e0b0b3027041d6a5240 Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Mon, 12 Aug 2024 12:39:04 +0200 Subject: [PATCH 13/29] locking/rt: Add sparse annotation for RCU. Every lock, that becomes a sleeping lock on PREEMPT_RT, starts a RCU read side critical section. There is no sparse annotation for this and sparse complains about unbalanced locking. Add __acquires/ __releases for the RCU lock. This covers all but the trylock functions. A __cond_acquires() annotation didn't work. 
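A hedged sketch of the annotation style (hypothetical helpers, mirroring what the hunks below do for rt_spin_lock()/rt_spin_unlock()): the lock side opens an RCU read-side critical section, the unlock side closes it, and the paired __acquires(RCU)/__releases(RCU) annotations let sparse see the balance.

  #include <linux/rcupdate.h>

  static void example_rt_lock(void) __acquires(RCU)
  {
          rcu_read_lock();
          /* ... acquire the underlying rtmutex ... */
  }

  static void example_rt_unlock(void) __releases(RCU)
  {
          /* ... release the underlying rtmutex ... */
          rcu_read_unlock();
  }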
Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/all/20240812104200.2239232-4-bigeasy@linutronix.de --- kernel/locking/spinlock_rt.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/kernel/locking/spinlock_rt.c b/kernel/locking/spinlock_rt.c index 38e292454fcc..d1cf8b2b6dca 100644 --- a/kernel/locking/spinlock_rt.c +++ b/kernel/locking/spinlock_rt.c @@ -51,7 +51,7 @@ static __always_inline void __rt_spin_lock(spinlock_t *lock) migrate_disable(); } -void __sched rt_spin_lock(spinlock_t *lock) +void __sched rt_spin_lock(spinlock_t *lock) __acquires(RCU) { spin_acquire(&lock->dep_map, 0, 0, _RET_IP_); __rt_spin_lock(lock); @@ -75,7 +75,7 @@ void __sched rt_spin_lock_nest_lock(spinlock_t *lock, EXPORT_SYMBOL(rt_spin_lock_nest_lock); #endif -void __sched rt_spin_unlock(spinlock_t *lock) +void __sched rt_spin_unlock(spinlock_t *lock) __releases(RCU) { spin_release(&lock->dep_map, _RET_IP_); migrate_enable(); @@ -225,7 +225,7 @@ int __sched rt_write_trylock(rwlock_t *rwlock) } EXPORT_SYMBOL(rt_write_trylock); -void __sched rt_read_lock(rwlock_t *rwlock) +void __sched rt_read_lock(rwlock_t *rwlock) __acquires(RCU) { rtlock_might_resched(); rwlock_acquire_read(&rwlock->dep_map, 0, 0, _RET_IP_); @@ -235,7 +235,7 @@ void __sched rt_read_lock(rwlock_t *rwlock) } EXPORT_SYMBOL(rt_read_lock); -void __sched rt_write_lock(rwlock_t *rwlock) +void __sched rt_write_lock(rwlock_t *rwlock) __acquires(RCU) { rtlock_might_resched(); rwlock_acquire(&rwlock->dep_map, 0, 0, _RET_IP_); @@ -246,7 +246,7 @@ void __sched rt_write_lock(rwlock_t *rwlock) EXPORT_SYMBOL(rt_write_lock); #ifdef CONFIG_DEBUG_LOCK_ALLOC -void __sched rt_write_lock_nested(rwlock_t *rwlock, int subclass) +void __sched rt_write_lock_nested(rwlock_t *rwlock, int subclass) __acquires(RCU) { rtlock_might_resched(); rwlock_acquire(&rwlock->dep_map, subclass, 0, _RET_IP_); @@ -257,7 +257,7 @@ void __sched rt_write_lock_nested(rwlock_t *rwlock, int subclass) EXPORT_SYMBOL(rt_write_lock_nested); #endif -void __sched rt_read_unlock(rwlock_t *rwlock) +void __sched rt_read_unlock(rwlock_t *rwlock) __releases(RCU) { rwlock_release(&rwlock->dep_map, _RET_IP_); migrate_enable(); @@ -266,7 +266,7 @@ void __sched rt_read_unlock(rwlock_t *rwlock) } EXPORT_SYMBOL(rt_read_unlock); -void __sched rt_write_unlock(rwlock_t *rwlock) +void __sched rt_write_unlock(rwlock_t *rwlock) __releases(RCU) { rwlock_release(&rwlock->dep_map, _RET_IP_); rcu_read_unlock(); From 77abd3b7d9bf384306872b6201b1dfeb1e899892 Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Mon, 12 Aug 2024 12:39:05 +0200 Subject: [PATCH 14/29] locking/rt: Annotate unlock followed by lock for sparse. rt_mutex_slowlock_block() and rtlock_slowlock_locked() both unlock lock::wait_lock and then lock it later. This is unusual and sparse complains about it. Add __releases() + __acquires() annotation to mark that it is expected. 
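As a hedged illustration (hypothetical structure and helper, not the rtmutex code), the annotated shape is a function that is entered with wait_lock held, drops it while blocking, and re-takes it before returning:

  #include <linux/spinlock.h>

  struct example_waitqueue {
          raw_spinlock_t wait_lock;
  };

  static void example_wait_block(struct example_waitqueue *wq)
          __releases(&wq->wait_lock) __acquires(&wq->wait_lock)
  {
          raw_spin_unlock_irq(&wq->wait_lock);
          /* ... sleep until woken ... */
          raw_spin_lock_irq(&wq->wait_lock);
  }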
Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner Link: https://lore.kernel.org/all/20240812104200.2239232-5-bigeasy@linutronix.de --- kernel/locking/rtmutex.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c index ebebd0eec7f6..d3b72c2f983f 100644 --- a/kernel/locking/rtmutex.c +++ b/kernel/locking/rtmutex.c @@ -1601,6 +1601,7 @@ static int __sched rt_mutex_slowlock_block(struct rt_mutex_base *lock, unsigned int state, struct hrtimer_sleeper *timeout, struct rt_mutex_waiter *waiter) + __releases(&lock->wait_lock) __acquires(&lock->wait_lock) { struct rt_mutex *rtm = container_of(lock, struct rt_mutex, rtmutex); struct task_struct *owner; @@ -1805,6 +1806,7 @@ static __always_inline int __rt_mutex_lock(struct rt_mutex_base *lock, * @lock: The underlying RT mutex */ static void __sched rtlock_slowlock_locked(struct rt_mutex_base *lock) + __releases(&lock->wait_lock) __acquires(&lock->wait_lock) { struct rt_mutex_waiter waiter; struct task_struct *owner; From d12b802f183667d4c28589314c99c380a458d57e Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Tue, 8 Oct 2024 11:26:06 +0200 Subject: [PATCH 15/29] locking/rtmutex: Fix misleading comment Going through the RCU-boost and rtmutex code, I ran into this utterly confusing comment. Fix it to avoid confusing future readers. [ tglx: Wordsmithed the comment ] Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Thomas Gleixner Reviewed-by: Sebastian Andrzej Siewior Link: https://lore.kernel.org/all/20241008092606.GJ33184@noisy.programming.kicks-ass.net --- kernel/locking/rtmutex_api.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/locking/rtmutex_api.c b/kernel/locking/rtmutex_api.c index a6974d044593..7e79258feb27 100644 --- a/kernel/locking/rtmutex_api.c +++ b/kernel/locking/rtmutex_api.c @@ -175,10 +175,10 @@ bool __sched __rt_mutex_futex_unlock(struct rt_mutex_base *lock, } /* - * We've already deboosted, mark_wakeup_next_waiter() will - * retain preempt_disabled when we drop the wait_lock, to - * avoid inversion prior to the wakeup. preempt_disable() - * therein pairs with rt_mutex_postunlock(). + * mark_wakeup_next_waiter() deboosts and retains preemption + * disabled when dropping the wait_lock, to avoid inversion prior + * to the wakeup. preempt_disable() therein pairs with the + * preempt_enable() in rt_mutex_postunlock(). */ mark_wakeup_next_waiter(wqh, lock); From f730fd535fc51573f982fad629f2fc6b4a0cde2f Mon Sep 17 00:00:00 2001 From: Uros Bizjak Date: Mon, 19 Aug 2024 09:41:15 +0200 Subject: [PATCH 16/29] cleanup: Remove address space of returned pointer Guard functions in local_lock.h are defined using DEFINE_GUARD() and DEFINE_LOCK_GUARD_1() macros having lock type defined as pointer in the percpu address space. The functions, defined by these macros return value in generic address space, causing: cleanup.h:157:18: error: return from pointer to non-enclosed address space and cleanup.h:214:18: error: return from pointer to non-enclosed address space when strict percpu checks are enabled. Add explicit casts to remove address space of the returned pointer. Found by GCC's named address space checks. 
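A minimal sketch of the cast (hypothetical helper; the real change is in the guard macros below): casting through unsigned long with __force strips the __percpu address space before the pointer is returned as a plain void *.

  #include <linux/local_lock.h>

  static inline void *example_lock_ptr(local_lock_t __percpu *lock)
  {
          /* __force drops the percpu address space; illustrative only */
          return (void *)(__force unsigned long)lock;
  }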
Fixes: e4ab322fbaaa ("cleanup: Add conditional guard support") Signed-off-by: Uros Bizjak Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20240819074124.143565-1-ubizjak@gmail.com --- include/linux/cleanup.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/cleanup.h b/include/linux/cleanup.h index 038b2d523bf8..518bd1fd86fb 100644 --- a/include/linux/cleanup.h +++ b/include/linux/cleanup.h @@ -290,7 +290,7 @@ static inline class_##_name##_t class_##_name##ext##_constructor(_init_args) \ #define DEFINE_GUARD(_name, _type, _lock, _unlock) \ DEFINE_CLASS(_name, _type, if (_T) { _unlock; }, ({ _lock; _T; }), _type _T); \ static inline void * class_##_name##_lock_ptr(class_##_name##_t *_T) \ - { return *_T; } + { return (void *)(__force unsigned long)*_T; } #define DEFINE_GUARD_COND(_name, _ext, _condlock) \ EXTEND_CLASS(_name, _ext, \ @@ -347,7 +347,7 @@ static inline void class_##_name##_destructor(class_##_name##_t *_T) \ \ static inline void *class_##_name##_lock_ptr(class_##_name##_t *_T) \ { \ - return _T->lock; \ + return (void *)(__force unsigned long)_T->lock; \ } From 0d75e0c420e52b4057a2de274054a5274209a2ae Mon Sep 17 00:00:00 2001 From: Uros Bizjak Date: Tue, 1 Oct 2024 13:45:57 +0200 Subject: [PATCH 17/29] locking/osq_lock: Use atomic_try_cmpxchg_release() in osq_unlock() Replace this pattern in osq_unlock(): atomic_cmpxchg(*ptr, old, new) == old ... with the simpler and faster: atomic_try_cmpxchg(*ptr, &old, new) The x86 CMPXCHG instruction returns success in the ZF flag, so this change saves a compare after the CMPXCHG. The code in the fast path of osq_unlock() improves from: 11b: 31 c9 xor %ecx,%ecx 11d: 8d 50 01 lea 0x1(%rax),%edx 120: 89 d0 mov %edx,%eax 122: f0 0f b1 0f lock cmpxchg %ecx,(%rdi) 126: 39 c2 cmp %eax,%edx 128: 75 05 jne 12f <...> to: 12b: 31 d2 xor %edx,%edx 12d: 83 c0 01 add $0x1,%eax 130: f0 0f b1 17 lock cmpxchg %edx,(%rdi) 134: 75 05 jne 13b <...> Signed-off-by: Uros Bizjak Signed-off-by: Peter Zijlstra (Intel) Acked-by: Waiman Long Link: https://lore.kernel.org/r/20241001114606.820277-1-ubizjak@gmail.com --- kernel/locking/osq_lock.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c index 75a6f6133866..b4233dc2c2b0 100644 --- a/kernel/locking/osq_lock.c +++ b/kernel/locking/osq_lock.c @@ -215,8 +215,7 @@ void osq_unlock(struct optimistic_spin_queue *lock) /* * Fast path for the uncontended case. */ - if (likely(atomic_cmpxchg_release(&lock->tail, curr, - OSQ_UNLOCKED_VAL) == curr)) + if (atomic_try_cmpxchg_release(&lock->tail, &curr, OSQ_UNLOCKED_VAL)) return; /* From fcc22ac5baf06dd17193de44b60dbceea6461983 Mon Sep 17 00:00:00 2001 From: Przemek Kitszel Date: Fri, 18 Oct 2024 13:38:14 +0200 Subject: [PATCH 18/29] cleanup: Adjust scoped_guard() macros to avoid potential warning Change scoped_guard() and scoped_cond_guard() macros to make reasoning about them easier for static analysis tools (smatch, compiler diagnostics), especially to enable them to tell if the given usage of scoped_guard() is with a conditional lock class (interruptible-locks, try-locks) or not (like simple mutex_lock()). Add compile-time error if scoped_cond_guard() is used for non-conditional lock class. 
Beyond easier tooling and a little shrink reported by bloat-o-meter this patch enables developer to write code like: int foo(struct my_drv *adapter) { scoped_guard(spinlock, &adapter->some_spinlock) return adapter->spinlock_protected_var; } Current scoped_guard() implementation does not support that, due to compiler complaining: error: control reaches end of non-void function [-Werror=return-type] Technical stuff about the change: scoped_guard() macro uses common idiom of using "for" statement to declare a scoped variable. Unfortunately, current logic is too hard for compiler diagnostics to be sure that there is exactly one loop step; fix that. To make any loop so trivial that there is no above warning, it must not depend on any non-const variable to tell if there are more steps. There is no obvious solution for that in C, but one could use the compound statement expression with "goto" jumping past the "loop", effectively leaving only the subscope part of the loop semantics. More impl details: one more level of macro indirection is now needed to avoid duplicating label names; I didn't spot any other place that is using the "for (...; goto label) if (0) label: break;" idiom, so it's not packed for reuse beyond scoped_guard() family, what makes actual macros code cleaner. There was also a need to introduce const true/false variable per lock class, it is used to aid compiler diagnostics reasoning about "exactly 1 step" loops (note that converting that to function would undo the whole benefit). Big thanks to Andy Shevchenko for help on this patch, both internal and public, ranging from whitespace/formatting, through commit message clarifications, general improvements, ending with presenting alternative approaches - all despite not even liking the idea. Big thanks to Dmitry Torokhov for the idea of compile-time check for scoped_cond_guard() (to use it only with conditional locsk), and general improvements for the patch. Big thanks to David Lechner for idea to cover also scoped_cond_guard(). Signed-off-by: Przemek Kitszel Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Dmitry Torokhov Link: https://lkml.kernel.org/r/20241018113823.171256-1-przemyslaw.kitszel@intel.com --- include/linux/cleanup.h | 48 ++++++++++++++++++++++++++++++++++------- 1 file changed, 40 insertions(+), 8 deletions(-) diff --git a/include/linux/cleanup.h b/include/linux/cleanup.h index 518bd1fd86fb..0cc66f8d28e7 100644 --- a/include/linux/cleanup.h +++ b/include/linux/cleanup.h @@ -285,14 +285,20 @@ static inline class_##_name##_t class_##_name##ext##_constructor(_init_args) \ * similar to scoped_guard(), except it does fail when the lock * acquire fails. * + * Only for conditional locks. 
*/ +#define __DEFINE_CLASS_IS_CONDITIONAL(_name, _is_cond) \ +static __maybe_unused const bool class_##_name##_is_conditional = _is_cond + #define DEFINE_GUARD(_name, _type, _lock, _unlock) \ + __DEFINE_CLASS_IS_CONDITIONAL(_name, false); \ DEFINE_CLASS(_name, _type, if (_T) { _unlock; }, ({ _lock; _T; }), _type _T); \ static inline void * class_##_name##_lock_ptr(class_##_name##_t *_T) \ { return (void *)(__force unsigned long)*_T; } #define DEFINE_GUARD_COND(_name, _ext, _condlock) \ + __DEFINE_CLASS_IS_CONDITIONAL(_name##_ext, true); \ EXTEND_CLASS(_name, _ext, \ ({ void *_t = _T; if (_T && !(_condlock)) _t = NULL; _t; }), \ class_##_name##_t _T) \ @@ -303,17 +309,40 @@ static inline class_##_name##_t class_##_name##ext##_constructor(_init_args) \ CLASS(_name, __UNIQUE_ID(guard)) #define __guard_ptr(_name) class_##_name##_lock_ptr +#define __is_cond_ptr(_name) class_##_name##_is_conditional -#define scoped_guard(_name, args...) \ - for (CLASS(_name, scope)(args), \ - *done = NULL; __guard_ptr(_name)(&scope) && !done; done = (void *)1) +/* + * Helper macro for scoped_guard(). + * + * Note that the "!__is_cond_ptr(_name)" part of the condition ensures that + * compiler would be sure that for the unconditional locks the body of the + * loop (caller-provided code glued to the else clause) could not be skipped. + * It is needed because the other part - "__guard_ptr(_name)(&scope)" - is too + * hard to deduce (even if could be proven true for unconditional locks). + */ +#define __scoped_guard(_name, _label, args...) \ + for (CLASS(_name, scope)(args); \ + __guard_ptr(_name)(&scope) || !__is_cond_ptr(_name); \ + ({ goto _label; })) \ + if (0) { \ +_label: \ + break; \ + } else -#define scoped_cond_guard(_name, _fail, args...) \ - for (CLASS(_name, scope)(args), \ - *done = NULL; !done; done = (void *)1) \ - if (!__guard_ptr(_name)(&scope)) _fail; \ - else +#define scoped_guard(_name, args...) \ + __scoped_guard(_name, __UNIQUE_ID(label), args) +#define __scoped_cond_guard(_name, _fail, _label, args...) \ + for (CLASS(_name, scope)(args); true; ({ goto _label; })) \ + if (!__guard_ptr(_name)(&scope)) { \ + BUILD_BUG_ON(!__is_cond_ptr(_name)); \ + _fail; \ +_label: \ + break; \ + } else + +#define scoped_cond_guard(_name, _fail, args...) \ + __scoped_cond_guard(_name, _fail, __UNIQUE_ID(label), args) /* * Additional helper macros for generating lock guards with types, either for * locks that don't have a native type (eg. RCU, preempt) or those that need a @@ -369,14 +398,17 @@ static inline class_##_name##_t class_##_name##_constructor(void) \ } #define DEFINE_LOCK_GUARD_1(_name, _type, _lock, _unlock, ...) \ +__DEFINE_CLASS_IS_CONDITIONAL(_name, false); \ __DEFINE_UNLOCK_GUARD(_name, _type, _unlock, __VA_ARGS__) \ __DEFINE_LOCK_GUARD_1(_name, _type, _lock) #define DEFINE_LOCK_GUARD_0(_name, _lock, _unlock, ...) \ +__DEFINE_CLASS_IS_CONDITIONAL(_name, false); \ __DEFINE_UNLOCK_GUARD(_name, void, _unlock, __VA_ARGS__) \ __DEFINE_LOCK_GUARD_0(_name, _lock) #define DEFINE_LOCK_GUARD_1_COND(_name, _ext, _condlock) \ + __DEFINE_CLASS_IS_CONDITIONAL(_name##_ext, true); \ EXTEND_CLASS(_name, _ext, \ ({ class_##_name##_t _t = { .lock = l }, *_T = &_t;\ if (_T->lock && !(_condlock)) _T->lock = NULL; \ From 36c2cf88808d47e926d11b98734f154fe4a9f50f Mon Sep 17 00:00:00 2001 From: David Lechner Date: Tue, 1 Oct 2024 17:30:18 -0500 Subject: [PATCH 19/29] cleanup: Add conditional guard helper Add a new if_not_guard() macro to cleanup.h for handling conditional guards such as mutext_trylock(). 
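A hedged usage sketch (hypothetical function, relying on the existing mutex_try guard class from <linux/mutex.h>):

  #include <linux/cleanup.h>
  #include <linux/mutex.h>
  #include <linux/errno.h>

  static int example_try_update(struct mutex *lock, int *val)
  {
          if_not_guard(mutex_try, lock)
                  return -EBUSY;

          /* lock is held here and released automatically on return */
          (*val)++;
          return 0;
  }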
This is more ergonomic than scoped_guard() for most use cases. Instead of hiding the error handling statement in the macro args, it works like a normal if statement and allow the error path to be indented while the normal code flow path is not indented. And it avoid unwanted side-effect from hidden for loop in scoped_guard(). Signed-off-by: David Lechner Co-developed-by: Fabio M. De Francesco Signed-off-by: Fabio M. De Francesco Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Dan Williams Link: https://lkml.kernel.org/r/20241001-cleanup-if_not_cond_guard-v1-1-7753810b0f7a@baylibre.com --- include/linux/cleanup.h | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/include/linux/cleanup.h b/include/linux/cleanup.h index 0cc66f8d28e7..e859f79b9d2d 100644 --- a/include/linux/cleanup.h +++ b/include/linux/cleanup.h @@ -273,6 +273,12 @@ static inline class_##_name##_t class_##_name##ext##_constructor(_init_args) \ * an anonymous instance of the (guard) class, not recommended for * conditional locks. * + * if_not_guard(name, args...) { }: + * convenience macro for conditional guards that calls the statement that + * follows only if the lock was not acquired (typically an error return). + * + * Only for conditional locks. + * * scoped_guard (name, args...) { }: * similar to CLASS(name, scope)(args), except the variable (with the * explicit name 'scope') is declard in a for-loop such that its scope is @@ -343,6 +349,15 @@ _label: \ #define scoped_cond_guard(_name, _fail, args...) \ __scoped_cond_guard(_name, _fail, __UNIQUE_ID(label), args) + +#define __if_not_guard(_name, _id, args...) \ + BUILD_BUG_ON(!__is_cond_ptr(_name)); \ + CLASS(_name, _id)(args); \ + if (!__guard_ptr(_name)(&_id)) + +#define if_not_guard(_name, args...) \ + __if_not_guard(_name, __UNIQUE_ID(guard), args) + /* * Additional helper macros for generating lock guards with types, either for * locks that don't have a native type (eg. RCU, preempt) or those that need a From 8b64db9733c2e4d30fd068d0b9dcef7b4424b035 Mon Sep 17 00:00:00 2001 From: Uros Bizjak Date: Sun, 3 Nov 2024 17:09:31 +0100 Subject: [PATCH 20/29] locking/atomic/x86: Use ALT_OUTPUT_SP() for __alternative_atomic64() CONFIG_X86_CMPXCHG64 variant of x86_32 __alternative_atomic64() macro uses CALL instruction inside asm statement. Use ALT_OUTPUT_SP() macro to add required dependence on %esp register. Fixes: 819165fb34b9 ("x86: Adjust asm constraints in atomic64 wrappers") Signed-off-by: Uros Bizjak Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20241103160954.3329-1-ubizjak@gmail.com --- arch/x86/include/asm/atomic64_32.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/atomic64_32.h b/arch/x86/include/asm/atomic64_32.h index 1f650b4dde50..6c6e9b9f98a4 100644 --- a/arch/x86/include/asm/atomic64_32.h +++ b/arch/x86/include/asm/atomic64_32.h @@ -51,7 +51,8 @@ static __always_inline s64 arch_atomic64_read_nonatomic(const atomic64_t *v) #ifdef CONFIG_X86_CMPXCHG64 #define __alternative_atomic64(f, g, out, in...) 
\ asm volatile("call %c[func]" \ - : out : [func] "i" (atomic64_##g##_cx8), ## in) + : ALT_OUTPUT_SP(out) \ + : [func] "i" (atomic64_##g##_cx8), ## in) #define ATOMIC64_DECL(sym) ATOMIC64_DECL_ONE(sym##_cx8) #else From 25cf4fbb596d730476afcc0fb87a9d708db14078 Mon Sep 17 00:00:00 2001 From: Uros Bizjak Date: Sun, 3 Nov 2024 17:09:32 +0100 Subject: [PATCH 21/29] locking/atomic/x86: Use ALT_OUTPUT_SP() for __arch_{,try_}cmpxchg64_emu() x86_32 __arch_{,try_}cmpxchg64_emu()() macros use CALL instruction inside asm statement. Use ALT_OUTPUT_SP() macro to add required dependence on %esp register. Fixes: 79e1dd05d1a2 ("x86: Provide an alternative() based cmpxchg64()") Signed-off-by: Uros Bizjak Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20241103160954.3329-2-ubizjak@gmail.com --- arch/x86/include/asm/cmpxchg_32.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/cmpxchg_32.h b/arch/x86/include/asm/cmpxchg_32.h index 62cef2113ca7..fd1282a783dd 100644 --- a/arch/x86/include/asm/cmpxchg_32.h +++ b/arch/x86/include/asm/cmpxchg_32.h @@ -94,7 +94,7 @@ static __always_inline bool __try_cmpxchg64_local(volatile u64 *ptr, u64 *oldp, asm volatile(ALTERNATIVE(_lock_loc \ "call cmpxchg8b_emu", \ _lock "cmpxchg8b %a[ptr]", X86_FEATURE_CX8) \ - : "+a" (o.low), "+d" (o.high) \ + : ALT_OUTPUT_SP("+a" (o.low), "+d" (o.high)) \ : "b" (n.low), "c" (n.high), [ptr] "S" (_ptr) \ : "memory"); \ \ @@ -123,8 +123,8 @@ static __always_inline u64 arch_cmpxchg64_local(volatile u64 *ptr, u64 old, u64 "call cmpxchg8b_emu", \ _lock "cmpxchg8b %a[ptr]", X86_FEATURE_CX8) \ CC_SET(e) \ - : CC_OUT(e) (ret), \ - "+a" (o.low), "+d" (o.high) \ + : ALT_OUTPUT_SP(CC_OUT(e) (ret), \ + "+a" (o.low), "+d" (o.high)) \ : "b" (n.low), "c" (n.high), [ptr] "S" (_ptr) \ : "memory"); \ \ From 1139c71df5ca29a36f08e3a08c7cee160db21ec1 Mon Sep 17 00:00:00 2001 From: Marco Elver Date: Mon, 4 Nov 2024 16:43:05 +0100 Subject: [PATCH 22/29] time/sched_clock: Swap update_clock_read_data() latch writes Swap the writes to the odd and even copies to make the writer critical section look like all other seqcount_latch writers. Signed-off-by: Marco Elver Signed-off-by: Peter Zijlstra (Intel) Link: https://lore.kernel.org/r/20241104161910.780003-2-elver@google.com --- kernel/time/sched_clock.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c index 68d6c1190ac7..85595fcf6aa2 100644 --- a/kernel/time/sched_clock.c +++ b/kernel/time/sched_clock.c @@ -119,9 +119,6 @@ unsigned long long notrace sched_clock(void) */ static void update_clock_read_data(struct clock_read_data *rd) { - /* update the backup (odd) copy with the new data */ - cd.read_data[1] = *rd; - /* steer readers towards the odd copy */ raw_write_seqcount_latch(&cd.seq); @@ -130,6 +127,9 @@ static void update_clock_read_data(struct clock_read_data *rd) /* switch readers back to the even copy */ raw_write_seqcount_latch(&cd.seq); + + /* update the backup (odd) copy with the new data */ + cd.read_data[1] = *rd; } /* From 8ab40fc2b9086b915e46890bb9252dc7692f1da0 Mon Sep 17 00:00:00 2001 From: Marco Elver Date: Mon, 4 Nov 2024 16:43:06 +0100 Subject: [PATCH 23/29] time/sched_clock: Broaden sched_clock()'s instrumentation coverage Most of sched_clock()'s implementation is ineligible for instrumentation due to relying on sched_clock_noinstr(). 
Split the implementation off into an __always_inline function __sched_clock(), which is then used by the noinstr and instrumentable version, to allow more of sched_clock() to be covered by various instrumentation. This will allow instrumentation with the various sanitizers (KASAN, KCSAN, KMSAN, UBSAN). For KCSAN, we know that raw seqcount_latch usage without annotations will result in false positive reports: tell it that all of __sched_clock() is "atomic" for the latch reader; later changes in this series will take care of the writers. Co-developed-by: "Peter Zijlstra (Intel)" Signed-off-by: "Peter Zijlstra (Intel)" Signed-off-by: Marco Elver Signed-off-by: Peter Zijlstra (Intel) Link: https://lore.kernel.org/r/20241104161910.780003-3-elver@google.com --- kernel/time/sched_clock.c | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c index 85595fcf6aa2..29bdf309dae8 100644 --- a/kernel/time/sched_clock.c +++ b/kernel/time/sched_clock.c @@ -80,7 +80,7 @@ notrace int sched_clock_read_retry(unsigned int seq) return raw_read_seqcount_latch_retry(&cd.seq, seq); } -unsigned long long noinstr sched_clock_noinstr(void) +static __always_inline unsigned long long __sched_clock(void) { struct clock_read_data *rd; unsigned int seq; @@ -98,11 +98,23 @@ unsigned long long noinstr sched_clock_noinstr(void) return res; } +unsigned long long noinstr sched_clock_noinstr(void) +{ + return __sched_clock(); +} + unsigned long long notrace sched_clock(void) { unsigned long long ns; preempt_disable_notrace(); - ns = sched_clock_noinstr(); + /* + * All of __sched_clock() is a seqcount_latch reader critical section, + * but relies on the raw helpers which are uninstrumented. For KCSAN, + * mark all accesses in __sched_clock() as atomic. + */ + kcsan_nestable_atomic_begin(); + ns = __sched_clock(); + kcsan_nestable_atomic_end(); preempt_enable_notrace(); return ns; } From 5c1806c41ce0a0110db5dd4c483cf2dc28b3ddf0 Mon Sep 17 00:00:00 2001 From: Marco Elver Date: Mon, 4 Nov 2024 16:43:07 +0100 Subject: [PATCH 24/29] kcsan, seqlock: Support seqcount_latch_t While fuzzing an arm64 kernel, Alexander Potapenko reported: | BUG: KCSAN: data-race in ktime_get_mono_fast_ns / timekeeping_update | | write to 0xffffffc082e74248 of 56 bytes by interrupt on cpu 0: | update_fast_timekeeper kernel/time/timekeeping.c:430 [inline] | timekeeping_update+0x1d8/0x2d8 kernel/time/timekeeping.c:768 | timekeeping_advance+0x9e8/0xb78 kernel/time/timekeeping.c:2344 | update_wall_time+0x18/0x38 kernel/time/timekeeping.c:2360 | [...] | | read to 0xffffffc082e74258 of 8 bytes by task 5260 on cpu 1: | __ktime_get_fast_ns kernel/time/timekeeping.c:372 [inline] | ktime_get_mono_fast_ns+0x88/0x174 kernel/time/timekeeping.c:489 | init_srcu_struct_fields+0x40c/0x530 kernel/rcu/srcutree.c:263 | init_srcu_struct+0x14/0x20 kernel/rcu/srcutree.c:311 | [...] | | value changed: 0x000002f875d33266 -> 0x000002f877416866 | | Reported by Kernel Concurrency Sanitizer on: | CPU: 1 UID: 0 PID: 5260 Comm: syz.2.7483 Not tainted 6.12.0-rc3-dirty #78 This is a false positive data race between a seqcount latch writer and a reader accessing stale data. Since its introduction, KCSAN has never understood the seqcount_latch interface (due to being unannotated). Unlike the regular seqlock interface, the seqcount_latch interface for latch writers never has had a well-defined critical section, making it difficult to teach tooling where the critical section starts and ends. 
Introduce an instrumentable (non-raw) seqcount_latch interface, with which we can clearly denote writer critical sections. This both helps readability and tooling like KCSAN to understand when the writer is done updating all latch copies. Fixes: 88ecd153be95 ("seqlock, kcsan: Add annotations for KCSAN") Reported-by: Alexander Potapenko Co-developed-by: "Peter Zijlstra (Intel)" Signed-off-by: "Peter Zijlstra (Intel)" Signed-off-by: Marco Elver Signed-off-by: Peter Zijlstra (Intel) Link: https://lore.kernel.org/r/20241104161910.780003-4-elver@google.com --- Documentation/locking/seqlock.rst | 2 +- include/linux/seqlock.h | 86 +++++++++++++++++++++++++------ 2 files changed, 72 insertions(+), 16 deletions(-) diff --git a/Documentation/locking/seqlock.rst b/Documentation/locking/seqlock.rst index bfda1a5fecad..ec6411d02ac8 100644 --- a/Documentation/locking/seqlock.rst +++ b/Documentation/locking/seqlock.rst @@ -153,7 +153,7 @@ Use seqcount_latch_t when the write side sections cannot be protected from interruption by readers. This is typically the case when the read side can be invoked from NMI handlers. -Check `raw_write_seqcount_latch()` for more information. +Check `write_seqcount_latch()` for more information. .. _seqlock_t: diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h index fffeb754880f..45eee0e5dca0 100644 --- a/include/linux/seqlock.h +++ b/include/linux/seqlock.h @@ -621,6 +621,23 @@ static __always_inline unsigned raw_read_seqcount_latch(const seqcount_latch_t * return READ_ONCE(s->seqcount.sequence); } +/** + * read_seqcount_latch() - pick even/odd latch data copy + * @s: Pointer to seqcount_latch_t + * + * See write_seqcount_latch() for details and a full reader/writer usage + * example. + * + * Return: sequence counter raw value. Use the lowest bit as an index for + * picking which data copy to read. The full counter must then be checked + * with read_seqcount_latch_retry(). + */ +static __always_inline unsigned read_seqcount_latch(const seqcount_latch_t *s) +{ + kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX); + return raw_read_seqcount_latch(s); +} + /** * raw_read_seqcount_latch_retry() - end a seqcount_latch_t read section * @s: Pointer to seqcount_latch_t @@ -635,9 +652,34 @@ raw_read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start) return unlikely(READ_ONCE(s->seqcount.sequence) != start); } +/** + * read_seqcount_latch_retry() - end a seqcount_latch_t read section + * @s: Pointer to seqcount_latch_t + * @start: count, from read_seqcount_latch() + * + * Return: true if a read section retry is required, else false + */ +static __always_inline int +read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start) +{ + kcsan_atomic_next(0); + return raw_read_seqcount_latch_retry(s, start); +} + /** * raw_write_seqcount_latch() - redirect latch readers to even/odd copy * @s: Pointer to seqcount_latch_t + */ +static __always_inline void raw_write_seqcount_latch(seqcount_latch_t *s) +{ + smp_wmb(); /* prior stores before incrementing "sequence" */ + s->seqcount.sequence++; + smp_wmb(); /* increment "sequence" before following stores */ +} + +/** + * write_seqcount_latch_begin() - redirect latch readers to odd copy + * @s: Pointer to seqcount_latch_t * * The latch technique is a multiversion concurrency control method that allows * queries during non-atomic modifications. 
If you can guarantee queries never @@ -665,17 +707,11 @@ raw_read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start) * * void latch_modify(struct latch_struct *latch, ...) * { - * smp_wmb(); // Ensure that the last data[1] update is visible - * latch->seq.sequence++; - * smp_wmb(); // Ensure that the seqcount update is visible - * + * write_seqcount_latch_begin(&latch->seq); * modify(latch->data[0], ...); - * - * smp_wmb(); // Ensure that the data[0] update is visible - * latch->seq.sequence++; - * smp_wmb(); // Ensure that the seqcount update is visible - * + * write_seqcount_latch(&latch->seq); * modify(latch->data[1], ...); + * write_seqcount_latch_end(&latch->seq); * } * * The query will have a form like:: @@ -686,13 +722,13 @@ raw_read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start) * unsigned seq, idx; * * do { - * seq = raw_read_seqcount_latch(&latch->seq); + * seq = read_seqcount_latch(&latch->seq); * * idx = seq & 0x01; * entry = data_query(latch->data[idx], ...); * * // This includes needed smp_rmb() - * } while (raw_read_seqcount_latch_retry(&latch->seq, seq)); + * } while (read_seqcount_latch_retry(&latch->seq, seq)); * * return entry; * } @@ -716,11 +752,31 @@ raw_read_seqcount_latch_retry(const seqcount_latch_t *s, unsigned start) * When data is a dynamic data structure; one should use regular RCU * patterns to manage the lifetimes of the objects within. */ -static inline void raw_write_seqcount_latch(seqcount_latch_t *s) +static __always_inline void write_seqcount_latch_begin(seqcount_latch_t *s) { - smp_wmb(); /* prior stores before incrementing "sequence" */ - s->seqcount.sequence++; - smp_wmb(); /* increment "sequence" before following stores */ + kcsan_nestable_atomic_begin(); + raw_write_seqcount_latch(s); +} + +/** + * write_seqcount_latch() - redirect latch readers to even copy + * @s: Pointer to seqcount_latch_t + */ +static __always_inline void write_seqcount_latch(seqcount_latch_t *s) +{ + raw_write_seqcount_latch(s); +} + +/** + * write_seqcount_latch_end() - end a seqcount_latch_t write section + * @s: Pointer to seqcount_latch_t + * + * Marks the end of a seqcount_latch_t writer section, after all copies of the + * latch-protected data have been updated. + */ +static __always_inline void write_seqcount_latch_end(seqcount_latch_t *s) +{ + kcsan_nestable_atomic_end(); } #define __SEQLOCK_UNLOCKED(lockname) \ From 93190bc35d6d4364a4d8c38ac8961dabecbff4ed Mon Sep 17 00:00:00 2001 From: Marco Elver Date: Mon, 4 Nov 2024 16:43:08 +0100 Subject: [PATCH 25/29] seqlock, treewide: Switch to non-raw seqcount_latch interface Switch all instrumentable users of the seqcount_latch interface over to the non-raw interface. 
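The conversions in the diffs below all follow the same shape; for reference, a minimal reader counterpart to the writer sketch shown earlier, again with invented demo_* names:

	static u64 demo_read(struct demo_latch *l)
	{
		unsigned int seq;
		u64 v;

		do {
			seq = read_seqcount_latch(&l->seq);	/* non-raw: marks the section for KCSAN */
			v = l->val[seq & 1];			/* low bit selects the stable copy */
		} while (read_seqcount_latch_retry(&l->seq, seq));

		return v;
	}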
Co-developed-by: "Peter Zijlstra (Intel)" Signed-off-by: "Peter Zijlstra (Intel)" Signed-off-by: Marco Elver Signed-off-by: Peter Zijlstra (Intel) Link: https://lore.kernel.org/r/20241104161910.780003-5-elver@google.com --- arch/x86/kernel/tsc.c | 5 +++-- include/linux/rbtree_latch.h | 20 +++++++++++--------- kernel/printk/printk.c | 9 +++++---- kernel/time/sched_clock.c | 12 +++++++----- kernel/time/timekeeping.c | 12 +++++++----- 5 files changed, 33 insertions(+), 25 deletions(-) diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index dfe6847fd99e..67aeaba4ba9c 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -174,10 +174,11 @@ static void __set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long ts c2n = per_cpu_ptr(&cyc2ns, cpu); - raw_write_seqcount_latch(&c2n->seq); + write_seqcount_latch_begin(&c2n->seq); c2n->data[0] = data; - raw_write_seqcount_latch(&c2n->seq); + write_seqcount_latch(&c2n->seq); c2n->data[1] = data; + write_seqcount_latch_end(&c2n->seq); } static void set_cyc2ns_scale(unsigned long khz, int cpu, unsigned long long tsc_now) diff --git a/include/linux/rbtree_latch.h b/include/linux/rbtree_latch.h index 6a0999c26c7c..2f630eb8307e 100644 --- a/include/linux/rbtree_latch.h +++ b/include/linux/rbtree_latch.h @@ -14,7 +14,7 @@ * * If we need to allow unconditional lookups (say as required for NMI context * usage) we need a more complex setup; this data structure provides this by - * employing the latch technique -- see @raw_write_seqcount_latch -- to + * employing the latch technique -- see @write_seqcount_latch_begin -- to * implement a latched RB-tree which does allow for unconditional lookups by * virtue of always having (at least) one stable copy of the tree. * @@ -132,7 +132,7 @@ __lt_find(void *key, struct latch_tree_root *ltr, int idx, * @ops: operators defining the node order * * It inserts @node into @root in an ordered fashion such that we can always - * observe one complete tree. See the comment for raw_write_seqcount_latch(). + * observe one complete tree. See the comment for write_seqcount_latch_begin(). * * The inserts use rcu_assign_pointer() to publish the element such that the * tree structure is stored before we can observe the new @node. @@ -145,10 +145,11 @@ latch_tree_insert(struct latch_tree_node *node, struct latch_tree_root *root, const struct latch_tree_ops *ops) { - raw_write_seqcount_latch(&root->seq); + write_seqcount_latch_begin(&root->seq); __lt_insert(node, root, 0, ops->less); - raw_write_seqcount_latch(&root->seq); + write_seqcount_latch(&root->seq); __lt_insert(node, root, 1, ops->less); + write_seqcount_latch_end(&root->seq); } /** @@ -159,7 +160,7 @@ latch_tree_insert(struct latch_tree_node *node, * * Removes @node from the trees @root in an ordered fashion such that we can * always observe one complete tree. See the comment for - * raw_write_seqcount_latch(). + * write_seqcount_latch_begin(). * * It is assumed that @node will observe one RCU quiescent state before being * reused of freed. 
@@ -172,10 +173,11 @@ latch_tree_erase(struct latch_tree_node *node, struct latch_tree_root *root, const struct latch_tree_ops *ops) { - raw_write_seqcount_latch(&root->seq); + write_seqcount_latch_begin(&root->seq); __lt_erase(node, root, 0); - raw_write_seqcount_latch(&root->seq); + write_seqcount_latch(&root->seq); __lt_erase(node, root, 1); + write_seqcount_latch_end(&root->seq); } /** @@ -204,9 +206,9 @@ latch_tree_find(void *key, struct latch_tree_root *root, unsigned int seq; do { - seq = raw_read_seqcount_latch(&root->seq); + seq = read_seqcount_latch(&root->seq); node = __lt_find(key, root, seq & 1, ops->comp); - } while (raw_read_seqcount_latch_retry(&root->seq, seq)); + } while (read_seqcount_latch_retry(&root->seq, seq)); return node; } diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index beb808f4c367..19911c8fa7b6 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -560,10 +560,11 @@ bool printk_percpu_data_ready(void) /* Must be called under syslog_lock. */ static void latched_seq_write(struct latched_seq *ls, u64 val) { - raw_write_seqcount_latch(&ls->latch); + write_seqcount_latch_begin(&ls->latch); ls->val[0] = val; - raw_write_seqcount_latch(&ls->latch); + write_seqcount_latch(&ls->latch); ls->val[1] = val; + write_seqcount_latch_end(&ls->latch); } /* Can be called from any context. */ @@ -574,10 +575,10 @@ static u64 latched_seq_read_nolock(struct latched_seq *ls) u64 val; do { - seq = raw_read_seqcount_latch(&ls->latch); + seq = read_seqcount_latch(&ls->latch); idx = seq & 0x1; val = ls->val[idx]; - } while (raw_read_seqcount_latch_retry(&ls->latch, seq)); + } while (read_seqcount_latch_retry(&ls->latch, seq)); return val; } diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c index 29bdf309dae8..fcca4e72f1ef 100644 --- a/kernel/time/sched_clock.c +++ b/kernel/time/sched_clock.c @@ -71,13 +71,13 @@ static __always_inline u64 cyc_to_ns(u64 cyc, u32 mult, u32 shift) notrace struct clock_read_data *sched_clock_read_begin(unsigned int *seq) { - *seq = raw_read_seqcount_latch(&cd.seq); + *seq = read_seqcount_latch(&cd.seq); return cd.read_data + (*seq & 1); } notrace int sched_clock_read_retry(unsigned int seq) { - return raw_read_seqcount_latch_retry(&cd.seq, seq); + return read_seqcount_latch_retry(&cd.seq, seq); } static __always_inline unsigned long long __sched_clock(void) @@ -132,16 +132,18 @@ unsigned long long notrace sched_clock(void) static void update_clock_read_data(struct clock_read_data *rd) { /* steer readers towards the odd copy */ - raw_write_seqcount_latch(&cd.seq); + write_seqcount_latch_begin(&cd.seq); /* now its safe for us to update the normal (even) copy */ cd.read_data[0] = *rd; /* switch readers back to the even copy */ - raw_write_seqcount_latch(&cd.seq); + write_seqcount_latch(&cd.seq); /* update the backup (odd) copy with the new data */ cd.read_data[1] = *rd; + + write_seqcount_latch_end(&cd.seq); } /* @@ -279,7 +281,7 @@ void __init generic_sched_clock_init(void) */ static u64 notrace suspended_sched_clock_read(void) { - unsigned int seq = raw_read_seqcount_latch(&cd.seq); + unsigned int seq = read_seqcount_latch(&cd.seq); return cd.read_data[seq & 1].epoch_cyc; } diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index 7e6f409bf311..18752983e834 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -411,7 +411,7 @@ static inline u64 timekeeping_get_ns(const struct tk_read_base *tkr) * We want to use this from any context including NMI and tracing / * instrumenting 
the timekeeping code itself. * - * Employ the latch technique; see @raw_write_seqcount_latch. + * Employ the latch technique; see @write_seqcount_latch. * * So if a NMI hits the update of base[0] then it will use base[1] * which is still consistent. In the worst case this can result is a @@ -424,16 +424,18 @@ static void update_fast_timekeeper(const struct tk_read_base *tkr, struct tk_read_base *base = tkf->base; /* Force readers off to base[1] */ - raw_write_seqcount_latch(&tkf->seq); + write_seqcount_latch_begin(&tkf->seq); /* Update base[0] */ memcpy(base, tkr, sizeof(*base)); /* Force readers back to base[0] */ - raw_write_seqcount_latch(&tkf->seq); + write_seqcount_latch(&tkf->seq); /* Update base[1] */ memcpy(base + 1, base, sizeof(*base)); + + write_seqcount_latch_end(&tkf->seq); } static __always_inline u64 __ktime_get_fast_ns(struct tk_fast *tkf) @@ -443,11 +445,11 @@ static __always_inline u64 __ktime_get_fast_ns(struct tk_fast *tkf) u64 now; do { - seq = raw_read_seqcount_latch(&tkf->seq); + seq = read_seqcount_latch(&tkf->seq); tkr = tkf->base + (seq & 0x01); now = ktime_to_ns(tkr->base); now += __timekeeping_get_ns(tkr); - } while (raw_read_seqcount_latch_retry(&tkf->seq, seq)); + } while (read_seqcount_latch_retry(&tkf->seq, seq)); return now; } From 183ec5f26b2fc97a4a9871865bfe9b33c41fddb2 Mon Sep 17 00:00:00 2001 From: Marco Elver Date: Mon, 4 Nov 2024 16:43:09 +0100 Subject: [PATCH 26/29] kcsan, seqlock: Fix incorrect assumption in read_seqbegin() During testing of the preceding changes, I noticed that in some cases, current->kcsan_ctx.in_flat_atomic remained true until task exit. This is obviously wrong, because _all_ accesses for the given task will be treated as atomic, resulting in false negatives i.e. missed data races. Debugging led to fs/dcache.c, where we can see this usage of seqlock: struct dentry *d_lookup(const struct dentry *parent, const struct qstr *name) { struct dentry *dentry; unsigned seq; do { seq = read_seqbegin(&rename_lock); dentry = __d_lookup(parent, name); if (dentry) break; } while (read_seqretry(&rename_lock, seq)); [...] As can be seen, read_seqretry() is never called if dentry != NULL; consequently, current->kcsan_ctx.in_flat_atomic will never be reset to false by read_seqretry(). Give up on the wrong assumption of "assume closing read_seqretry()", and rely on the already-present annotations in read_seqcount_begin/retry(). Fixes: 88ecd153be95 ("seqlock, kcsan: Add annotations for KCSAN") Signed-off-by: Marco Elver Signed-off-by: Peter Zijlstra (Intel) Link: https://lore.kernel.org/r/20241104161910.780003-6-elver@google.com --- include/linux/seqlock.h | 12 +----------- 1 file changed, 1 insertion(+), 11 deletions(-) diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h index 45eee0e5dca0..5298765d6ca4 100644 --- a/include/linux/seqlock.h +++ b/include/linux/seqlock.h @@ -810,11 +810,7 @@ static __always_inline void write_seqcount_latch_end(seqcount_latch_t *s) */ static inline unsigned read_seqbegin(const seqlock_t *sl) { - unsigned ret = read_seqcount_begin(&sl->seqcount); - - kcsan_atomic_next(0); /* non-raw usage, assume closing read_seqretry() */ - kcsan_flat_atomic_begin(); - return ret; + return read_seqcount_begin(&sl->seqcount); } /** @@ -830,12 +826,6 @@ static inline unsigned read_seqbegin(const seqlock_t *sl) */ static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start) { - /* - * Assume not nested: read_seqretry() may be called multiple times when - * completing read critical section. 
- */ - kcsan_flat_atomic_end(); - return read_seqcount_retry(&sl->seqcount, start); } From 5c2e7736e20d9b348a44cafbfa639fe2653fbc34 Mon Sep 17 00:00:00 2001 From: Eder Zulian Date: Thu, 7 Nov 2024 17:32:23 +0100 Subject: [PATCH 27/29] rust: helpers: Avoid raw_spin_lock initialization for PREEMPT_RT When PREEMPT_RT=y, spin locks are mapped to rt_mutex types, so using spinlock_check() + __raw_spin_lock_init() to initialize spin locks is incorrect, and would cause build errors. Introduce __spin_lock_init() to initialize a spin lock with lockdep rquired information for PREEMPT_RT builds, and use it in the Rust helper. Fixes: d2d6422f8bd1 ("x86: Allow to enable PREEMPT_RT.") Closes: https://lore.kernel.org/oe-kbuild-all/202409251238.vetlgXE9-lkp@intel.com/ Reported-by: kernel test robot Signed-off-by: Eder Zulian Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Boqun Feng Tested-by: Boqun Feng Link: https://lore.kernel.org/r/20241107163223.2092690-2-ezulian@redhat.com --- include/linux/spinlock_rt.h | 15 +++++++-------- rust/helpers/spinlock.c | 8 ++++++-- 2 files changed, 13 insertions(+), 10 deletions(-) diff --git a/include/linux/spinlock_rt.h b/include/linux/spinlock_rt.h index f9f14e135be7..f6499c37157d 100644 --- a/include/linux/spinlock_rt.h +++ b/include/linux/spinlock_rt.h @@ -16,22 +16,21 @@ static inline void __rt_spin_lock_init(spinlock_t *lock, const char *name, } #endif -#define spin_lock_init(slock) \ +#define __spin_lock_init(slock, name, key, percpu) \ do { \ - static struct lock_class_key __key; \ - \ rt_mutex_base_init(&(slock)->lock); \ - __rt_spin_lock_init(slock, #slock, &__key, false); \ + __rt_spin_lock_init(slock, name, key, percpu); \ } while (0) -#define local_spin_lock_init(slock) \ +#define _spin_lock_init(slock, percpu) \ do { \ static struct lock_class_key __key; \ - \ - rt_mutex_base_init(&(slock)->lock); \ - __rt_spin_lock_init(slock, #slock, &__key, true); \ + __spin_lock_init(slock, #slock, &__key, percpu); \ } while (0) +#define spin_lock_init(slock) _spin_lock_init(slock, false) +#define local_spin_lock_init(slock) _spin_lock_init(slock, true) + extern void rt_spin_lock(spinlock_t *lock) __acquires(lock); extern void rt_spin_lock_nested(spinlock_t *lock, int subclass) __acquires(lock); extern void rt_spin_lock_nest_lock(spinlock_t *lock, struct lockdep_map *nest_lock) __acquires(lock); diff --git a/rust/helpers/spinlock.c b/rust/helpers/spinlock.c index acc1376b833c..92f7fc418425 100644 --- a/rust/helpers/spinlock.c +++ b/rust/helpers/spinlock.c @@ -7,10 +7,14 @@ void rust_helper___spin_lock_init(spinlock_t *lock, const char *name, struct lock_class_key *key) { #ifdef CONFIG_DEBUG_SPINLOCK +# if defined(CONFIG_PREEMPT_RT) + __spin_lock_init(lock, name, key, false); +# else /*!CONFIG_PREEMPT_RT */ __raw_spin_lock_init(spinlock_check(lock), name, key, LD_WAIT_CONFIG); -#else +# endif /* CONFIG_PREEMPT_RT */ +#else /* !CONFIG_DEBUG_SPINLOCK */ spin_lock_init(lock); -#endif +#endif /* CONFIG_DEBUG_SPINLOCK */ } void rust_helper_spin_lock(spinlock_t *lock) From 9a884bdb6e9560c6da44052d5248e89d78c983a6 Mon Sep 17 00:00:00 2001 From: Stephen Rothwell Date: Fri, 8 Nov 2024 16:41:27 +0100 Subject: [PATCH 28/29] iio: magnetometer: fix if () scoped_guard() formatting Add mising braces after an if condition that contains scoped_guard(). This style is both preferred and necessary here, to fix warning after scoped_guard() change in commit fcc22ac5baf0 ("cleanup: Adjust scoped_guard() macros to avoid potential warning") to have if-else inside of the macro. 
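To see why the braces matter, here is a self-contained, hedged sketch of the ambiguity. demo_scoped_guard() is a deliberately simplified stand-in for the real scoped_guard() in <linux/cleanup.h>, reduced to the one property that matters here: after commit fcc22ac5baf0 the macro expands to a statement containing an if/else.

	#include <stdbool.h>
	#include <stdio.h>

	/* Simplified stand-in: expands to an if/else, like scoped_guard() now does. */
	#define demo_scoped_guard(cond) \
		if (!(cond)) {} else

	int main(void)
	{
		bool suspended = false;

		if (!suspended)				/* unbraced: -Wdangling-else, the macro's   */
			demo_scoped_guard(true)		/* "else" could appear to bind to either if */
				puts("write range register");

		if (!suspended) {			/* braced form, as in the fix below */
			demo_scoped_guard(true)
				puts("write range register");
		}
		return 0;
	}

The warning disappears once the outer if gets explicit braces, which is all the one-line change below does.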
Current (no braces) use in af8133j_set_scale() yields the following warnings: af8133j.c:315:12: warning: suggest explicit braces to avoid ambiguous 'else' [-Wdangling-else] af8133j.c:316:3: warning: add explicit braces to avoid dangling else [-Wdangling-else] Fixes: fcc22ac5baf0 ("cleanup: Adjust scoped_guard() macros to avoid potential warning") Closes: https://lore.kernel.org/oe-kbuild-all/202409270848.tTpyEAR7-lkp@intel.com/ Reported-by: kernel test robot Signed-off-by: Stephen Rothwell Signed-off-by: Przemek Kitszel Signed-off-by: Peter Zijlstra (Intel) Acked-by: Jonathan Cameron Link: https://lore.kernel.org/r/20241108154258.21411-1-przemyslaw.kitszel@intel.com --- drivers/iio/magnetometer/af8133j.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/iio/magnetometer/af8133j.c b/drivers/iio/magnetometer/af8133j.c index d81d89af6283..acd291f3e792 100644 --- a/drivers/iio/magnetometer/af8133j.c +++ b/drivers/iio/magnetometer/af8133j.c @@ -312,10 +312,11 @@ static int af8133j_set_scale(struct af8133j_data *data, * When suspended, just store the new range to data->range to be * applied later during power up. */ - if (!pm_runtime_status_suspended(dev)) + if (!pm_runtime_status_suspended(dev)) { scoped_guard(mutex, &data->mutex) ret = regmap_write(data->regmap, AF8133J_REG_RANGE, range); + } pm_runtime_enable(dev); From 3b49a347d751553b1d1be69c8619ae2e85fdc28d Mon Sep 17 00:00:00 2001 From: Xiu Jianfeng Date: Tue, 12 Nov 2024 02:57:24 +0000 Subject: [PATCH 29/29] locking/Documentation: Fix grammar in percpu-rw-semaphore.rst s/'is initialized'/'is initialized with' Signed-off-by: Xiu Jianfeng Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20241112025724.474881-1-xiujianfeng@huaweicloud.com --- Documentation/locking/percpu-rw-semaphore.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/locking/percpu-rw-semaphore.rst b/Documentation/locking/percpu-rw-semaphore.rst index 247de6410855..a105bf2dd812 100644 --- a/Documentation/locking/percpu-rw-semaphore.rst +++ b/Documentation/locking/percpu-rw-semaphore.rst @@ -16,8 +16,8 @@ writing is very expensive, it calls synchronize_rcu() that can take hundreds of milliseconds. The lock is declared with "struct percpu_rw_semaphore" type. -The lock is initialized percpu_init_rwsem, it returns 0 on success and --ENOMEM on allocation failure. +The lock is initialized with percpu_init_rwsem, it returns 0 on success +and -ENOMEM on allocation failure. The lock must be freed with percpu_free_rwsem to avoid memory leak. The lock is locked for read with percpu_down_read, percpu_up_read and