Scheduler changes for v6.13:

  - Core facilities:
 
     - Add the "Lazy preemption" model (CONFIG_PREEMPT_LAZY=y), which optimizes
       fair-class preemption by delaying preemption requests to the
       tick boundary, while working as full preemption for RR/FIFO/DEADLINE
       classes. (Peter Zijlstra)
 
         - x86: Enable Lazy preemption (Peter Zijlstra)
         - riscv: Enable Lazy preemption (Jisheng Zhang)
 
     - Initialize idle tasks only once (Thomas Gleixner)
 
     - sched/ext: Remove sched_fork() hack (Thomas Gleixner)
 
  - Fair scheduler:
     - Optimize the PLACE_LAG when se->vlag is zero (Huang Shijie)
 
  - Idle loop:
     - Optimize the generic idle loop by removing an unnecessary
       memory barrier (Zhongqiu Han)
 
  - RSEQ:
     - Improve cache locality of RSEQ concurrency IDs for
       intermittent workloads (Mathieu Desnoyers)
 
  - Waitqueues:
     - Make wake_up_{bit,var} less fragile (Neil Brown)
 
  - PSI:
     - Pass enqueue/dequeue flags to psi callbacks directly (Johannes Weiner)
 
  - Preparatory patches for proxy execution:
     - core: Add move_queued_task_locked helper (Connor O'Brien)
     - core: Consolidate pick_*_task to task_is_pushable helper (Connor O'Brien)
     - core: Split out __schedule() deactivate task logic into a helper (John Stultz)
     - core: Split scheduler and execution contexts (Peter Zijlstra)
     - locking/mutex: Make mutex::wait_lock irq safe (Juri Lelli)
     - locking/mutex: Expose __mutex_owner() (Juri Lelli)
     - locking/mutex: Remove wakeups from under mutex::wait_lock (Peter Zijlstra)
 
  - Misc fixes and cleanups:
     - core: Remove unused __HAVE_THREAD_FUNCTIONS hook support (David Disseldorp)
     - core: Update the comment for TIF_NEED_RESCHED_LAZY (Sebastian Andrzej Siewior)
     - wait: Remove unused bit_wait_io_timeout (Dr. David Alan Gilbert)
     - fair: remove the DOUBLE_TICK feature (Huang Shijie)
     - fair: fix the comment for PREEMPT_SHORT (Huang Shijie)
     - uclamp: Fix unused variable warning (Christian Loehle)
     - rt: No PREEMPT_RT=y for all{yes,mod}config
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmc7fnQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1hZTBAAozVdWA2m51aNa67HvAZta/olmrIagVbW
 inwbTgqa8b+UfeWEuKOfrZr5khjEh6pLgR3dBTib1uH6xxYj/Okds+qbPWSBPVLh
 yzavlm/zJZM1U1XtxE3eyVfqWik4GrY7DoIMDQQr+YH7rNXonJeJkll38OI2E5MC
 q3Q01qyMo8RJJX8qkf3f8ObOoP/51NsVniTw0Zb2fzEhXz8FjezLlxk6cMfgSkJG
 lg9gfIwUZ7Xg5neRo4kJcc3Ht31KYOhWSiupBJzRD1hss/N/AybvMcTX/Cm8d07w
 HIAdDDAn84o46miFo/a0V/hsJZ72idWbqxVJUCtaezrpOUiFkG+uInRvG/ynr0lF
 5dEI9f+6PUw8Nc7L72IyHkobjPqS2IefSaxYYCBKmxMX2qrenfTor/pKiWzzhBIl
 rX3MZSuUJ8NjV4rNGD/qXRM1IsMJrsDwxDyv+sRec3XdH33x286ds6aAUEPDQ6N7
 96VS0sOKcNUJN8776ErNjlIxRl8HTlpkaO3nZlQIfXgTlXUpRvOuKbEWqP+606lo
 oANgJTKgUhgJPWZnvmdRxDjSiOp93QcImjus9i1tN81FGiEDleONsJUxu2Di1E5+
 s1nCiytjq+cdvzCqFyiOZUh+g6kSZ4yXxNgLg2UvbXzX1zOeUQT3WtyKUhMPXhU8
 esh1TgbUbpE=
 =Zcqj
 -----END PGP SIGNATURE-----

Merge tag 'sched-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Ingo Molnar:
 "Core facilities:

   - Add the "Lazy preemption" model (CONFIG_PREEMPT_LAZY=y), which
     optimizes fair-class preemption by delaying preemption requests to
     the tick boundary, while working as full preemption for
     RR/FIFO/DEADLINE classes. (Peter Zijlstra)
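     (a toy model of the idea follows this list)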
        - x86: Enable Lazy preemption (Peter Zijlstra)
        - riscv: Enable Lazy preemption (Jisheng Zhang)

   - Initialize idle tasks only once (Thomas Gleixner)

   - sched/ext: Remove sched_fork() hack (Thomas Gleixner)
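
   As a rough illustration of the model above (a self-contained toy
   program, not kernel code - the types and helpers are invented for
   the example; only TIF_NEED_RESCHED{,_LAZY} and CONFIG_PREEMPT_LAZY
   are real names):

      /* Fair tasks get a "lazy" flag that is only acted on at the
       * next tick (or on return to user space); RT/DL tasks get the
       * immediate flag, exactly as with full preemption.
       */
      #include <stdbool.h>
      #include <stdio.h>

      enum toy_class { CLASS_FAIR, CLASS_RR_FIFO, CLASS_DEADLINE };

      struct toy_task {
              enum toy_class class;
              bool need_resched;      /* models TIF_NEED_RESCHED */
              bool need_resched_lazy; /* models TIF_NEED_RESCHED_LAZY */
      };

      /* A preemption request against the currently running task. */
      static void request_resched(struct toy_task *curr)
      {
              if (curr->class == CLASS_FAIR)
                      curr->need_resched_lazy = true; /* defer */
              else
                      curr->need_resched = true;      /* preempt now */
      }

      /* The scheduler tick upgrades a still-pending lazy request. */
      static void tick(struct toy_task *curr)
      {
              if (curr->need_resched_lazy) {
                      curr->need_resched_lazy = false;
                      curr->need_resched = true;
              }
      }

      int main(void)
      {
              struct toy_task fair = { .class = CLASS_FAIR };
              struct toy_task rt   = { .class = CLASS_RR_FIFO };

              request_resched(&fair);
              request_resched(&rt);
              printf("fair: now=%d lazy=%d   rt: now=%d\n",
                     fair.need_resched, fair.need_resched_lazy,
                     rt.need_resched);

              tick(&fair); /* the deferred request becomes a real one */
              printf("after tick, fair: now=%d\n", fair.need_resched);
              return 0;
      }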

  Fair scheduler:

   - Optimize the PLACE_LAG when se->vlag is zero (Huang Shijie)

  Idle loop:

   - Optimize the generic idle loop by removing an unnecessary memory
     barrier (Zhongqiu Han)

  RSEQ:

   - Improve cache locality of RSEQ concurrency IDs for intermittent
     workloads (Mathieu Desnoyers)
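
   For context, a minimal user-space sketch of what a concurrency ID
   looks like, assuming glibc 2.35+ (which registers the rseq area and
   exports __rseq_offset/__rseq_size) and 6.3+ UAPI headers (where
   struct rseq has an mm_cid field); thread_rseq() is a helper invented
   for the example:

      #include <pthread.h>
      #include <stddef.h>
      #include <stdio.h>
      #include <sys/rseq.h>  /* __rseq_offset, __rseq_size, struct rseq */

      static struct rseq *thread_rseq(void)
      {
              /* glibc places the rseq area at a fixed offset from the
               * thread pointer (GCC/Clang builtin, x86-64/arm64). */
              return (struct rseq *)((char *)__builtin_thread_pointer() +
                                     __rseq_offset);
      }

      static void *worker(void *arg)
      {
              struct rseq *rs = thread_rseq();

              /* mm_cid is a compact per-process ID (roughly in
               * [0, nr_threads)), unlike cpu_id; reading it outside an
               * rseq critical section is only a snapshot. */
              if (__rseq_size >= offsetof(struct rseq, mm_cid) +
                                 sizeof(rs->mm_cid))
                      printf("thread %ld: cpu_id=%u mm_cid=%u\n",
                             (long)arg, rs->cpu_id, rs->mm_cid);
              return NULL;
      }

      int main(void)
      {
              pthread_t t[4];

              for (long i = 0; i < 4; i++)
                      pthread_create(&t[i], NULL, worker, (void *)i);
              for (int i = 0; i < 4; i++)
                      pthread_join(t[i], NULL);
              return 0;
      }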

  Waitqueues:

   - Make wake_up_{bit,var} less fragile (Neil Brown)
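
   For reference, the usage pattern these helpers serve (a kernel-style
   sketch; my_obj and MY_FLAG_BUSY are invented for the example, this
   is not code from the series):

      #include <linux/wait_bit.h>
      #include <linux/sched.h>

      #define MY_FLAG_BUSY    0       /* bit number inside 'flags' */

      struct my_obj {
              unsigned long flags;    /* the reworked API takes
                                       * 'unsigned long *', not 'void *' */
      };

      static int my_obj_begin(struct my_obj *obj)
      {
              /* Sleep until the bit is clear, then atomically set it;
               * returns non-zero if interrupted by a signal. */
              return wait_on_bit_lock(&obj->flags, MY_FLAG_BUSY,
                                      TASK_INTERRUPTIBLE);
      }

      static void my_obj_end(struct my_obj *obj)
      {
              /* Clear the bit with release semantics and wake waiters. */
              clear_and_wake_up_bit(MY_FLAG_BUSY, &obj->flags);
      }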

  PSI:

   - Pass enqueue/dequeue flags to psi callbacks directly (Johannes
     Weiner)

  Preparatory patches for proxy execution:

   - Add move_queued_task_locked helper (Connor O'Brien)

   - Consolidate pick_*_task to task_is_pushable helper (Connor O'Brien)

   - Split out __schedule() deactivate task logic into a helper (John
     Stultz)

   - Split scheduler and execution contexts (Peter Zijlstra)

   - Make mutex::wait_lock irq safe (Juri Lelli)

   - Expose __mutex_owner() (Juri Lelli)

   - Remove wakeups from under mutex::wait_lock (Peter Zijlstra)
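
   The mutex::wait_lock changes are variations on the wake_q pattern:
   collect wakeups while holding the now irq-safe wait_lock and issue
   them only after dropping it. A kernel-style sketch (the foo
   structure is invented for the example, this is not code from the
   series):

      #include <linux/sched.h>
      #include <linux/sched/wake_q.h>
      #include <linux/spinlock.h>

      struct foo {
              raw_spinlock_t          lock;
              struct task_struct      *waiter; /* task to wake, if any */
      };

      static void foo_release(struct foo *f)
      {
              DEFINE_WAKE_Q(wake_q);
              unsigned long flags;

              raw_spin_lock_irqsave(&f->lock, flags);
              if (f->waiter) {
                      wake_q_add(&wake_q, f->waiter); /* just queue it */
                      f->waiter = NULL;
              }
              raw_spin_unlock_irqrestore(&f->lock, flags);

              /* Wakeups happen here, without holding the lock. */
              wake_up_q(&wake_q);
      }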

  Misc fixes and cleanups:

   - Remove unused __HAVE_THREAD_FUNCTIONS hook support (David
     Disseldorp)

   - Update the comment for TIF_NEED_RESCHED_LAZY (Sebastian Andrzej
     Siewior)

   - Remove unused bit_wait_io_timeout (Dr. David Alan Gilbert)

   - remove the DOUBLE_TICK feature (Huang Shijie)

   - fix the comment for PREEMPT_SHORT (Huang Shijie)

   - Fix unused variable warning (Christian Loehle)

   - No PREEMPT_RT=y for all{yes,mod}config"

* tag 'sched-core-2024-11-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
  sched, x86: Update the comment for TIF_NEED_RESCHED_LAZY.
  sched: No PREEMPT_RT=y for all{yes,mod}config
  riscv: add PREEMPT_LAZY support
  sched, x86: Enable Lazy preemption
  sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT
  sched: Add Lazy preemption model
  sched: Add TIF_NEED_RESCHED_LAZY infrastructure
  sched/ext: Remove sched_fork() hack
  sched: Initialize idle tasks only once
  sched: psi: pass enqueue/dequeue flags to psi callbacks directly
  sched/uclamp: Fix unnused variable warning
  sched: Split scheduler and execution contexts
  sched: Split out __schedule() deactivate task logic into a helper
  sched: Consolidate pick_*_task to task_is_pushable helper
  sched: Add move_queued_task_locked helper
  locking/mutex: Expose __mutex_owner()
  locking/mutex: Make mutex::wait_lock irq safe
  locking/mutex: Remove wakeups from under mutex::wait_lock
  sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads
  sched: idle: Optimize the generic idle loop by removing needless memory barrier
  ...
commit 3f020399e4
Linus Torvalds, 2024-11-19 14:16:06 -08:00
42 changed files with 1106 additions and 505 deletions

Highlights of the per-file diff: the ARCH_HAS_PREEMPT_LAZY selects and
TIF_NEED_RESCHED_LAZY definitions for x86 and riscv; handling of
_TIF_NEED_RESCHED_LAZY in the generic entry and KVM exit-to-guest work
loops; the new PREEMPT_LAZY entry in the preemption-model Kconfig
choice; the per-mm mm_cid/mm_cpus_allowed state backing the RSEQ
concurrency-ID rework; the wait_bit/wait_var_event API and kernel-doc
overhaul; and the conversion of mutex::wait_lock to irqsave locking
with wake_q-deferred wakeups in the futex PI and mutex slow paths.
debug_mutex_unlock(lock); debug_mutex_unlock(lock);
if (!list_empty(&lock->wait_list)) { if (!list_empty(&lock->wait_list)) {
/* get the first entry from the wait-list: */ /* get the first entry from the wait-list: */
@ -951,9 +935,10 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigne
if (owner & MUTEX_FLAG_HANDOFF) if (owner & MUTEX_FLAG_HANDOFF)
__mutex_handoff(lock, next); __mutex_handoff(lock, next);
raw_spin_unlock(&lock->wait_lock); preempt_disable();
raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
wake_up_q(&wake_q); wake_up_q(&wake_q);
preempt_enable();
} }
#ifndef CONFIG_DEBUG_LOCK_ALLOC #ifndef CONFIG_DEBUG_LOCK_ALLOC


@ -20,6 +20,33 @@ struct mutex_waiter {
#endif #endif
}; };
/*
* @owner: contains: 'struct task_struct *' to the current lock owner,
* NULL means not owned. Since task_struct pointers are aligned at
* at least L1_CACHE_BYTES, we have low bits to store extra state.
*
* Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
* Bit1 indicates unlock needs to hand the lock to the top-waiter
* Bit2 indicates handoff has been done and we're waiting for pickup.
*/
#define MUTEX_FLAG_WAITERS 0x01
#define MUTEX_FLAG_HANDOFF 0x02
#define MUTEX_FLAG_PICKUP 0x04
#define MUTEX_FLAGS 0x07
/*
* Internal helper function; C doesn't allow us to hide it :/
*
* DO NOT USE (outside of mutex & scheduler code).
*/
static inline struct task_struct *__mutex_owner(struct mutex *lock)
{
if (!lock)
return NULL;
return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
}
#ifdef CONFIG_DEBUG_MUTEXES #ifdef CONFIG_DEBUG_MUTEXES
extern void debug_mutex_lock_common(struct mutex *lock, extern void debug_mutex_lock_common(struct mutex *lock,
struct mutex_waiter *waiter); struct mutex_waiter *waiter);
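The comment block and __mutex_owner() moved into this header describe how mutex::owner multiplexes the owner task pointer with three low flag bits; exposing the helper here (now NULL-tolerant) is what lets scheduler code query the owner, per the "mutex & scheduler code" note above. A minimal decoding sketch, illustration only, using just the definitions above:

    unsigned long val = atomic_long_read(&lock->owner);
    struct task_struct *owner = (struct task_struct *)(val & ~MUTEX_FLAGS);
    bool has_waiters = val & MUTEX_FLAG_WAITERS;   /* unlock must issue a wakeup        */
    bool handoff     = val & MUTEX_FLAG_HANDOFF;   /* hand the lock to the top waiter   */
    bool pickup      = val & MUTEX_FLAG_PICKUP;    /* handoff done, waiting for pickup  */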


@ -34,13 +34,15 @@
static inline int __ww_mutex_add_waiter(struct rt_mutex_waiter *waiter, static inline int __ww_mutex_add_waiter(struct rt_mutex_waiter *waiter,
struct rt_mutex *lock, struct rt_mutex *lock,
struct ww_acquire_ctx *ww_ctx) struct ww_acquire_ctx *ww_ctx,
struct wake_q_head *wake_q)
{ {
return 0; return 0;
} }
static inline void __ww_mutex_check_waiters(struct rt_mutex *lock, static inline void __ww_mutex_check_waiters(struct rt_mutex *lock,
struct ww_acquire_ctx *ww_ctx) struct ww_acquire_ctx *ww_ctx,
struct wake_q_head *wake_q)
{ {
} }
@ -1201,7 +1203,8 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
struct rt_mutex_waiter *waiter, struct rt_mutex_waiter *waiter,
struct task_struct *task, struct task_struct *task,
struct ww_acquire_ctx *ww_ctx, struct ww_acquire_ctx *ww_ctx,
enum rtmutex_chainwalk chwalk) enum rtmutex_chainwalk chwalk,
struct wake_q_head *wake_q)
{ {
struct task_struct *owner = rt_mutex_owner(lock); struct task_struct *owner = rt_mutex_owner(lock);
struct rt_mutex_waiter *top_waiter = waiter; struct rt_mutex_waiter *top_waiter = waiter;
@ -1245,7 +1248,10 @@ static int __sched task_blocks_on_rt_mutex(struct rt_mutex_base *lock,
/* Check whether the waiter should back out immediately */ /* Check whether the waiter should back out immediately */
rtm = container_of(lock, struct rt_mutex, rtmutex); rtm = container_of(lock, struct rt_mutex, rtmutex);
res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx); preempt_disable();
res = __ww_mutex_add_waiter(waiter, rtm, ww_ctx, wake_q);
wake_up_q(wake_q);
preempt_enable();
if (res) { if (res) {
raw_spin_lock(&task->pi_lock); raw_spin_lock(&task->pi_lock);
rt_mutex_dequeue(lock, waiter); rt_mutex_dequeue(lock, waiter);
@ -1675,12 +1681,14 @@ static void __sched rt_mutex_handle_deadlock(int res, int detect_deadlock,
* @state: The task state for sleeping * @state: The task state for sleeping
* @chwalk: Indicator whether full or partial chainwalk is requested * @chwalk: Indicator whether full or partial chainwalk is requested
* @waiter: Initializer waiter for blocking * @waiter: Initializer waiter for blocking
* @wake_q: The wake_q to wake tasks after we release the wait_lock
*/ */
static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock, static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
struct ww_acquire_ctx *ww_ctx, struct ww_acquire_ctx *ww_ctx,
unsigned int state, unsigned int state,
enum rtmutex_chainwalk chwalk, enum rtmutex_chainwalk chwalk,
struct rt_mutex_waiter *waiter) struct rt_mutex_waiter *waiter,
struct wake_q_head *wake_q)
{ {
struct rt_mutex *rtm = container_of(lock, struct rt_mutex, rtmutex); struct rt_mutex *rtm = container_of(lock, struct rt_mutex, rtmutex);
struct ww_mutex *ww = ww_container_of(rtm); struct ww_mutex *ww = ww_container_of(rtm);
@ -1691,7 +1699,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
/* Try to acquire the lock again: */ /* Try to acquire the lock again: */
if (try_to_take_rt_mutex(lock, current, NULL)) { if (try_to_take_rt_mutex(lock, current, NULL)) {
if (build_ww_mutex() && ww_ctx) { if (build_ww_mutex() && ww_ctx) {
__ww_mutex_check_waiters(rtm, ww_ctx); __ww_mutex_check_waiters(rtm, ww_ctx, wake_q);
ww_mutex_lock_acquired(ww, ww_ctx); ww_mutex_lock_acquired(ww, ww_ctx);
} }
return 0; return 0;
@ -1701,7 +1709,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
trace_contention_begin(lock, LCB_F_RT); trace_contention_begin(lock, LCB_F_RT);
ret = task_blocks_on_rt_mutex(lock, waiter, current, ww_ctx, chwalk); ret = task_blocks_on_rt_mutex(lock, waiter, current, ww_ctx, chwalk, wake_q);
if (likely(!ret)) if (likely(!ret))
ret = rt_mutex_slowlock_block(lock, ww_ctx, state, NULL, waiter); ret = rt_mutex_slowlock_block(lock, ww_ctx, state, NULL, waiter);
@ -1709,7 +1717,7 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
/* acquired the lock */ /* acquired the lock */
if (build_ww_mutex() && ww_ctx) { if (build_ww_mutex() && ww_ctx) {
if (!ww_ctx->is_wait_die) if (!ww_ctx->is_wait_die)
__ww_mutex_check_waiters(rtm, ww_ctx); __ww_mutex_check_waiters(rtm, ww_ctx, wake_q);
ww_mutex_lock_acquired(ww, ww_ctx); ww_mutex_lock_acquired(ww, ww_ctx);
} }
} else { } else {
@ -1731,7 +1739,8 @@ static int __sched __rt_mutex_slowlock(struct rt_mutex_base *lock,
static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock, static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
struct ww_acquire_ctx *ww_ctx, struct ww_acquire_ctx *ww_ctx,
unsigned int state) unsigned int state,
struct wake_q_head *wake_q)
{ {
struct rt_mutex_waiter waiter; struct rt_mutex_waiter waiter;
int ret; int ret;
@ -1740,7 +1749,7 @@ static inline int __rt_mutex_slowlock_locked(struct rt_mutex_base *lock,
waiter.ww_ctx = ww_ctx; waiter.ww_ctx = ww_ctx;
ret = __rt_mutex_slowlock(lock, ww_ctx, state, RT_MUTEX_MIN_CHAINWALK, ret = __rt_mutex_slowlock(lock, ww_ctx, state, RT_MUTEX_MIN_CHAINWALK,
&waiter); &waiter, wake_q);
debug_rt_mutex_free_waiter(&waiter); debug_rt_mutex_free_waiter(&waiter);
return ret; return ret;
@ -1756,6 +1765,7 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
struct ww_acquire_ctx *ww_ctx, struct ww_acquire_ctx *ww_ctx,
unsigned int state) unsigned int state)
{ {
DEFINE_WAKE_Q(wake_q);
unsigned long flags; unsigned long flags;
int ret; int ret;
@ -1777,8 +1787,11 @@ static int __sched rt_mutex_slowlock(struct rt_mutex_base *lock,
* irqsave/restore variants. * irqsave/restore variants.
*/ */
raw_spin_lock_irqsave(&lock->wait_lock, flags); raw_spin_lock_irqsave(&lock->wait_lock, flags);
ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state); ret = __rt_mutex_slowlock_locked(lock, ww_ctx, state, &wake_q);
preempt_disable();
raw_spin_unlock_irqrestore(&lock->wait_lock, flags); raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
wake_up_q(&wake_q);
preempt_enable();
rt_mutex_post_schedule(); rt_mutex_post_schedule();
return ret; return ret;
@ -1804,8 +1817,10 @@ static __always_inline int __rt_mutex_lock(struct rt_mutex_base *lock,
/** /**
* rtlock_slowlock_locked - Slow path lock acquisition for RT locks * rtlock_slowlock_locked - Slow path lock acquisition for RT locks
* @lock: The underlying RT mutex * @lock: The underlying RT mutex
* @wake_q: The wake_q to wake tasks after we release the wait_lock
*/ */
static void __sched rtlock_slowlock_locked(struct rt_mutex_base *lock) static void __sched rtlock_slowlock_locked(struct rt_mutex_base *lock,
struct wake_q_head *wake_q)
__releases(&lock->wait_lock) __acquires(&lock->wait_lock) __releases(&lock->wait_lock) __acquires(&lock->wait_lock)
{ {
struct rt_mutex_waiter waiter; struct rt_mutex_waiter waiter;
@ -1823,7 +1838,7 @@ static void __sched rtlock_slowlock_locked(struct rt_mutex_base *lock)
trace_contention_begin(lock, LCB_F_RT); trace_contention_begin(lock, LCB_F_RT);
task_blocks_on_rt_mutex(lock, &waiter, current, NULL, RT_MUTEX_MIN_CHAINWALK); task_blocks_on_rt_mutex(lock, &waiter, current, NULL, RT_MUTEX_MIN_CHAINWALK, wake_q);
for (;;) { for (;;) {
/* Try to acquire the lock again */ /* Try to acquire the lock again */
@ -1834,7 +1849,11 @@ static void __sched rtlock_slowlock_locked(struct rt_mutex_base *lock)
owner = rt_mutex_owner(lock); owner = rt_mutex_owner(lock);
else else
owner = NULL; owner = NULL;
preempt_disable();
raw_spin_unlock_irq(&lock->wait_lock); raw_spin_unlock_irq(&lock->wait_lock);
wake_up_q(wake_q);
wake_q_init(wake_q);
preempt_enable();
if (!owner || !rtmutex_spin_on_owner(lock, &waiter, owner)) if (!owner || !rtmutex_spin_on_owner(lock, &waiter, owner))
schedule_rtlock(); schedule_rtlock();
@ -1859,10 +1878,14 @@ static void __sched rtlock_slowlock_locked(struct rt_mutex_base *lock)
static __always_inline void __sched rtlock_slowlock(struct rt_mutex_base *lock) static __always_inline void __sched rtlock_slowlock(struct rt_mutex_base *lock)
{ {
unsigned long flags; unsigned long flags;
DEFINE_WAKE_Q(wake_q);
raw_spin_lock_irqsave(&lock->wait_lock, flags); raw_spin_lock_irqsave(&lock->wait_lock, flags);
rtlock_slowlock_locked(lock); rtlock_slowlock_locked(lock, &wake_q);
preempt_disable();
raw_spin_unlock_irqrestore(&lock->wait_lock, flags); raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
wake_up_q(&wake_q);
preempt_enable();
} }
#endif /* RT_MUTEX_BUILD_SPINLOCKS */ #endif /* RT_MUTEX_BUILD_SPINLOCKS */


@ -275,6 +275,7 @@ void __sched rt_mutex_proxy_unlock(struct rt_mutex_base *lock)
* @lock: the rt_mutex to take * @lock: the rt_mutex to take
* @waiter: the pre-initialized rt_mutex_waiter * @waiter: the pre-initialized rt_mutex_waiter
* @task: the task to prepare * @task: the task to prepare
* @wake_q: the wake_q to wake tasks after we release the wait_lock
* *
* Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock * Starts the rt_mutex acquire; it enqueues the @waiter and does deadlock
* detection. It does not wait, see rt_mutex_wait_proxy_lock() for that. * detection. It does not wait, see rt_mutex_wait_proxy_lock() for that.
@ -291,7 +292,8 @@ void __sched rt_mutex_proxy_unlock(struct rt_mutex_base *lock)
*/ */
int __sched __rt_mutex_start_proxy_lock(struct rt_mutex_base *lock, int __sched __rt_mutex_start_proxy_lock(struct rt_mutex_base *lock,
struct rt_mutex_waiter *waiter, struct rt_mutex_waiter *waiter,
struct task_struct *task) struct task_struct *task,
struct wake_q_head *wake_q)
{ {
int ret; int ret;
@ -302,7 +304,7 @@ int __sched __rt_mutex_start_proxy_lock(struct rt_mutex_base *lock,
/* We enforce deadlock detection for futexes */ /* We enforce deadlock detection for futexes */
ret = task_blocks_on_rt_mutex(lock, waiter, task, NULL, ret = task_blocks_on_rt_mutex(lock, waiter, task, NULL,
RT_MUTEX_FULL_CHAINWALK); RT_MUTEX_FULL_CHAINWALK, wake_q);
if (ret && !rt_mutex_owner(lock)) { if (ret && !rt_mutex_owner(lock)) {
/* /*
@ -341,12 +343,16 @@ int __sched rt_mutex_start_proxy_lock(struct rt_mutex_base *lock,
struct task_struct *task) struct task_struct *task)
{ {
int ret; int ret;
DEFINE_WAKE_Q(wake_q);
raw_spin_lock_irq(&lock->wait_lock); raw_spin_lock_irq(&lock->wait_lock);
ret = __rt_mutex_start_proxy_lock(lock, waiter, task); ret = __rt_mutex_start_proxy_lock(lock, waiter, task, &wake_q);
if (unlikely(ret)) if (unlikely(ret))
remove_waiter(lock, waiter); remove_waiter(lock, waiter);
preempt_disable();
raw_spin_unlock_irq(&lock->wait_lock); raw_spin_unlock_irq(&lock->wait_lock);
wake_up_q(&wake_q);
preempt_enable();
return ret; return ret;
} }


@ -83,7 +83,8 @@ extern void rt_mutex_init_proxy_locked(struct rt_mutex_base *lock,
extern void rt_mutex_proxy_unlock(struct rt_mutex_base *lock); extern void rt_mutex_proxy_unlock(struct rt_mutex_base *lock);
extern int __rt_mutex_start_proxy_lock(struct rt_mutex_base *lock, extern int __rt_mutex_start_proxy_lock(struct rt_mutex_base *lock,
struct rt_mutex_waiter *waiter, struct rt_mutex_waiter *waiter,
struct task_struct *task); struct task_struct *task,
struct wake_q_head *);
extern int rt_mutex_start_proxy_lock(struct rt_mutex_base *lock, extern int rt_mutex_start_proxy_lock(struct rt_mutex_base *lock,
struct rt_mutex_waiter *waiter, struct rt_mutex_waiter *waiter,
struct task_struct *task); struct task_struct *task);


@ -69,6 +69,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
unsigned int state) unsigned int state)
{ {
struct rt_mutex_base *rtm = &rwb->rtmutex; struct rt_mutex_base *rtm = &rwb->rtmutex;
DEFINE_WAKE_Q(wake_q);
int ret; int ret;
rwbase_pre_schedule(); rwbase_pre_schedule();
@ -110,7 +111,7 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
* For rwlocks this returns 0 unconditionally, so the below * For rwlocks this returns 0 unconditionally, so the below
* !ret conditionals are optimized out. * !ret conditionals are optimized out.
*/ */
ret = rwbase_rtmutex_slowlock_locked(rtm, state); ret = rwbase_rtmutex_slowlock_locked(rtm, state, &wake_q);
/* /*
* On success the rtmutex is held, so there can't be a writer * On success the rtmutex is held, so there can't be a writer
@ -121,7 +122,12 @@ static int __sched __rwbase_read_lock(struct rwbase_rt *rwb,
*/ */
if (!ret) if (!ret)
atomic_inc(&rwb->readers); atomic_inc(&rwb->readers);
preempt_disable();
raw_spin_unlock_irq(&rtm->wait_lock); raw_spin_unlock_irq(&rtm->wait_lock);
wake_up_q(&wake_q);
preempt_enable();
if (!ret) if (!ret)
rwbase_rtmutex_unlock(rtm); rwbase_rtmutex_unlock(rtm);


@ -1413,8 +1413,8 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
#define rwbase_rtmutex_lock_state(rtm, state) \ #define rwbase_rtmutex_lock_state(rtm, state) \
__rt_mutex_lock(rtm, state) __rt_mutex_lock(rtm, state)
#define rwbase_rtmutex_slowlock_locked(rtm, state) \ #define rwbase_rtmutex_slowlock_locked(rtm, state, wq) \
__rt_mutex_slowlock_locked(rtm, NULL, state) __rt_mutex_slowlock_locked(rtm, NULL, state, wq)
#define rwbase_rtmutex_unlock(rtm) \ #define rwbase_rtmutex_unlock(rtm) \
__rt_mutex_unlock(rtm) __rt_mutex_unlock(rtm)


@ -162,9 +162,10 @@ rwbase_rtmutex_lock_state(struct rt_mutex_base *rtm, unsigned int state)
} }
static __always_inline int static __always_inline int
rwbase_rtmutex_slowlock_locked(struct rt_mutex_base *rtm, unsigned int state) rwbase_rtmutex_slowlock_locked(struct rt_mutex_base *rtm, unsigned int state,
struct wake_q_head *wake_q)
{ {
rtlock_slowlock_locked(rtm); rtlock_slowlock_locked(rtm, wake_q);
return 0; return 0;
} }


@ -70,14 +70,14 @@ __ww_mutex_has_waiters(struct mutex *lock)
return atomic_long_read(&lock->owner) & MUTEX_FLAG_WAITERS; return atomic_long_read(&lock->owner) & MUTEX_FLAG_WAITERS;
} }
static inline void lock_wait_lock(struct mutex *lock) static inline void lock_wait_lock(struct mutex *lock, unsigned long *flags)
{ {
raw_spin_lock(&lock->wait_lock); raw_spin_lock_irqsave(&lock->wait_lock, *flags);
} }
static inline void unlock_wait_lock(struct mutex *lock) static inline void unlock_wait_lock(struct mutex *lock, unsigned long *flags)
{ {
raw_spin_unlock(&lock->wait_lock); raw_spin_unlock_irqrestore(&lock->wait_lock, *flags);
} }
static inline void lockdep_assert_wait_lock_held(struct mutex *lock) static inline void lockdep_assert_wait_lock_held(struct mutex *lock)
@ -144,14 +144,14 @@ __ww_mutex_has_waiters(struct rt_mutex *lock)
return rt_mutex_has_waiters(&lock->rtmutex); return rt_mutex_has_waiters(&lock->rtmutex);
} }
static inline void lock_wait_lock(struct rt_mutex *lock) static inline void lock_wait_lock(struct rt_mutex *lock, unsigned long *flags)
{ {
raw_spin_lock(&lock->rtmutex.wait_lock); raw_spin_lock_irqsave(&lock->rtmutex.wait_lock, *flags);
} }
static inline void unlock_wait_lock(struct rt_mutex *lock) static inline void unlock_wait_lock(struct rt_mutex *lock, unsigned long *flags)
{ {
raw_spin_unlock(&lock->rtmutex.wait_lock); raw_spin_unlock_irqrestore(&lock->rtmutex.wait_lock, *flags);
} }
static inline void lockdep_assert_wait_lock_held(struct rt_mutex *lock) static inline void lockdep_assert_wait_lock_held(struct rt_mutex *lock)
@ -275,7 +275,7 @@ __ww_ctx_less(struct ww_acquire_ctx *a, struct ww_acquire_ctx *b)
*/ */
static bool static bool
__ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter, __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
struct ww_acquire_ctx *ww_ctx) struct ww_acquire_ctx *ww_ctx, struct wake_q_head *wake_q)
{ {
if (!ww_ctx->is_wait_die) if (!ww_ctx->is_wait_die)
return false; return false;
@ -284,7 +284,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
#ifndef WW_RT #ifndef WW_RT
debug_mutex_wake_waiter(lock, waiter); debug_mutex_wake_waiter(lock, waiter);
#endif #endif
wake_up_process(waiter->task); wake_q_add(wake_q, waiter->task);
} }
return true; return true;
@ -299,7 +299,8 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
*/ */
static bool __ww_mutex_wound(struct MUTEX *lock, static bool __ww_mutex_wound(struct MUTEX *lock,
struct ww_acquire_ctx *ww_ctx, struct ww_acquire_ctx *ww_ctx,
struct ww_acquire_ctx *hold_ctx) struct ww_acquire_ctx *hold_ctx,
struct wake_q_head *wake_q)
{ {
struct task_struct *owner = __ww_mutex_owner(lock); struct task_struct *owner = __ww_mutex_owner(lock);
@ -331,7 +332,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* wakeup pending to re-read the wounded state. * wakeup pending to re-read the wounded state.
*/ */
if (owner != current) if (owner != current)
wake_up_process(owner); wake_q_add(wake_q, owner);
return true; return true;
} }
@ -352,7 +353,8 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
* The current task must not be on the wait list. * The current task must not be on the wait list.
*/ */
static void static void
__ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx) __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx,
struct wake_q_head *wake_q)
{ {
struct MUTEX_WAITER *cur; struct MUTEX_WAITER *cur;
@ -364,8 +366,8 @@ __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
if (!cur->ww_ctx) if (!cur->ww_ctx)
continue; continue;
if (__ww_mutex_die(lock, cur, ww_ctx) || if (__ww_mutex_die(lock, cur, ww_ctx, wake_q) ||
__ww_mutex_wound(lock, cur->ww_ctx, ww_ctx)) __ww_mutex_wound(lock, cur->ww_ctx, ww_ctx, wake_q))
break; break;
} }
} }
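Both die/wound helpers above now defer the victim's wakeup to the caller-supplied wake_q instead of waking it directly while the wait_lock is held; the caller flushes the queue once the lock has been dropped. Schematically, illustration only:

    /* before: immediate wakeup while holding wait_lock */
    wake_up_process(waiter->task);

    /* after: queued, flushed by the caller after wait_lock is released */
    wake_q_add(wake_q, waiter->task);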
@ -377,6 +379,9 @@ __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
static __always_inline void static __always_inline void
ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx) ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
{ {
DEFINE_WAKE_Q(wake_q);
unsigned long flags;
ww_mutex_lock_acquired(lock, ctx); ww_mutex_lock_acquired(lock, ctx);
/* /*
@ -404,9 +409,12 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
* Uh oh, we raced in fastpath, check if any of the waiters need to * Uh oh, we raced in fastpath, check if any of the waiters need to
* die or wound us. * die or wound us.
*/ */
lock_wait_lock(&lock->base); lock_wait_lock(&lock->base, &flags);
__ww_mutex_check_waiters(&lock->base, ctx); __ww_mutex_check_waiters(&lock->base, ctx, &wake_q);
unlock_wait_lock(&lock->base); preempt_disable();
unlock_wait_lock(&lock->base, &flags);
wake_up_q(&wake_q);
preempt_enable();
} }
static __always_inline int static __always_inline int
@ -488,7 +496,8 @@ __ww_mutex_check_kill(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
static inline int static inline int
__ww_mutex_add_waiter(struct MUTEX_WAITER *waiter, __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
struct MUTEX *lock, struct MUTEX *lock,
struct ww_acquire_ctx *ww_ctx) struct ww_acquire_ctx *ww_ctx,
struct wake_q_head *wake_q)
{ {
struct MUTEX_WAITER *cur, *pos = NULL; struct MUTEX_WAITER *cur, *pos = NULL;
bool is_wait_die; bool is_wait_die;
@ -532,7 +541,7 @@ __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
pos = cur; pos = cur;
/* Wait-Die: ensure younger waiters die. */ /* Wait-Die: ensure younger waiters die. */
__ww_mutex_die(lock, cur, ww_ctx); __ww_mutex_die(lock, cur, ww_ctx, wake_q);
} }
__ww_waiter_add(lock, waiter, pos); __ww_waiter_add(lock, waiter, pos);
@ -550,7 +559,7 @@ __ww_mutex_add_waiter(struct MUTEX_WAITER *waiter,
* such that either we or the fastpath will wound @ww->ctx. * such that either we or the fastpath will wound @ww->ctx.
*/ */
smp_mb(); smp_mb();
__ww_mutex_wound(lock, ww_ctx, ww->ctx); __ww_mutex_wound(lock, ww_ctx, ww->ctx, wake_q);
} }
return 0; return 0;


@ -832,7 +832,7 @@ static enum hrtimer_restart hrtick(struct hrtimer *timer)
rq_lock(rq, &rf); rq_lock(rq, &rf);
update_rq_clock(rq); update_rq_clock(rq);
rq->curr->sched_class->task_tick(rq, rq->curr, 1); rq->donor->sched_class->task_tick(rq, rq->curr, 1);
rq_unlock(rq, &rf); rq_unlock(rq, &rf);
return HRTIMER_NORESTART; return HRTIMER_NORESTART;
@ -941,10 +941,9 @@ static inline void hrtick_rq_init(struct rq *rq)
* this avoids any races wrt polling state changes and thereby avoids * this avoids any races wrt polling state changes and thereby avoids
* spurious IPIs. * spurious IPIs.
*/ */
static inline bool set_nr_and_not_polling(struct task_struct *p) static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
{ {
struct thread_info *ti = task_thread_info(p); return !(fetch_or(&ti->flags, 1 << tif) & _TIF_POLLING_NRFLAG);
return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
} }
/* /*
@ -969,9 +968,9 @@ static bool set_nr_if_polling(struct task_struct *p)
} }
#else #else
static inline bool set_nr_and_not_polling(struct task_struct *p) static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
{ {
set_tsk_need_resched(p); set_ti_thread_flag(ti, tif);
return true; return true;
} }
@ -1076,28 +1075,70 @@ void wake_up_q(struct wake_q_head *head)
* might also involve a cross-CPU call to trigger the scheduler on * might also involve a cross-CPU call to trigger the scheduler on
* the target CPU. * the target CPU.
*/ */
void resched_curr(struct rq *rq) static void __resched_curr(struct rq *rq, int tif)
{ {
struct task_struct *curr = rq->curr; struct task_struct *curr = rq->curr;
struct thread_info *cti = task_thread_info(curr);
int cpu; int cpu;
lockdep_assert_rq_held(rq); lockdep_assert_rq_held(rq);
if (test_tsk_need_resched(curr)) /*
* Always immediately preempt the idle task; no point in delaying doing
* actual work.
*/
if (is_idle_task(curr) && tif == TIF_NEED_RESCHED_LAZY)
tif = TIF_NEED_RESCHED;
if (cti->flags & ((1 << tif) | _TIF_NEED_RESCHED))
return; return;
cpu = cpu_of(rq); cpu = cpu_of(rq);
if (cpu == smp_processor_id()) { if (cpu == smp_processor_id()) {
set_tsk_need_resched(curr); set_ti_thread_flag(cti, tif);
if (tif == TIF_NEED_RESCHED)
set_preempt_need_resched(); set_preempt_need_resched();
return; return;
} }
if (set_nr_and_not_polling(curr)) if (set_nr_and_not_polling(cti, tif)) {
if (tif == TIF_NEED_RESCHED)
smp_send_reschedule(cpu); smp_send_reschedule(cpu);
else } else {
trace_sched_wake_idle_without_ipi(cpu); trace_sched_wake_idle_without_ipi(cpu);
}
}
void resched_curr(struct rq *rq)
{
__resched_curr(rq, TIF_NEED_RESCHED);
}
#ifdef CONFIG_PREEMPT_DYNAMIC
static DEFINE_STATIC_KEY_FALSE(sk_dynamic_preempt_lazy);
static __always_inline bool dynamic_preempt_lazy(void)
{
return static_branch_unlikely(&sk_dynamic_preempt_lazy);
}
#else
static __always_inline bool dynamic_preempt_lazy(void)
{
return IS_ENABLED(CONFIG_PREEMPT_LAZY);
}
#endif
static __always_inline int get_lazy_tif_bit(void)
{
if (dynamic_preempt_lazy())
return TIF_NEED_RESCHED_LAZY;
return TIF_NEED_RESCHED;
}
void resched_curr_lazy(struct rq *rq)
{
__resched_curr(rq, get_lazy_tif_bit());
} }
void resched_cpu(int cpu) void resched_cpu(int cpu)
@ -1192,7 +1233,7 @@ static void wake_up_idle_cpu(int cpu)
* and testing of the above solutions didn't appear to report * and testing of the above solutions didn't appear to report
* much benefits. * much benefits.
*/ */
if (set_nr_and_not_polling(rq->idle)) if (set_nr_and_not_polling(task_thread_info(rq->idle), TIF_NEED_RESCHED))
smp_send_reschedule(cpu); smp_send_reschedule(cpu);
else else
trace_sched_wake_idle_without_ipi(cpu); trace_sched_wake_idle_without_ipi(cpu);
@ -1399,7 +1440,7 @@ void set_load_weight(struct task_struct *p, bool update_load)
* requests are serialized using a mutex to reduce the risk of conflicting * requests are serialized using a mutex to reduce the risk of conflicting
* updates or API abuses. * updates or API abuses.
*/ */
static DEFINE_MUTEX(uclamp_mutex); static __maybe_unused DEFINE_MUTEX(uclamp_mutex);
/* Max allowed minimum utilization */ /* Max allowed minimum utilization */
static unsigned int __maybe_unused sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE; static unsigned int __maybe_unused sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
@ -2024,10 +2065,10 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
*/ */
uclamp_rq_inc(rq, p); uclamp_rq_inc(rq, p);
if (!(flags & ENQUEUE_RESTORE)) { psi_enqueue(p, flags);
if (!(flags & ENQUEUE_RESTORE))
sched_info_enqueue(rq, p); sched_info_enqueue(rq, p);
psi_enqueue(p, flags & ENQUEUE_MIGRATED);
}
if (sched_core_enabled(rq)) if (sched_core_enabled(rq))
sched_core_enqueue(rq, p); sched_core_enqueue(rq, p);
@ -2044,10 +2085,10 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
if (!(flags & DEQUEUE_NOCLOCK)) if (!(flags & DEQUEUE_NOCLOCK))
update_rq_clock(rq); update_rq_clock(rq);
if (!(flags & DEQUEUE_SAVE)) { if (!(flags & DEQUEUE_SAVE))
sched_info_dequeue(rq, p); sched_info_dequeue(rq, p);
psi_dequeue(p, !(flags & DEQUEUE_SLEEP));
} psi_dequeue(p, flags);
/* /*
* Must be before ->dequeue_task() because ->dequeue_task() can 'fail' * Must be before ->dequeue_task() because ->dequeue_task() can 'fail'
@ -2135,16 +2176,18 @@ void check_class_changed(struct rq *rq, struct task_struct *p,
void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags) void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
{ {
if (p->sched_class == rq->curr->sched_class) struct task_struct *donor = rq->donor;
rq->curr->sched_class->wakeup_preempt(rq, p, flags);
else if (sched_class_above(p->sched_class, rq->curr->sched_class)) if (p->sched_class == donor->sched_class)
donor->sched_class->wakeup_preempt(rq, p, flags);
else if (sched_class_above(p->sched_class, donor->sched_class))
resched_curr(rq); resched_curr(rq);
/* /*
* A queue event has occurred, and we're going to schedule. In * A queue event has occurred, and we're going to schedule. In
* this case, we can save a useless back to back clock update. * this case, we can save a useless back to back clock update.
*/ */
if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr)) if (task_on_rq_queued(donor) && test_tsk_need_resched(rq->curr))
rq_clock_skip_update(rq); rq_clock_skip_update(rq);
} }
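wakeup_preempt() is typical of the rq->donor conversions in this series: decisions that depend on the scheduling context (class comparison, preemption policy, runtime accounting) now consult rq->donor, while operations on the task that is physically running (need-resched flags, IPIs) keep using rq->curr. In this preparatory state the two always refer to the same task, kept in sync by rq_set_donor() in __schedule() and init_idle() further down, and the remote-tick hunk below still warns if they ever differ. A hypothetical pair of accessors to illustrate the split (not real kernel helpers):

    static inline struct task_struct *sched_ctx(struct rq *rq)
    {
            return rq->donor;       /* whose class/priority/accounting applies */
    }

    static inline struct task_struct *exec_ctx(struct rq *rq)
    {
            return rq->curr;        /* what the CPU is actually executing */
    }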
@ -2620,9 +2663,7 @@ int push_cpu_stop(void *arg)
// XXX validate p is still the highest prio task // XXX validate p is still the highest prio task
if (task_rq(p) == rq) { if (task_rq(p) == rq) {
deactivate_task(rq, p, 0); move_queued_task_locked(rq, lowest_rq, p);
set_task_cpu(p, lowest_rq->cpu);
activate_task(lowest_rq, p, 0);
resched_curr(lowest_rq); resched_curr(lowest_rq);
} }
@ -2682,7 +2723,7 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
lockdep_assert_held(&p->pi_lock); lockdep_assert_held(&p->pi_lock);
queued = task_on_rq_queued(p); queued = task_on_rq_queued(p);
running = task_current(rq, p); running = task_current_donor(rq, p);
if (queued) { if (queued) {
/* /*
@ -2696,6 +2737,7 @@ __do_set_cpus_allowed(struct task_struct *p, struct affinity_context *ctx)
put_prev_task(rq, p); put_prev_task(rq, p);
p->sched_class->set_cpus_allowed(p, ctx); p->sched_class->set_cpus_allowed(p, ctx);
mm_set_cpus_allowed(p->mm, ctx->new_mask);
if (queued) if (queued)
enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK); enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
@ -3308,9 +3350,7 @@ static void __migrate_swap_task(struct task_struct *p, int cpu)
rq_pin_lock(src_rq, &srf); rq_pin_lock(src_rq, &srf);
rq_pin_lock(dst_rq, &drf); rq_pin_lock(dst_rq, &drf);
deactivate_task(src_rq, p, 0); move_queued_task_locked(src_rq, dst_rq, p);
set_task_cpu(p, cpu);
activate_task(dst_rq, p, 0);
wakeup_preempt(dst_rq, p, 0); wakeup_preempt(dst_rq, p, 0);
rq_unpin_lock(dst_rq, &drf); rq_unpin_lock(dst_rq, &drf);
@ -4424,7 +4464,8 @@ int wake_up_state(struct task_struct *p, unsigned int state)
* Perform scheduler related setup for a newly forked process p. * Perform scheduler related setup for a newly forked process p.
* p is forked by current. * p is forked by current.
* *
* __sched_fork() is basic setup used by init_idle() too: * __sched_fork() is basic setup which is also used by sched_init() to
* initialize the boot CPU's idle task.
*/ */
static void __sched_fork(unsigned long clone_flags, struct task_struct *p) static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
{ {
@ -5517,7 +5558,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
* project cycles that may never be accounted to this * project cycles that may never be accounted to this
* thread, breaking clock_gettime(). * thread, breaking clock_gettime().
*/ */
if (task_current(rq, p) && task_on_rq_queued(p)) { if (task_current_donor(rq, p) && task_on_rq_queued(p)) {
prefetch_curr_exec_start(p); prefetch_curr_exec_start(p);
update_rq_clock(rq); update_rq_clock(rq);
p->sched_class->update_curr(rq); p->sched_class->update_curr(rq);
@ -5585,7 +5626,8 @@ void sched_tick(void)
{ {
int cpu = smp_processor_id(); int cpu = smp_processor_id();
struct rq *rq = cpu_rq(cpu); struct rq *rq = cpu_rq(cpu);
struct task_struct *curr; /* accounting goes to the donor task */
struct task_struct *donor;
struct rq_flags rf; struct rq_flags rf;
unsigned long hw_pressure; unsigned long hw_pressure;
u64 resched_latency; u64 resched_latency;
@ -5596,19 +5638,23 @@ void sched_tick(void)
sched_clock_tick(); sched_clock_tick();
rq_lock(rq, &rf); rq_lock(rq, &rf);
donor = rq->donor;
curr = rq->curr; psi_account_irqtime(rq, donor, NULL);
psi_account_irqtime(rq, curr, NULL);
update_rq_clock(rq); update_rq_clock(rq);
hw_pressure = arch_scale_hw_pressure(cpu_of(rq)); hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure); update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure);
curr->sched_class->task_tick(rq, curr, 0);
if (dynamic_preempt_lazy() && tif_test_bit(TIF_NEED_RESCHED_LAZY))
resched_curr(rq);
donor->sched_class->task_tick(rq, donor, 0);
if (sched_feat(LATENCY_WARN)) if (sched_feat(LATENCY_WARN))
resched_latency = cpu_resched_latency(rq); resched_latency = cpu_resched_latency(rq);
calc_global_load_tick(rq); calc_global_load_tick(rq);
sched_core_tick(rq); sched_core_tick(rq);
task_tick_mm_cid(rq, curr); task_tick_mm_cid(rq, donor);
scx_tick(rq); scx_tick(rq);
rq_unlock(rq, &rf); rq_unlock(rq, &rf);
@ -5618,8 +5664,8 @@ void sched_tick(void)
perf_event_task_tick(); perf_event_task_tick();
if (curr->flags & PF_WQ_WORKER) if (donor->flags & PF_WQ_WORKER)
wq_worker_tick(curr); wq_worker_tick(donor);
#ifdef CONFIG_SMP #ifdef CONFIG_SMP
if (!scx_switched_all()) { if (!scx_switched_all()) {
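The two lines added to sched_tick() above are where a deferred preemption request is finally honored: if TIF_NEED_RESCHED_LAZY is still set when the tick arrives, it is upgraded to an immediate reschedule (the guest-entry hunk at the top of this diff shows the flag also being checked on the way into a vCPU). A sketch of the life cycle of a fair-class preemption request under lazy preemption, illustration only, using the helpers introduced in the earlier hunks:

    /* 1) A wakeup or placement decision asks curr to yield "soon": */
    resched_curr_lazy(rq);          /* sets TIF_NEED_RESCHED_LAZY; no IPI and no
                                     * forced preemption, except that idle targets
                                     * are upgraded to TIF_NEED_RESCHED right away */

    /* 2) If nothing forces a reschedule earlier, the next tick promotes
     *    the lazy bit into a real preemption request: */
    if (dynamic_preempt_lazy() && tif_test_bit(TIF_NEED_RESCHED_LAZY))
            resched_curr(rq);       /* sets TIF_NEED_RESCHED and preempts */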
@ -5686,6 +5732,12 @@ static void sched_tick_remote(struct work_struct *work)
struct task_struct *curr = rq->curr; struct task_struct *curr = rq->curr;
if (cpu_online(cpu)) { if (cpu_online(cpu)) {
/*
* Since this is a remote tick for full dynticks mode,
* we are always sure that there is no proxy (only a
* single task is running).
*/
SCHED_WARN_ON(rq->curr != rq->donor);
update_rq_clock(rq); update_rq_clock(rq);
if (!is_idle_task(curr)) { if (!is_idle_task(curr)) {
@ -6309,10 +6361,7 @@ static bool try_steal_cookie(int this, int that)
if (sched_task_is_throttled(p, this)) if (sched_task_is_throttled(p, this))
goto next; goto next;
deactivate_task(src, p, 0); move_queued_task_locked(src, dst, p);
set_task_cpu(p, this);
activate_task(dst, p, 0);
resched_curr(dst); resched_curr(dst);
success = true; success = true;
@ -6506,6 +6555,45 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
#define SM_PREEMPT 1 #define SM_PREEMPT 1
#define SM_RTLOCK_WAIT 2 #define SM_RTLOCK_WAIT 2
/*
* Helper function for __schedule()
*
* If a task does not have signals pending, deactivate it
* Otherwise marks the task's __state as RUNNING
*/
static bool try_to_block_task(struct rq *rq, struct task_struct *p,
unsigned long task_state)
{
int flags = DEQUEUE_NOCLOCK;
if (signal_pending_state(task_state, p)) {
WRITE_ONCE(p->__state, TASK_RUNNING);
return false;
}
p->sched_contributes_to_load =
(task_state & TASK_UNINTERRUPTIBLE) &&
!(task_state & TASK_NOLOAD) &&
!(task_state & TASK_FROZEN);
if (unlikely(is_special_task_state(task_state)))
flags |= DEQUEUE_SPECIAL;
/*
* __schedule() ttwu()
* prev_state = prev->state; if (p->on_rq && ...)
* if (prev_state) goto out;
* p->on_rq = 0; smp_acquire__after_ctrl_dep();
* p->state = TASK_WAKING
*
* Where __schedule() and ttwu() have matching control dependencies.
*
* After this, schedule() must not care about p->state any more.
*/
block_task(rq, p, flags);
return true;
}
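try_to_block_task() factors the deactivate-on-sleep logic out of __schedule(): it returns true when the task was blocked and dequeued, and false when a pending signal forced it back to TASK_RUNNING. Its only call site, condensed from the __schedule() hunk that follows:

    } else if (!preempt && prev_state) {
            block = try_to_block_task(rq, prev, prev_state);
            switch_count = &prev->nvcsw;
    }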
/* /*
* __schedule() is the main scheduler function. * __schedule() is the main scheduler function.
* *
@ -6614,37 +6702,12 @@ static void __sched notrace __schedule(int sched_mode)
goto picked; goto picked;
} }
} else if (!preempt && prev_state) { } else if (!preempt && prev_state) {
if (signal_pending_state(prev_state, prev)) { block = try_to_block_task(rq, prev, prev_state);
WRITE_ONCE(prev->__state, TASK_RUNNING);
} else {
int flags = DEQUEUE_NOCLOCK;
prev->sched_contributes_to_load =
(prev_state & TASK_UNINTERRUPTIBLE) &&
!(prev_state & TASK_NOLOAD) &&
!(prev_state & TASK_FROZEN);
if (unlikely(is_special_task_state(prev_state)))
flags |= DEQUEUE_SPECIAL;
/*
* __schedule() ttwu()
* prev_state = prev->state; if (p->on_rq && ...)
* if (prev_state) goto out;
* p->on_rq = 0; smp_acquire__after_ctrl_dep();
* p->state = TASK_WAKING
*
* Where __schedule() and ttwu() have matching control dependencies.
*
* After this, schedule() must not care about p->state any more.
*/
block_task(rq, prev, flags);
block = true;
}
switch_count = &prev->nvcsw; switch_count = &prev->nvcsw;
} }
next = pick_next_task(rq, prev, &rf); next = pick_next_task(rq, prev, &rf);
rq_set_donor(rq, next);
picked: picked:
clear_tsk_need_resched(prev); clear_tsk_need_resched(prev);
clear_preempt_need_resched(); clear_preempt_need_resched();
@ -7151,7 +7214,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK); dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
queued = task_on_rq_queued(p); queued = task_on_rq_queued(p);
running = task_current(rq, p); running = task_current_donor(rq, p);
if (queued) if (queued)
dequeue_task(rq, p, queue_flag); dequeue_task(rq, p, queue_flag);
if (running) if (running)
@ -7351,6 +7414,7 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
* preempt_schedule <- NOP * preempt_schedule <- NOP
* preempt_schedule_notrace <- NOP * preempt_schedule_notrace <- NOP
* irqentry_exit_cond_resched <- NOP * irqentry_exit_cond_resched <- NOP
* dynamic_preempt_lazy <- false
* *
* VOLUNTARY: * VOLUNTARY:
* cond_resched <- __cond_resched * cond_resched <- __cond_resched
@ -7358,6 +7422,7 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
* preempt_schedule <- NOP * preempt_schedule <- NOP
* preempt_schedule_notrace <- NOP * preempt_schedule_notrace <- NOP
* irqentry_exit_cond_resched <- NOP * irqentry_exit_cond_resched <- NOP
* dynamic_preempt_lazy <- false
* *
* FULL: * FULL:
* cond_resched <- RET0 * cond_resched <- RET0
@ -7365,6 +7430,15 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
* preempt_schedule <- preempt_schedule * preempt_schedule <- preempt_schedule
* preempt_schedule_notrace <- preempt_schedule_notrace * preempt_schedule_notrace <- preempt_schedule_notrace
* irqentry_exit_cond_resched <- irqentry_exit_cond_resched * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
* dynamic_preempt_lazy <- false
*
* LAZY:
* cond_resched <- RET0
* might_resched <- RET0
* preempt_schedule <- preempt_schedule
* preempt_schedule_notrace <- preempt_schedule_notrace
* irqentry_exit_cond_resched <- irqentry_exit_cond_resched
* dynamic_preempt_lazy <- true
*/ */
enum { enum {
@ -7372,30 +7446,41 @@ enum {
preempt_dynamic_none, preempt_dynamic_none,
preempt_dynamic_voluntary, preempt_dynamic_voluntary,
preempt_dynamic_full, preempt_dynamic_full,
preempt_dynamic_lazy,
}; };
int preempt_dynamic_mode = preempt_dynamic_undefined; int preempt_dynamic_mode = preempt_dynamic_undefined;
int sched_dynamic_mode(const char *str) int sched_dynamic_mode(const char *str)
{ {
#ifndef CONFIG_PREEMPT_RT
if (!strcmp(str, "none")) if (!strcmp(str, "none"))
return preempt_dynamic_none; return preempt_dynamic_none;
if (!strcmp(str, "voluntary")) if (!strcmp(str, "voluntary"))
return preempt_dynamic_voluntary; return preempt_dynamic_voluntary;
#endif
if (!strcmp(str, "full")) if (!strcmp(str, "full"))
return preempt_dynamic_full; return preempt_dynamic_full;
#ifdef CONFIG_ARCH_HAS_PREEMPT_LAZY
if (!strcmp(str, "lazy"))
return preempt_dynamic_lazy;
#endif
return -EINVAL; return -EINVAL;
} }
#define preempt_dynamic_key_enable(f) static_key_enable(&sk_dynamic_##f.key)
#define preempt_dynamic_key_disable(f) static_key_disable(&sk_dynamic_##f.key)
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL) #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
#define preempt_dynamic_enable(f) static_call_update(f, f##_dynamic_enabled) #define preempt_dynamic_enable(f) static_call_update(f, f##_dynamic_enabled)
#define preempt_dynamic_disable(f) static_call_update(f, f##_dynamic_disabled) #define preempt_dynamic_disable(f) static_call_update(f, f##_dynamic_disabled)
#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY) #elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
#define preempt_dynamic_enable(f) static_key_enable(&sk_dynamic_##f.key) #define preempt_dynamic_enable(f) preempt_dynamic_key_enable(f)
#define preempt_dynamic_disable(f) static_key_disable(&sk_dynamic_##f.key) #define preempt_dynamic_disable(f) preempt_dynamic_key_disable(f)
#else #else
#error "Unsupported PREEMPT_DYNAMIC mechanism" #error "Unsupported PREEMPT_DYNAMIC mechanism"
#endif #endif
@ -7415,6 +7500,7 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_enable(preempt_schedule); preempt_dynamic_enable(preempt_schedule);
preempt_dynamic_enable(preempt_schedule_notrace); preempt_dynamic_enable(preempt_schedule_notrace);
preempt_dynamic_enable(irqentry_exit_cond_resched); preempt_dynamic_enable(irqentry_exit_cond_resched);
preempt_dynamic_key_disable(preempt_lazy);
switch (mode) { switch (mode) {
case preempt_dynamic_none: case preempt_dynamic_none:
@ -7424,6 +7510,7 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_disable(preempt_schedule); preempt_dynamic_disable(preempt_schedule);
preempt_dynamic_disable(preempt_schedule_notrace); preempt_dynamic_disable(preempt_schedule_notrace);
preempt_dynamic_disable(irqentry_exit_cond_resched); preempt_dynamic_disable(irqentry_exit_cond_resched);
preempt_dynamic_key_disable(preempt_lazy);
if (mode != preempt_dynamic_mode) if (mode != preempt_dynamic_mode)
pr_info("Dynamic Preempt: none\n"); pr_info("Dynamic Preempt: none\n");
break; break;
@ -7435,6 +7522,7 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_disable(preempt_schedule); preempt_dynamic_disable(preempt_schedule);
preempt_dynamic_disable(preempt_schedule_notrace); preempt_dynamic_disable(preempt_schedule_notrace);
preempt_dynamic_disable(irqentry_exit_cond_resched); preempt_dynamic_disable(irqentry_exit_cond_resched);
preempt_dynamic_key_disable(preempt_lazy);
if (mode != preempt_dynamic_mode) if (mode != preempt_dynamic_mode)
pr_info("Dynamic Preempt: voluntary\n"); pr_info("Dynamic Preempt: voluntary\n");
break; break;
@ -7446,9 +7534,22 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_enable(preempt_schedule); preempt_dynamic_enable(preempt_schedule);
preempt_dynamic_enable(preempt_schedule_notrace); preempt_dynamic_enable(preempt_schedule_notrace);
preempt_dynamic_enable(irqentry_exit_cond_resched); preempt_dynamic_enable(irqentry_exit_cond_resched);
preempt_dynamic_key_disable(preempt_lazy);
if (mode != preempt_dynamic_mode) if (mode != preempt_dynamic_mode)
pr_info("Dynamic Preempt: full\n"); pr_info("Dynamic Preempt: full\n");
break; break;
case preempt_dynamic_lazy:
if (!klp_override)
preempt_dynamic_disable(cond_resched);
preempt_dynamic_disable(might_resched);
preempt_dynamic_enable(preempt_schedule);
preempt_dynamic_enable(preempt_schedule_notrace);
preempt_dynamic_enable(irqentry_exit_cond_resched);
preempt_dynamic_key_enable(preempt_lazy);
if (mode != preempt_dynamic_mode)
pr_info("Dynamic Preempt: lazy\n");
break;
} }
preempt_dynamic_mode = mode; preempt_dynamic_mode = mode;
@ -7511,6 +7612,8 @@ static void __init preempt_dynamic_init(void)
sched_dynamic_update(preempt_dynamic_none); sched_dynamic_update(preempt_dynamic_none);
} else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) { } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
sched_dynamic_update(preempt_dynamic_voluntary); sched_dynamic_update(preempt_dynamic_voluntary);
} else if (IS_ENABLED(CONFIG_PREEMPT_LAZY)) {
sched_dynamic_update(preempt_dynamic_lazy);
} else { } else {
/* Default static call setting, nothing to do */ /* Default static call setting, nothing to do */
WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT)); WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
@ -7531,6 +7634,7 @@ static void __init preempt_dynamic_init(void)
PREEMPT_MODEL_ACCESSOR(none); PREEMPT_MODEL_ACCESSOR(none);
PREEMPT_MODEL_ACCESSOR(voluntary); PREEMPT_MODEL_ACCESSOR(voluntary);
PREEMPT_MODEL_ACCESSOR(full); PREEMPT_MODEL_ACCESSOR(full);
PREEMPT_MODEL_ACCESSOR(lazy);
#else /* !CONFIG_PREEMPT_DYNAMIC: */ #else /* !CONFIG_PREEMPT_DYNAMIC: */
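Assuming PREEMPT_MODEL_ACCESSOR(lazy) expands the same way as the existing none/voluntary/full accessors, code can now test for the model with preempt_model_lazy(); on PREEMPT_DYNAMIC kernels the mode is also switchable at runtime through the debugfs file served by sched_dynamic_show()/sched_dynamic_write() at the end of this series (typically /sys/kernel/debug/sched/preempt), which accepts "lazy" when CONFIG_ARCH_HAS_PREEMPT_LAZY is set. A hedged usage sketch:

    if (preempt_model_lazy())
            pr_debug("fair-class preemption requests are deferred to the tick\n");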
@ -7683,8 +7787,6 @@ void __init init_idle(struct task_struct *idle, int cpu)
struct rq *rq = cpu_rq(cpu); struct rq *rq = cpu_rq(cpu);
unsigned long flags; unsigned long flags;
__sched_fork(0, idle);
raw_spin_lock_irqsave(&idle->pi_lock, flags); raw_spin_lock_irqsave(&idle->pi_lock, flags);
raw_spin_rq_lock(rq); raw_spin_rq_lock(rq);
@ -7699,10 +7801,8 @@ void __init init_idle(struct task_struct *idle, int cpu)
#ifdef CONFIG_SMP #ifdef CONFIG_SMP
/* /*
* It's possible that init_idle() gets called multiple times on a task, * No validation and serialization required at boot time and for
* in that case do_set_cpus_allowed() will not do the right thing. * setting up the idle tasks of not yet online CPUs.
*
* And since this is boot we can forgo the serialization.
*/ */
set_cpus_allowed_common(idle, &ac); set_cpus_allowed_common(idle, &ac);
#endif #endif
@ -7721,6 +7821,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
rcu_read_unlock(); rcu_read_unlock();
rq->idle = idle; rq->idle = idle;
rq_set_donor(rq, idle);
rcu_assign_pointer(rq->curr, idle); rcu_assign_pointer(rq->curr, idle);
idle->on_rq = TASK_ON_RQ_QUEUED; idle->on_rq = TASK_ON_RQ_QUEUED;
#ifdef CONFIG_SMP #ifdef CONFIG_SMP
@ -7810,7 +7911,7 @@ void sched_setnuma(struct task_struct *p, int nid)
rq = task_rq_lock(p, &rf); rq = task_rq_lock(p, &rf);
queued = task_on_rq_queued(p); queued = task_on_rq_queued(p);
running = task_current(rq, p); running = task_current_donor(rq, p);
if (queued) if (queued)
dequeue_task(rq, p, DEQUEUE_SAVE); dequeue_task(rq, p, DEQUEUE_SAVE);
@ -8546,6 +8647,7 @@ void __init sched_init(void)
* but because we are the idle thread, we just pick up running again * but because we are the idle thread, we just pick up running again
* when this runqueue becomes "idle". * when this runqueue becomes "idle".
*/ */
__sched_fork(0, current);
init_idle(current, smp_processor_id()); init_idle(current, smp_processor_id());
calc_load_update = jiffies + LOAD_FREQ; calc_load_update = jiffies + LOAD_FREQ;
@ -8960,7 +9062,7 @@ void sched_move_task(struct task_struct *tsk)
update_rq_clock(rq); update_rq_clock(rq);
running = task_current(rq, tsk); running = task_current_donor(rq, tsk);
queued = task_on_rq_queued(tsk); queued = task_on_rq_queued(tsk);
if (queued) if (queued)
@ -10253,6 +10355,7 @@ int __sched_mm_cid_migrate_from_try_steal_cid(struct rq *src_rq,
*/ */
if (!try_cmpxchg(&src_pcpu_cid->cid, &lazy_cid, MM_CID_UNSET)) if (!try_cmpxchg(&src_pcpu_cid->cid, &lazy_cid, MM_CID_UNSET))
return -1; return -1;
WRITE_ONCE(src_pcpu_cid->recent_cid, MM_CID_UNSET);
return src_cid; return src_cid;
} }
@ -10265,7 +10368,8 @@ void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
{ {
struct mm_cid *src_pcpu_cid, *dst_pcpu_cid; struct mm_cid *src_pcpu_cid, *dst_pcpu_cid;
struct mm_struct *mm = t->mm; struct mm_struct *mm = t->mm;
int src_cid, dst_cid, src_cpu; int src_cid, src_cpu;
bool dst_cid_is_set;
struct rq *src_rq; struct rq *src_rq;
lockdep_assert_rq_held(dst_rq); lockdep_assert_rq_held(dst_rq);
@ -10282,9 +10386,9 @@ void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
* allocation closest to 0 in cases where few threads migrate around * allocation closest to 0 in cases where few threads migrate around
* many CPUs. * many CPUs.
* *
* If destination cid is already set, we may have to just clear * If destination cid or recent cid is already set, we may have
* the src cid to ensure compactness in frequent migrations * to just clear the src cid to ensure compactness in frequent
* scenarios. * migrations scenarios.
* *
* It is not useful to clear the src cid when the number of threads is * It is not useful to clear the src cid when the number of threads is
* greater or equal to the number of allowed CPUs, because user-space * greater or equal to the number of allowed CPUs, because user-space
@ -10292,9 +10396,9 @@ void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
* allowed CPUs. * allowed CPUs.
*/ */
dst_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(dst_rq)); dst_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(dst_rq));
dst_cid = READ_ONCE(dst_pcpu_cid->cid); dst_cid_is_set = !mm_cid_is_unset(READ_ONCE(dst_pcpu_cid->cid)) ||
if (!mm_cid_is_unset(dst_cid) && !mm_cid_is_unset(READ_ONCE(dst_pcpu_cid->recent_cid));
atomic_read(&mm->mm_users) >= t->nr_cpus_allowed) if (dst_cid_is_set && atomic_read(&mm->mm_users) >= READ_ONCE(mm->nr_cpus_allowed))
return; return;
src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, src_cpu); src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, src_cpu);
src_rq = cpu_rq(src_cpu); src_rq = cpu_rq(src_cpu);
@ -10305,13 +10409,14 @@ void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
src_cid); src_cid);
if (src_cid == -1) if (src_cid == -1)
return; return;
if (!mm_cid_is_unset(dst_cid)) { if (dst_cid_is_set) {
__mm_cid_put(mm, src_cid); __mm_cid_put(mm, src_cid);
return; return;
} }
/* Move src_cid to dst cpu. */ /* Move src_cid to dst cpu. */
mm_cid_snapshot_time(dst_rq, mm); mm_cid_snapshot_time(dst_rq, mm);
WRITE_ONCE(dst_pcpu_cid->cid, src_cid); WRITE_ONCE(dst_pcpu_cid->cid, src_cid);
WRITE_ONCE(dst_pcpu_cid->recent_cid, src_cid);
} }
static void sched_mm_cid_remote_clear(struct mm_struct *mm, struct mm_cid *pcpu_cid, static void sched_mm_cid_remote_clear(struct mm_struct *mm, struct mm_cid *pcpu_cid,
@ -10550,7 +10655,7 @@ void sched_mm_cid_after_execve(struct task_struct *t)
* Matches barrier in sched_mm_cid_remote_clear_old(). * Matches barrier in sched_mm_cid_remote_clear_old().
*/ */
smp_mb(); smp_mb();
t->last_mm_cid = t->mm_cid = mm_cid_get(rq, mm); t->last_mm_cid = t->mm_cid = mm_cid_get(rq, t, mm);
} }
rseq_set_notify_resume(t); rseq_set_notify_resume(t);
} }


@ -1339,7 +1339,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
#endif #endif
enqueue_task_dl(rq, p, ENQUEUE_REPLENISH); enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
if (dl_task(rq->curr)) if (dl_task(rq->donor))
wakeup_preempt_dl(rq, p, 0); wakeup_preempt_dl(rq, p, 0);
else else
resched_curr(rq); resched_curr(rq);
@ -1736,11 +1736,11 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
*/ */
static void update_curr_dl(struct rq *rq) static void update_curr_dl(struct rq *rq)
{ {
struct task_struct *curr = rq->curr; struct task_struct *donor = rq->donor;
struct sched_dl_entity *dl_se = &curr->dl; struct sched_dl_entity *dl_se = &donor->dl;
s64 delta_exec; s64 delta_exec;
if (!dl_task(curr) || !on_dl_rq(dl_se)) if (!dl_task(donor) || !on_dl_rq(dl_se))
return; return;
/* /*
@ -2213,7 +2213,7 @@ static int find_later_rq(struct task_struct *task);
static int static int
select_task_rq_dl(struct task_struct *p, int cpu, int flags) select_task_rq_dl(struct task_struct *p, int cpu, int flags)
{ {
struct task_struct *curr; struct task_struct *curr, *donor;
bool select_rq; bool select_rq;
struct rq *rq; struct rq *rq;
@ -2224,6 +2224,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)
rcu_read_lock(); rcu_read_lock();
curr = READ_ONCE(rq->curr); /* unlocked access */ curr = READ_ONCE(rq->curr); /* unlocked access */
donor = READ_ONCE(rq->donor);
/* /*
* If we are dealing with a -deadline task, we must * If we are dealing with a -deadline task, we must
@ -2234,9 +2235,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int flags)
* other hand, if it has a shorter deadline, we * other hand, if it has a shorter deadline, we
* try to make it stay here, it might be important. * try to make it stay here, it might be important.
*/ */
select_rq = unlikely(dl_task(curr)) && select_rq = unlikely(dl_task(donor)) &&
(curr->nr_cpus_allowed < 2 || (curr->nr_cpus_allowed < 2 ||
!dl_entity_preempt(&p->dl, &curr->dl)) && !dl_entity_preempt(&p->dl, &donor->dl)) &&
p->nr_cpus_allowed > 1; p->nr_cpus_allowed > 1;
/* /*
@ -2299,7 +2300,7 @@ static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p)
* let's hope p can move out. * let's hope p can move out.
*/ */
if (rq->curr->nr_cpus_allowed == 1 || if (rq->curr->nr_cpus_allowed == 1 ||
!cpudl_find(&rq->rd->cpudl, rq->curr, NULL)) !cpudl_find(&rq->rd->cpudl, rq->donor, NULL))
return; return;
/* /*
@ -2338,7 +2339,7 @@ static int balance_dl(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p, static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
int flags) int flags)
{ {
if (dl_entity_preempt(&p->dl, &rq->curr->dl)) { if (dl_entity_preempt(&p->dl, &rq->donor->dl)) {
resched_curr(rq); resched_curr(rq);
return; return;
} }
@ -2348,7 +2349,7 @@ static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
* In the unlikely case current and p have the same deadline * In the unlikely case current and p have the same deadline
* let us try to decide what's the best thing to do... * let us try to decide what's the best thing to do...
*/ */
if ((p->dl.deadline == rq->curr->dl.deadline) && if ((p->dl.deadline == rq->donor->dl.deadline) &&
!test_tsk_need_resched(rq->curr)) !test_tsk_need_resched(rq->curr))
check_preempt_equal_dl(rq, p); check_preempt_equal_dl(rq, p);
#endif /* CONFIG_SMP */ #endif /* CONFIG_SMP */
@ -2380,7 +2381,7 @@ static void set_next_task_dl(struct rq *rq, struct task_struct *p, bool first)
if (!first) if (!first)
return; return;
if (rq->curr->sched_class != &dl_sched_class) if (rq->donor->sched_class != &dl_sched_class)
update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0); update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
deadline_queue_push_tasks(rq); deadline_queue_push_tasks(rq);
@ -2487,14 +2488,6 @@ static void task_fork_dl(struct task_struct *p)
/* Only try algorithms three times */ /* Only try algorithms three times */
#define DL_MAX_TRIES 3 #define DL_MAX_TRIES 3
static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu)
{
if (!task_on_cpu(rq, p) &&
cpumask_test_cpu(cpu, &p->cpus_mask))
return 1;
return 0;
}
/* /*
* Return the earliest pushable rq's task, which is suitable to be executed * Return the earliest pushable rq's task, which is suitable to be executed
* on the CPU, NULL otherwise: * on the CPU, NULL otherwise:
@ -2513,7 +2506,7 @@ next_node:
if (next_node) { if (next_node) {
p = __node_2_pdl(next_node); p = __node_2_pdl(next_node);
if (pick_dl_task(rq, p, cpu)) if (task_is_pushable(rq, p, cpu))
return p; return p;
next_node = rb_next(next_node); next_node = rb_next(next_node);
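The removed pick_dl_task() spells out what the consolidated task_is_pushable() helper is expected to check before a deadline task is pushed to another CPU: it must not be running on this CPU and the candidate CPU must be in its affinity mask. Restated as a sketch, under the assumption that it mirrors task_is_pushable(), which is defined elsewhere in the series:

    static int dl_task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
    {
            return !task_on_cpu(rq, p) &&
                   cpumask_test_cpu(cpu, &p->cpus_mask);
    }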
@ -2707,8 +2700,8 @@ retry:
* can move away, it makes sense to just reschedule * can move away, it makes sense to just reschedule
* without going further in pushing next_task. * without going further in pushing next_task.
*/ */
if (dl_task(rq->curr) && if (dl_task(rq->donor) &&
dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) && dl_time_before(next_task->dl.deadline, rq->donor->dl.deadline) &&
rq->curr->nr_cpus_allowed > 1) { rq->curr->nr_cpus_allowed > 1) {
resched_curr(rq); resched_curr(rq);
return 0; return 0;
@ -2751,9 +2744,7 @@ retry:
goto retry; goto retry;
} }
deactivate_task(rq, next_task, 0); move_queued_task_locked(rq, later_rq, next_task);
set_task_cpu(next_task, later_rq->cpu);
activate_task(later_rq, next_task, 0);
ret = 1; ret = 1;
resched_curr(later_rq); resched_curr(later_rq);
@ -2833,15 +2824,13 @@ static void pull_dl_task(struct rq *this_rq)
* deadline than the current task of its runqueue. * deadline than the current task of its runqueue.
*/ */
if (dl_time_before(p->dl.deadline, if (dl_time_before(p->dl.deadline,
src_rq->curr->dl.deadline)) src_rq->donor->dl.deadline))
goto skip; goto skip;
if (is_migration_disabled(p)) { if (is_migration_disabled(p)) {
push_task = get_push_task(src_rq); push_task = get_push_task(src_rq);
} else { } else {
deactivate_task(src_rq, p, 0); move_queued_task_locked(src_rq, this_rq, p);
set_task_cpu(p, this_cpu);
activate_task(this_rq, p, 0);
dmin = p->dl.deadline; dmin = p->dl.deadline;
resched = true; resched = true;
} }
@ -2874,9 +2863,9 @@ static void task_woken_dl(struct rq *rq, struct task_struct *p)
if (!task_on_cpu(rq, p) && if (!task_on_cpu(rq, p) &&
!test_tsk_need_resched(rq->curr) && !test_tsk_need_resched(rq->curr) &&
p->nr_cpus_allowed > 1 && p->nr_cpus_allowed > 1 &&
dl_task(rq->curr) && dl_task(rq->donor) &&
(rq->curr->nr_cpus_allowed < 2 || (rq->curr->nr_cpus_allowed < 2 ||
!dl_entity_preempt(&p->dl, &rq->curr->dl))) { !dl_entity_preempt(&p->dl, &rq->donor->dl))) {
push_dl_tasks(rq); push_dl_tasks(rq);
} }
} }
@ -3051,12 +3040,12 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
return; return;
} }
if (rq->curr != p) { if (rq->donor != p) {
#ifdef CONFIG_SMP #ifdef CONFIG_SMP
if (p->nr_cpus_allowed > 1 && rq->dl.overloaded) if (p->nr_cpus_allowed > 1 && rq->dl.overloaded)
deadline_queue_push_tasks(rq); deadline_queue_push_tasks(rq);
#endif #endif
if (dl_task(rq->curr)) if (dl_task(rq->donor))
wakeup_preempt_dl(rq, p, 0); wakeup_preempt_dl(rq, p, 0);
else else
resched_curr(rq); resched_curr(rq);
@ -3085,7 +3074,7 @@ static void prio_changed_dl(struct rq *rq, struct task_struct *p,
if (!rq->dl.overloaded) if (!rq->dl.overloaded)
deadline_queue_pull_task(rq); deadline_queue_pull_task(rq);
if (task_current(rq, p)) { if (task_current_donor(rq, p)) {
/* /*
* If we now have a earlier deadline task than p, * If we now have a earlier deadline task than p,
* then reschedule, provided p is still on this * then reschedule, provided p is still on this


@@ -245,11 +245,12 @@ static ssize_t sched_dynamic_write(struct file *filp, const char __user *ubuf,
 static int sched_dynamic_show(struct seq_file *m, void *v)
 {
 	static const char * preempt_modes[] = {
-		"none", "voluntary", "full"
+		"none", "voluntary", "full", "lazy",
 	};
-	int i;
+	int j = ARRAY_SIZE(preempt_modes) - !IS_ENABLED(CONFIG_ARCH_HAS_PREEMPT_LAZY);
+	int i = IS_ENABLED(CONFIG_PREEMPT_RT) * 2;
-	for (i = 0; i < ARRAY_SIZE(preempt_modes); i++) {
+	for (; i < j; i++) {
 		if (preempt_dynamic_mode == i)
 			seq_puts(m, "(");
 		seq_puts(m, preempt_modes[i]);
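Illustration, not part of the merge diff: the rewritten loop bounds fold two config checks into index arithmetic. On PREEMPT_RT the "none" and "voluntary" modes are never offered, so the index starts at 2, and "lazy" is listed only when the architecture selects CONFIG_ARCH_HAS_PREEMPT_LAZY, so the upper bound drops by one otherwise. A minimal user-space sketch of the same arithmetic, with IS_ENABLED() replaced by plain flags:

	/* Standalone sketch, not kernel code. */
	#include <stdio.h>

	#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

	static void show_modes(int is_preempt_rt, int arch_has_preempt_lazy)
	{
		static const char *preempt_modes[] = {
			"none", "voluntary", "full", "lazy",
		};
		int j = ARRAY_SIZE(preempt_modes) - !arch_has_preempt_lazy;
		int i = is_preempt_rt * 2;	/* RT never offers none/voluntary */

		for (; i < j; i++)
			printf("%s ", preempt_modes[i]);
		printf("\n");
	}

	int main(void)
	{
		show_modes(0, 1);	/* none voluntary full lazy */
		show_modes(1, 1);	/* full lazy */
		show_modes(0, 0);	/* none voluntary full */
		return 0;
	}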


@@ -3567,12 +3567,7 @@ static void scx_ops_exit_task(struct task_struct *p)
 void init_scx_entity(struct sched_ext_entity *scx)
 {
-	/*
-	 * init_idle() calls this function again after fork sequence is
-	 * complete. Don't touch ->tasks_node as it's already linked.
-	 */
-	memset(scx, 0, offsetof(struct sched_ext_entity, tasks_node));
+	memset(scx, 0, sizeof(*scx));
 	INIT_LIST_HEAD(&scx->dsq_list.node);
 	RB_CLEAR_NODE(&scx->dsq_priq);
 	scx->sticky_cpu = -1;


@@ -1200,12 +1200,12 @@ static inline bool do_preempt_short(struct cfs_rq *cfs_rq,
  */
 s64 update_curr_common(struct rq *rq)
 {
-	struct task_struct *curr = rq->curr;
+	struct task_struct *donor = rq->donor;
 	s64 delta_exec;
-	delta_exec = update_curr_se(rq, &curr->se);
+	delta_exec = update_curr_se(rq, &donor->se);
 	if (likely(delta_exec > 0))
-		update_curr_task(curr, delta_exec);
+		update_curr_task(donor, delta_exec);
 	return delta_exec;
 }
@@ -1251,14 +1251,14 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		return;
 	if (resched || did_preempt_short(cfs_rq, curr)) {
-		resched_curr(rq);
+		resched_curr_lazy(rq);
 		clear_buddies(cfs_rq, curr);
 	}
 }
 static void update_curr_fair(struct rq *rq)
 {
-	update_curr(cfs_rq_of(&rq->curr->se));
+	update_curr(cfs_rq_of(&rq->donor->se));
 }
 static inline void
@@ -5280,7 +5280,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 *
 	 * EEVDF: placement strategy #1 / #2
 	 */
-	if (sched_feat(PLACE_LAG) && cfs_rq->nr_running) {
+	if (sched_feat(PLACE_LAG) && cfs_rq->nr_running && se->vlag) {
 		struct sched_entity *curr = cfs_rq->curr;
 		unsigned long load;
@@ -5678,15 +5678,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	 * validating it and just reschedule.
 	 */
 	if (queued) {
-		resched_curr(rq_of(cfs_rq));
+		resched_curr_lazy(rq_of(cfs_rq));
 		return;
 	}
-	/*
-	 * don't let the period tick interfere with the hrtick preemption
-	 */
-	if (!sched_feat(DOUBLE_TICK) &&
-	    hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
-		return;
 #endif
 }
@@ -6822,7 +6816,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 	s64 delta = slice - ran;
 	if (delta < 0) {
-		if (task_current(rq, p))
+		if (task_current_donor(rq, p))
 			resched_curr(rq);
 		return;
 	}
@@ -6837,12 +6831,12 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
  */
 static void hrtick_update(struct rq *rq)
 {
-	struct task_struct *curr = rq->curr;
+	struct task_struct *donor = rq->donor;
-	if (!hrtick_enabled_fair(rq) || curr->sched_class != &fair_sched_class)
+	if (!hrtick_enabled_fair(rq) || donor->sched_class != &fair_sched_class)
 		return;
-	hrtick_start_fair(rq, curr);
+	hrtick_start_fair(rq, donor);
 }
 #else /* !CONFIG_SCHED_HRTICK */
 static inline void
@@ -8763,9 +8757,9 @@ static void set_next_buddy(struct sched_entity *se)
  */
 static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int wake_flags)
 {
-	struct task_struct *curr = rq->curr;
-	struct sched_entity *se = &curr->se, *pse = &p->se;
-	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
+	struct task_struct *donor = rq->donor;
+	struct sched_entity *se = &donor->se, *pse = &p->se;
+	struct cfs_rq *cfs_rq = task_cfs_rq(donor);
 	int cse_is_idle, pse_is_idle;
 	if (unlikely(se == pse))
@@ -8794,7 +8788,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 	 * prevents us from potentially nominating it as a false LAST_BUDDY
 	 * below.
 	 */
-	if (test_tsk_need_resched(curr))
+	if (test_tsk_need_resched(rq->curr))
 		return;
 	if (!sched_feat(WAKEUP_PREEMPTION))
@@ -8842,7 +8836,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
 		return;
 preempt:
-	resched_curr(rq);
+	resched_curr_lazy(rq);
 }
 static struct task_struct *pick_task_fair(struct rq *rq)
@@ -13093,7 +13087,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
 	 * our priority decreased, or if we are not currently running on
 	 * this runqueue and our priority is higher than the current's
 	 */
-	if (task_current(rq, p)) {
+	if (task_current_donor(rq, p)) {
 		if (p->prio > oldprio)
 			resched_curr(rq);
 	} else
@@ -13200,7 +13194,7 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
 	 * kick off the schedule if running, otherwise just see
 	 * if we can still preempt the current task.
 	 */
-	if (task_current(rq, p))
+	if (task_current_donor(rq, p))
 		resched_curr(rq);
 	else
 		wakeup_preempt(rq, p, 0);


@@ -19,7 +19,7 @@ SCHED_FEAT(PLACE_REL_DEADLINE, true)
  */
 SCHED_FEAT(RUN_TO_PARITY, true)
 /*
- * Allow wakeup of tasks with a shorter slice to cancel RESPECT_SLICE for
+ * Allow wakeup of tasks with a shorter slice to cancel RUN_TO_PARITY for
  * current.
  */
 SCHED_FEAT(PREEMPT_SHORT, true)
@@ -56,7 +56,6 @@ SCHED_FEAT(WAKEUP_PREEMPTION, true)
 SCHED_FEAT(HRTICK, false)
 SCHED_FEAT(HRTICK_DL, false)
-SCHED_FEAT(DOUBLE_TICK, false)
 /*
  * Decrement CPU capacity based on time not spent running tasks


@@ -271,7 +271,6 @@ static void do_idle(void)
 	tick_nohz_idle_enter();
 	while (!need_resched()) {
-		rmb();
 		/*
 		 * Interrupts shouldn't be re-enabled from that point on until


@@ -476,7 +476,7 @@ int update_irq_load_avg(struct rq *rq, u64 running)
 bool update_other_load_avgs(struct rq *rq)
 {
 	u64 now = rq_clock_pelt(rq);
-	const struct sched_class *curr_class = rq->curr->sched_class;
+	const struct sched_class *curr_class = rq->donor->sched_class;
 	unsigned long hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
 	lockdep_assert_rq_held(rq);


@@ -528,7 +528,7 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se, unsigned int flags)
 static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
 {
-	struct task_struct *curr = rq_of_rt_rq(rt_rq)->curr;
+	struct task_struct *donor = rq_of_rt_rq(rt_rq)->donor;
 	struct rq *rq = rq_of_rt_rq(rt_rq);
 	struct sched_rt_entity *rt_se;
@@ -542,7 +542,7 @@ static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
 		else if (!on_rt_rq(rt_se))
 			enqueue_rt_entity(rt_se, 0);
-		if (rt_rq->highest_prio.curr < curr->prio)
+		if (rt_rq->highest_prio.curr < donor->prio)
 			resched_curr(rq);
 	}
 }
@@ -988,10 +988,10 @@ static inline int rt_se_prio(struct sched_rt_entity *rt_se)
  */
 static void update_curr_rt(struct rq *rq)
 {
-	struct task_struct *curr = rq->curr;
+	struct task_struct *donor = rq->donor;
 	s64 delta_exec;
-	if (curr->sched_class != &rt_sched_class)
+	if (donor->sched_class != &rt_sched_class)
 		return;
 	delta_exec = update_curr_common(rq);
@@ -999,7 +999,7 @@ static void update_curr_rt(struct rq *rq)
 		return;
 #ifdef CONFIG_RT_GROUP_SCHED
-	struct sched_rt_entity *rt_se = &curr->rt;
+	struct sched_rt_entity *rt_se = &donor->rt;
 	if (!rt_bandwidth_enabled())
 		return;
@@ -1535,7 +1535,7 @@ static int find_lowest_rq(struct task_struct *task);
 static int
 select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 {
-	struct task_struct *curr;
+	struct task_struct *curr, *donor;
 	struct rq *rq;
 	bool test;
@@ -1547,6 +1547,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 	rcu_read_lock();
 	curr = READ_ONCE(rq->curr); /* unlocked access */
+	donor = READ_ONCE(rq->donor);
 	/*
 	 * If the current task on @p's runqueue is an RT task, then
@@ -1575,8 +1576,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 	 * systems like big.LITTLE.
 	 */
 	test = curr &&
-	       unlikely(rt_task(curr)) &&
-	       (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
+	       unlikely(rt_task(donor)) &&
+	       (curr->nr_cpus_allowed < 2 || donor->prio <= p->prio);
 	if (test || !rt_task_fits_capacity(p, cpu)) {
 		int target = find_lowest_rq(p);
@@ -1606,12 +1607,8 @@ out:
 static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
 {
-	/*
-	 * Current can't be migrated, useless to reschedule,
-	 * let's hope p can move out.
-	 */
 	if (rq->curr->nr_cpus_allowed == 1 ||
-	    !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
+	    !cpupri_find(&rq->rd->cpupri, rq->donor, NULL))
 		return;
 	/*
@@ -1654,7 +1651,9 @@ static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
  */
 static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
 {
-	if (p->prio < rq->curr->prio) {
+	struct task_struct *donor = rq->donor;
+	if (p->prio < donor->prio) {
 		resched_curr(rq);
 		return;
 	}
@@ -1672,7 +1671,7 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
 	 * to move current somewhere else, making room for our non-migratable
 	 * task.
 	 */
-	if (p->prio == rq->curr->prio && !test_tsk_need_resched(rq->curr))
+	if (p->prio == donor->prio && !test_tsk_need_resched(rq->curr))
 		check_preempt_equal_prio(rq, p);
 #endif
 }
@@ -1697,7 +1696,7 @@ static inline void set_next_task_rt(struct rq *rq, struct task_struct *p, bool f
 	 * utilization. We only care of the case where we start to schedule a
 	 * rt task
 	 */
-	if (rq->curr->sched_class != &rt_sched_class)
+	if (rq->donor->sched_class != &rt_sched_class)
 		update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
 	rt_queue_push_tasks(rq);
@@ -1773,15 +1772,6 @@ static void put_prev_task_rt(struct rq *rq, struct task_struct *p, struct task_s
 /* Only try algorithms three times */
 #define RT_MAX_TRIES 3
-static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
-{
-	if (!task_on_cpu(rq, p) &&
-	    cpumask_test_cpu(cpu, &p->cpus_mask))
-		return 1;
-	return 0;
-}
 /*
  * Return the highest pushable rq's task, which is suitable to be executed
  * on the CPU, NULL otherwise
@@ -1795,7 +1785,7 @@ static struct task_struct *pick_highest_pushable_task(struct rq *rq, int cpu)
 		return NULL;
 	plist_for_each_entry(p, head, pushable_tasks) {
-		if (pick_rt_task(rq, p, cpu))
+		if (task_is_pushable(rq, p, cpu))
 			return p;
 	}
@@ -1968,6 +1958,7 @@ static struct task_struct *pick_next_pushable_task(struct rq *rq)
 	BUG_ON(rq->cpu != task_cpu(p));
 	BUG_ON(task_current(rq, p));
+	BUG_ON(task_current_donor(rq, p));
 	BUG_ON(p->nr_cpus_allowed <= 1);
 	BUG_ON(!task_on_rq_queued(p));
@@ -2000,7 +1991,7 @@ retry:
 	 * higher priority than current. If that's the case
 	 * just reschedule current.
 	 */
-	if (unlikely(next_task->prio < rq->curr->prio)) {
+	if (unlikely(next_task->prio < rq->donor->prio)) {
 		resched_curr(rq);
 		return 0;
 	}
@@ -2021,7 +2012,7 @@ retry:
 	 * Note that the stoppers are masqueraded as SCHED_FIFO
 	 * (cf. sched_set_stop_task()), so we can't rely on rt_task().
 	 */
-	if (rq->curr->sched_class != &rt_sched_class)
+	if (rq->donor->sched_class != &rt_sched_class)
 		return 0;
 	cpu = find_lowest_rq(rq->curr);
@@ -2088,9 +2079,7 @@ retry:
 		goto retry;
 	}
-	deactivate_task(rq, next_task, 0);
-	set_task_cpu(next_task, lowest_rq->cpu);
-	activate_task(lowest_rq, next_task, 0);
+	move_queued_task_locked(rq, lowest_rq, next_task);
 	resched_curr(lowest_rq);
 	ret = 1;
@@ -2355,15 +2344,13 @@ static void pull_rt_task(struct rq *this_rq)
 			 * p if it is lower in priority than the
 			 * current task on the run queue
 			 */
-			if (p->prio < src_rq->curr->prio)
+			if (p->prio < src_rq->donor->prio)
 				goto skip;
 			if (is_migration_disabled(p)) {
 				push_task = get_push_task(src_rq);
 			} else {
-				deactivate_task(src_rq, p, 0);
-				set_task_cpu(p, this_cpu);
-				activate_task(this_rq, p, 0);
+				move_queued_task_locked(src_rq, this_rq, p);
 				resched = true;
 			}
 			/*
@@ -2399,9 +2386,9 @@ static void task_woken_rt(struct rq *rq, struct task_struct *p)
 	bool need_to_push = !task_on_cpu(rq, p) &&
 			    !test_tsk_need_resched(rq->curr) &&
 			    p->nr_cpus_allowed > 1 &&
-			    (dl_task(rq->curr) || rt_task(rq->curr)) &&
+			    (dl_task(rq->donor) || rt_task(rq->donor)) &&
 			    (rq->curr->nr_cpus_allowed < 2 ||
-			     rq->curr->prio <= p->prio);
+			     rq->donor->prio <= p->prio);
 	if (need_to_push)
 		push_rt_tasks(rq);
@@ -2485,7 +2472,7 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
 		if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
 			rt_queue_push_tasks(rq);
 #endif /* CONFIG_SMP */
-		if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
+		if (p->prio < rq->donor->prio && cpu_online(cpu_of(rq)))
 			resched_curr(rq);
 	}
 }
@@ -2500,7 +2487,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
 	if (!task_on_rq_queued(p))
 		return;
-	if (task_current(rq, p)) {
+	if (task_current_donor(rq, p)) {
 #ifdef CONFIG_SMP
 		/*
 		 * If our priority decreases while running, we
@@ -2526,7 +2513,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
 		 * greater than the current running task
 		 * then reschedule.
 		 */
-		if (p->prio < rq->curr->prio)
+		if (p->prio < rq->donor->prio)
 			resched_curr(rq);
 	}
 }


@@ -1148,7 +1148,10 @@ struct rq {
 	 */
 	unsigned int		nr_uninterruptible;
-	struct task_struct __rcu	*curr;
+	union {
+		struct task_struct __rcu *donor;	/* Scheduler context */
+		struct task_struct __rcu *curr;		/* Execution context */
+	};
 	struct sched_dl_entity	*dl_server;
 	struct task_struct	*idle;
 	struct task_struct	*stop;
@@ -1345,6 +1348,11 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		raw_cpu_ptr(&runqueues)
+static inline void rq_set_donor(struct rq *rq, struct task_struct *t)
+{
+	/* Do nothing */
+}
 #ifdef CONFIG_SCHED_CORE
 static inline struct cpumask *sched_group_span(struct sched_group *sg);
@@ -2086,34 +2094,6 @@ static inline const struct cpumask *task_user_cpus(struct task_struct *p)
 #endif /* CONFIG_SMP */
-#include "stats.h"
-#if defined(CONFIG_SCHED_CORE) && defined(CONFIG_SCHEDSTATS)
-extern void __sched_core_account_forceidle(struct rq *rq);
-static inline void sched_core_account_forceidle(struct rq *rq)
-{
-	if (schedstat_enabled())
-		__sched_core_account_forceidle(rq);
-}
-extern void __sched_core_tick(struct rq *rq);
-static inline void sched_core_tick(struct rq *rq)
-{
-	if (sched_core_enabled(rq) && schedstat_enabled())
-		__sched_core_tick(rq);
-}
-#else /* !(CONFIG_SCHED_CORE && CONFIG_SCHEDSTATS): */
-static inline void sched_core_account_forceidle(struct rq *rq) { }
-static inline void sched_core_tick(struct rq *rq) { }
-#endif /* !(CONFIG_SCHED_CORE && CONFIG_SCHEDSTATS) */
 #ifdef CONFIG_CGROUP_SCHED
 /*
@@ -2261,11 +2241,25 @@ static inline u64 global_rt_runtime(void)
 	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
 }
+/*
+ * Is p the current execution context?
+ */
 static inline int task_current(struct rq *rq, struct task_struct *p)
 {
 	return rq->curr == p;
 }
+/*
+ * Is p the current scheduling context?
+ *
+ * Note that it might be the current execution context at the same time if
+ * rq->curr == rq->donor == p.
+ */
+static inline int task_current_donor(struct rq *rq, struct task_struct *p)
+{
+	return rq->donor == p;
+}
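Illustration, not part of the merge diff: with the union above and the no-op rq_set_donor(), rq->donor and rq->curr still name the same task, so task_current() and task_current_donor() cannot yet disagree. The point of the split is that scheduling-parameter comparisons now go through the donor ("scheduler context"), while "who is physically on the CPU" checks stay on rq->curr ("execution context"). A sketch of the intended call pattern, mirroring the wakeup_preempt_rt() hunk earlier in this diff (illustrative fragment only):

	/* Sketch only: how callers are expected to pick a context. */
	static inline bool sketch_should_preempt(struct rq *rq, struct task_struct *p)
	{
		if (p->prio < rq->donor->prio)		/* scheduler context */
			return true;
		return p->prio == rq->donor->prio &&
		       !test_tsk_need_resched(rq->curr);	/* execution context */
	}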
 static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
 {
 #ifdef CONFIG_SMP
@@ -2452,7 +2446,7 @@ struct sched_class {
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
-	WARN_ON_ONCE(rq->curr != prev);
+	WARN_ON_ONCE(rq->donor != prev);
 	prev->sched_class->put_prev_task(rq, prev, NULL);
 }
@@ -2616,7 +2610,7 @@ static inline cpumask_t *alloc_user_cpus_ptr(int node)
 static inline struct task_struct *get_push_task(struct rq *rq)
 {
-	struct task_struct *p = rq->curr;
+	struct task_struct *p = rq->donor;
 	lockdep_assert_rq_held(rq);
@@ -2696,6 +2690,7 @@ extern void init_sched_rt_class(void);
 extern void init_sched_fair_class(void);
 extern void resched_curr(struct rq *rq);
+extern void resched_curr_lazy(struct rq *rq);
 extern void resched_cpu(int cpu);
 extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime);
@@ -3200,6 +3195,34 @@ extern void nohz_run_idle_balance(int cpu);
 static inline void nohz_run_idle_balance(int cpu) { }
 #endif
+#include "stats.h"
+#if defined(CONFIG_SCHED_CORE) && defined(CONFIG_SCHEDSTATS)
+extern void __sched_core_account_forceidle(struct rq *rq);
+static inline void sched_core_account_forceidle(struct rq *rq)
+{
+	if (schedstat_enabled())
+		__sched_core_account_forceidle(rq);
+}
+extern void __sched_core_tick(struct rq *rq);
+static inline void sched_core_tick(struct rq *rq)
+{
+	if (sched_core_enabled(rq) && schedstat_enabled())
+		__sched_core_tick(rq);
+}
+#else /* !(CONFIG_SCHED_CORE && CONFIG_SCHEDSTATS): */
+static inline void sched_core_account_forceidle(struct rq *rq) { }
+static inline void sched_core_tick(struct rq *rq) { }
+#endif /* !(CONFIG_SCHED_CORE && CONFIG_SCHEDSTATS) */
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 struct irqtime {
@@ -3630,24 +3653,41 @@ static inline void mm_cid_put(struct mm_struct *mm)
 	__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
 }
-static inline int __mm_cid_try_get(struct mm_struct *mm)
+static inline int __mm_cid_try_get(struct task_struct *t, struct mm_struct *mm)
 {
-	struct cpumask *cpumask;
-	int cid;
+	struct cpumask *cidmask = mm_cidmask(mm);
+	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
+	int cid = __this_cpu_read(pcpu_cid->recent_cid);
-	cpumask = mm_cidmask(mm);
+	/* Try to re-use recent cid. This improves cache locality. */
+	if (!mm_cid_is_unset(cid) && !cpumask_test_and_set_cpu(cid, cidmask))
+		return cid;
 	/*
+	 * Expand cid allocation if the maximum number of concurrency
+	 * IDs allocated (max_nr_cid) is below the number cpus allowed
+	 * and number of threads. Expanding cid allocation as much as
+	 * possible improves cache locality.
+	 */
+	cid = atomic_read(&mm->max_nr_cid);
+	while (cid < READ_ONCE(mm->nr_cpus_allowed) && cid < atomic_read(&mm->mm_users)) {
+		if (!atomic_try_cmpxchg(&mm->max_nr_cid, &cid, cid + 1))
+			continue;
+		if (!cpumask_test_and_set_cpu(cid, cidmask))
+			return cid;
+	}
+	/*
+	 * Find the first available concurrency id.
 	 * Retry finding first zero bit if the mask is temporarily
 	 * filled. This only happens during concurrent remote-clear
 	 * which owns a cid without holding a rq lock.
 	 */
 	for (;;) {
-		cid = cpumask_first_zero(cpumask);
-		if (cid < nr_cpu_ids)
+		cid = cpumask_first_zero(cidmask);
+		if (cid < READ_ONCE(mm->nr_cpus_allowed))
 			break;
 		cpu_relax();
 	}
-	if (cpumask_test_and_set_cpu(cid, cpumask))
+	if (cpumask_test_and_set_cpu(cid, cidmask))
 		return -1;
 	return cid;
@@ -3665,7 +3705,8 @@ static inline void mm_cid_snapshot_time(struct rq *rq, struct mm_struct *mm)
 	WRITE_ONCE(pcpu_cid->time, rq->clock);
 }
-static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
+static inline int __mm_cid_get(struct rq *rq, struct task_struct *t,
+			       struct mm_struct *mm)
 {
 	int cid;
@@ -3675,13 +3716,13 @@ static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
 	 * guarantee forward progress.
 	 */
 	if (!READ_ONCE(use_cid_lock)) {
-		cid = __mm_cid_try_get(mm);
+		cid = __mm_cid_try_get(t, mm);
 		if (cid >= 0)
 			goto end;
 		raw_spin_lock(&cid_lock);
 	} else {
 		raw_spin_lock(&cid_lock);
-		cid = __mm_cid_try_get(mm);
+		cid = __mm_cid_try_get(t, mm);
 		if (cid >= 0)
 			goto unlock;
 	}
@@ -3701,7 +3742,7 @@ static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
 	 * all newcoming allocations observe the use_cid_lock flag set.
 	 */
 	do {
-		cid = __mm_cid_try_get(mm);
+		cid = __mm_cid_try_get(t, mm);
 		cpu_relax();
 	} while (cid < 0);
 	/*
@@ -3718,7 +3759,8 @@ end:
 	return cid;
 }
-static inline int mm_cid_get(struct rq *rq, struct mm_struct *mm)
+static inline int mm_cid_get(struct rq *rq, struct task_struct *t,
+			     struct mm_struct *mm)
 {
 	struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
 	struct cpumask *cpumask;
@@ -3735,8 +3777,9 @@ static inline int mm_cid_get(struct rq *rq, struct mm_struct *mm)
 		if (try_cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, &cid, MM_CID_UNSET))
 			__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
 	}
-	cid = __mm_cid_get(rq, mm);
+	cid = __mm_cid_get(rq, t, mm);
 	__this_cpu_write(pcpu_cid->cid, cid);
+	__this_cpu_write(pcpu_cid->recent_cid, cid);
 	return cid;
 }
@@ -3789,7 +3832,7 @@ static inline void switch_mm_cid(struct rq *rq,
 		prev->mm_cid = -1;
 	}
 	if (next->mm_cid_active)
-		next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next->mm);
+		next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next, next->mm);
 }
 #else /* !CONFIG_SCHED_MM_CID: */
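Illustration, not part of the merge diff: __mm_cid_try_get() above now allocates in three steps - reuse this CPU's recently used concurrency ID if it is still free in the mm's cidmask, otherwise grow max_nr_cid while it stays below both the allowed-CPU count and the thread count, and only then fall back to the first free bit. A standalone user-space sketch of that policy (all names here are made up; the kernel uses atomics and the mm's cpumask):

	/* Standalone sketch, not kernel code. */
	#include <stdbool.h>
	#include <stdio.h>

	#define NR_CIDS 8

	static bool cid_used[NR_CIDS];	/* stands in for mm_cidmask(mm)  */
	static int recent_cid = -1;	/* stands in for pcpu recent_cid */
	static int max_nr_cid;		/* stands in for mm->max_nr_cid  */

	static int cid_try_get(int nr_cpus_allowed, int nr_threads)
	{
		int cid;

		/* 1) Re-use the recently held cid if it is still free. */
		if (recent_cid >= 0 && !cid_used[recent_cid]) {
			cid_used[recent_cid] = true;
			return recent_cid;
		}
		/* 2) Expand while below both limits; keeps cids dense. */
		while (max_nr_cid < nr_cpus_allowed && max_nr_cid < nr_threads) {
			cid = max_nr_cid++;
			if (!cid_used[cid]) {
				cid_used[cid] = true;
				return cid;
			}
		}
		/* 3) Fall back to the first free bit. */
		for (cid = 0; cid < nr_cpus_allowed; cid++) {
			if (!cid_used[cid]) {
				cid_used[cid] = true;
				return cid;
			}
		}
		return -1;
	}

	int main(void)
	{
		int cid = cid_try_get(4, 2);	/* two threads on four CPUs: cid 0 */

		recent_cid = cid;
		cid_used[cid] = false;		/* released when the task switches out */
		/* An intermittent task gets the same, still cache-warm, cid back. */
		printf("first %d, again %d\n", cid, cid_try_get(4, 2));
		return 0;
	}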
@@ -3802,6 +3845,28 @@ static inline void init_sched_mm_cid(struct task_struct *t) { }
 extern u64 avg_vruntime(struct cfs_rq *cfs_rq);
 extern int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se);
+#ifdef CONFIG_SMP
+static inline
+void move_queued_task_locked(struct rq *src_rq, struct rq *dst_rq, struct task_struct *task)
+{
+	lockdep_assert_rq_held(src_rq);
+	lockdep_assert_rq_held(dst_rq);
+	deactivate_task(src_rq, task, 0);
+	set_task_cpu(task, dst_rq->cpu);
+	activate_task(dst_rq, task, 0);
+}
+static inline
+bool task_is_pushable(struct rq *rq, struct task_struct *p, int cpu)
+{
+	if (!task_on_cpu(rq, p) &&
+	    cpumask_test_cpu(cpu, &p->cpus_mask))
+		return true;
+	return false;
+}
+#endif
 #ifdef CONFIG_RT_MUTEXES
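Illustration, not part of the merge diff: the two helpers above consolidate the pick_{rt,dl}_task() predicates and the open-coded deactivate/set_task_cpu/activate sequences removed from rt.c and deadline.c earlier in this diff. A sketch of the caller pattern (both runqueue locks already held, e.g. via double_lock_balance(); this is not a complete push function):

	static int sketch_push_one(struct rq *src_rq, struct rq *dst_rq,
				   struct task_struct *p, int dst_cpu)
	{
		lockdep_assert_rq_held(src_rq);
		lockdep_assert_rq_held(dst_rq);

		if (!task_is_pushable(src_rq, p, dst_cpu))	/* not running, cpu allowed */
			return 0;

		move_queued_task_locked(src_rq, dst_rq, p);	/* dequeue, set cpu, enqueue */
		resched_curr(dst_rq);
		return 1;
	}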


@@ -127,21 +127,25 @@ static inline void psi_account_irqtime(struct rq *rq, struct task_struct *curr,
  * go through migration requeues. In this case, *sleeping* states need
  * to be transferred.
  */
-static inline void psi_enqueue(struct task_struct *p, bool migrate)
+static inline void psi_enqueue(struct task_struct *p, int flags)
 {
 	int clear = 0, set = 0;
 	if (static_branch_likely(&psi_disabled))
 		return;
+	/* Same runqueue, nothing changed for psi */
+	if (flags & ENQUEUE_RESTORE)
+		return;
 	if (p->se.sched_delayed) {
 		/* CPU migration of "sleeping" task */
-		SCHED_WARN_ON(!migrate);
+		SCHED_WARN_ON(!(flags & ENQUEUE_MIGRATED));
 		if (p->in_memstall)
 			set |= TSK_MEMSTALL;
 		if (p->in_iowait)
 			set |= TSK_IOWAIT;
-	} else if (migrate) {
+	} else if (flags & ENQUEUE_MIGRATED) {
 		/* CPU migration of runnable task */
 		set = TSK_RUNNING;
 		if (p->in_memstall)
@@ -158,17 +162,14 @@ static inline void psi_enqueue(struct task_struct *p, bool migrate)
 	psi_task_change(p, clear, set);
 }
-static inline void psi_dequeue(struct task_struct *p, bool migrate)
+static inline void psi_dequeue(struct task_struct *p, int flags)
 {
 	if (static_branch_likely(&psi_disabled))
 		return;
-	/*
-	 * When migrating a task to another CPU, clear all psi
-	 * state. The enqueue callback above will work it out.
-	 */
-	if (migrate)
-		psi_task_change(p, p->psi_flags, 0);
+	/* Same runqueue, nothing changed for psi */
+	if (flags & DEQUEUE_SAVE)
+		return;
 	/*
 	 * A voluntary sleep is a dequeue followed by a task switch. To
@@ -176,6 +177,14 @@ static inline void psi_dequeue(struct task_struct *p, bool migrate)
 	 * TSK_RUNNING and TSK_IOWAIT for us when it moves TSK_ONCPU.
 	 * Do nothing here.
 	 */
+	if (flags & DEQUEUE_SLEEP)
+		return;
+	/*
+	 * When migrating a task to another CPU, clear all psi
+	 * state. The enqueue callback above will work it out.
+	 */
+	psi_task_change(p, p->psi_flags, 0);
 }
 static inline void psi_ttwu_dequeue(struct task_struct *p)
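Illustration, not part of the merge diff: with the boolean parameter gone, the enqueue/dequeue paths can hand their flag word straight to psi and let psi_enqueue()/psi_dequeue() filter out save/restore pairs themselves. A hedged sketch of the calling convention only - the real enqueue_task()/dequeue_task() in core.c are not part of this excerpt and carry more state than shown:

	/* Sketch of the calling convention, not the real core.c code. */
	static void sketch_enqueue_task(struct rq *rq, struct task_struct *p, int flags)
	{
		if (!(flags & ENQUEUE_NOCLOCK))
			update_rq_clock(rq);

		/* Pass the flags through; psi_enqueue() bails out on
		 * ENQUEUE_RESTORE and checks ENQUEUE_MIGRATED itself
		 * instead of taking a precomputed bool. */
		psi_enqueue(p, flags);

		p->sched_class->enqueue_task(rq, p, flags);
	}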


@@ -91,7 +91,7 @@ void set_user_nice(struct task_struct *p, long nice)
 	}
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_current_donor(rq, p);
 	if (queued)
 		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
 	if (running)
@@ -713,7 +713,7 @@ change:
 	dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED | DEQUEUE_NOCLOCK);
 	queued = task_on_rq_queued(p);
-	running = task_current(rq, p);
+	running = task_current_donor(rq, p);
 	if (queued)
 		dequeue_task(rq, p, queue_flags);
 	if (running)


@@ -9,7 +9,7 @@
 static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;
-wait_queue_head_t *bit_waitqueue(void *word, int bit)
+wait_queue_head_t *bit_waitqueue(unsigned long *word, int bit)
 {
 	const int shift = BITS_PER_LONG == 32 ? 5 : 6;
 	unsigned long val = (unsigned long)word << shift | bit;
@@ -55,7 +55,7 @@ __wait_on_bit(struct wait_queue_head *wq_head, struct wait_bit_queue_entry *wbq_
 }
 EXPORT_SYMBOL(__wait_on_bit);
-int __sched out_of_line_wait_on_bit(void *word, int bit,
+int __sched out_of_line_wait_on_bit(unsigned long *word, int bit,
				    wait_bit_action_f *action, unsigned mode)
 {
 	struct wait_queue_head *wq_head = bit_waitqueue(word, bit);
@@ -66,7 +66,7 @@ int __sched out_of_line_wait_on_bit(void *word, int bit,
 EXPORT_SYMBOL(out_of_line_wait_on_bit);
 int __sched out_of_line_wait_on_bit_timeout(
-	void *word, int bit, wait_bit_action_f *action,
+	unsigned long *word, int bit, wait_bit_action_f *action,
	unsigned mode, unsigned long timeout)
 {
 	struct wait_queue_head *wq_head = bit_waitqueue(word, bit);
@@ -108,7 +108,7 @@ __wait_on_bit_lock(struct wait_queue_head *wq_head, struct wait_bit_queue_entry
 }
 EXPORT_SYMBOL(__wait_on_bit_lock);
-int __sched out_of_line_wait_on_bit_lock(void *word, int bit,
+int __sched out_of_line_wait_on_bit_lock(unsigned long *word, int bit,
					 wait_bit_action_f *action, unsigned mode)
 {
 	struct wait_queue_head *wq_head = bit_waitqueue(word, bit);
@@ -118,7 +118,7 @@ int __sched out_of_line_wait_on_bit_lock(void *word, int bit,
 }
 EXPORT_SYMBOL(out_of_line_wait_on_bit_lock);
-void __wake_up_bit(struct wait_queue_head *wq_head, void *word, int bit)
+void __wake_up_bit(struct wait_queue_head *wq_head, unsigned long *word, int bit)
 {
 	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
@@ -128,23 +128,31 @@ void __wake_up_bit(struct wait_queue_head *wq_head, void *word, int bit)
 EXPORT_SYMBOL(__wake_up_bit);
 /**
- * wake_up_bit - wake up a waiter on a bit
- * @word: the word being waited on, a kernel virtual address
- * @bit: the bit of the word being waited on
+ * wake_up_bit - wake up waiters on a bit
+ * @word: the address containing the bit being waited on
+ * @bit: the bit at that address being waited on
  *
- * There is a standard hashed waitqueue table for generic use. This
- * is the part of the hash-table's accessor API that wakes up waiters
- * on a bit. For instance, if one were to have waiters on a bitflag,
- * one would call wake_up_bit() after clearing the bit.
+ * Wake up any process waiting in wait_on_bit() or similar for the
+ * given bit to be cleared.
  *
- * In order for this to function properly, as it uses waitqueue_active()
- * internally, some kind of memory barrier must be done prior to calling
- * this. Typically, this will be smp_mb__after_atomic(), but in some
- * cases where bitflags are manipulated non-atomically under a lock, one
- * may need to use a less regular barrier, such fs/inode.c's smp_mb(),
- * because spin_unlock() does not guarantee a memory barrier.
+ * The wake-up is sent to tasks in a waitqueue selected by hash from a
+ * shared pool. Only those tasks on that queue which have requested
+ * wake_up on this specific address and bit will be woken, and only if the
+ * bit is clear.
+ *
+ * In order for this to function properly there must be a full memory
+ * barrier after the bit is cleared and before this function is called.
+ * If the bit was cleared atomically, such as by clear_bit(), then
+ * smp_mb__after_atomic() can be used, otherwise smp_mb() is needed.
+ * If the bit was cleared with a fully-ordered operation, no further
+ * barrier is required.
+ *
+ * Normally the bit should be cleared by an operation with RELEASE
+ * semantics so that any changes to memory made before the bit is
+ * cleared are guaranteed to be visible after the matching wait_on_bit()
+ * completes.
  */
-void wake_up_bit(void *word, int bit)
+void wake_up_bit(unsigned long *word, int bit)
 {
 	__wake_up_bit(bit_waitqueue(word, bit), word, bit);
 }
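Illustration, not part of the merge diff: the rewritten kernel-doc spells out the barrier contract for the hashed bit-waitqueues. A minimal waiter/waker pairing following that contract (the structure, field and bit names are made up; the fragment assumes <linux/wait_bit.h>):

	struct obj {
		unsigned long flags;	/* bit 0: MY_FLAG_BUSY */
	};
	#define MY_FLAG_BUSY	0

	static void obj_finish(struct obj *obj)
	{
		clear_bit_unlock(MY_FLAG_BUSY, &obj->flags);	/* RELEASE clear */
		smp_mb__after_atomic();				/* full barrier before the wakeup */
		wake_up_bit(&obj->flags, MY_FLAG_BUSY);
	}

	static void obj_wait_idle(struct obj *obj)
	{
		wait_on_bit(&obj->flags, MY_FLAG_BUSY, TASK_UNINTERRUPTIBLE);
	}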
@@ -188,6 +196,36 @@ void init_wait_var_entry(struct wait_bit_queue_entry *wbq_entry, void *var, int
 }
 EXPORT_SYMBOL(init_wait_var_entry);
+/**
+ * wake_up_var - wake up waiters on a variable (kernel address)
+ * @var: the address of the variable being waited on
+ *
+ * Wake up any process waiting in wait_var_event() or similar for the
+ * given variable to change. wait_var_event() can be waiting for an
+ * arbitrary condition to be true and associates that condition with an
+ * address. Calling wake_up_var() suggests that the condition has been
+ * made true, but does not strictly require the condition to use the
+ * address given.
+ *
+ * The wake-up is sent to tasks in a waitqueue selected by hash from a
+ * shared pool. Only those tasks on that queue which have requested
+ * wake_up on this specific address will be woken.
+ *
+ * In order for this to function properly there must be a full memory
+ * barrier after the variable is updated (or more accurately, after the
+ * condition waited on has been made to be true) and before this function
+ * is called. If the variable was updated atomically, such as by
+ * atomic_dec(), then smp_mb__after_atomic() can be used. If the
+ * variable was updated by a fully ordered operation such as
+ * atomic_dec_and_test() then no extra barrier is required. Otherwise
+ * smp_mb() is needed.
+ *
+ * Normally the variable should be updated (the condition should be made
+ * to be true) by an operation with RELEASE semantics such as
+ * smp_store_release() so that any changes to memory made before the
+ * variable was updated are guaranteed to be visible after the matching
+ * wait_var_event() completes.
+ */
 void wake_up_var(void *var)
 {
 	__wake_up_bit(__var_waitqueue(var), var, -1);
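Illustration, not part of the merge diff: the same contract applied to wake_up_var()/wait_var_event(), using a fully ordered update so no extra barrier is needed (the structure and field names are made up):

	struct ctx {
		atomic_t pending_ios;
	};

	static void ctx_io_done(struct ctx *ctx)
	{
		/* atomic_dec_and_test() is fully ordered, so per the
		 * kernel-doc above no extra barrier is needed here. */
		if (atomic_dec_and_test(&ctx->pending_ios))
			wake_up_var(&ctx->pending_ios);
	}

	static void ctx_wait_quiesced(struct ctx *ctx)
	{
		wait_var_event(&ctx->pending_ios, !atomic_read(&ctx->pending_ios));
	}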
@@ -228,20 +266,6 @@ __sched int bit_wait_timeout(struct wait_bit_key *word, int mode)
 }
 EXPORT_SYMBOL_GPL(bit_wait_timeout);
-__sched int bit_wait_io_timeout(struct wait_bit_key *word, int mode)
-{
-	unsigned long now = READ_ONCE(jiffies);
-	if (time_after_eq(now, word->timeout))
-		return -EAGAIN;
-	io_schedule_timeout(word->timeout - now);
-	if (signal_pending_state(mode, current))
-		return -EINTR;
-	return 0;
-}
-EXPORT_SYMBOL_GPL(bit_wait_io_timeout);
 void __init wait_bit_init(void)
 {
 	int i;


@@ -748,10 +748,8 @@ EXPORT_SYMBOL(__tasklet_hi_schedule);
 static bool tasklet_clear_sched(struct tasklet_struct *t)
 {
-	if (test_and_clear_bit(TASKLET_STATE_SCHED, &t->state)) {
-		wake_up_var(&t->state);
+	if (test_and_clear_wake_up_bit(TASKLET_STATE_SCHED, &t->state))
 		return true;
-	}
 	WARN_ONCE(1, "tasklet SCHED state not set: %s %pS\n",
		  t->use_callback ? "callback" : "func",
@@ -871,8 +869,7 @@ void tasklet_kill(struct tasklet_struct *t)
 	if (in_interrupt())
 		pr_notice("Attempt to kill tasklet from interrupt\n");
-	while (test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
-		wait_var_event(&t->state, !test_bit(TASKLET_STATE_SCHED, &t->state));
+	wait_on_bit_lock(&t->state, TASKLET_STATE_SCHED, TASK_UNINTERRUPTIBLE);
 	tasklet_unlock_wait(t);
 	tasklet_clear_sched(t);
@@ -882,16 +879,13 @@ EXPORT_SYMBOL(tasklet_kill);
 #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
 void tasklet_unlock(struct tasklet_struct *t)
 {
-	smp_mb__before_atomic();
-	clear_bit(TASKLET_STATE_RUN, &t->state);
-	smp_mb__after_atomic();
-	wake_up_var(&t->state);
+	clear_and_wake_up_bit(TASKLET_STATE_RUN, &t->state);
 }
 EXPORT_SYMBOL_GPL(tasklet_unlock);
 void tasklet_unlock_wait(struct tasklet_struct *t)
 {
-	wait_var_event(&t->state, !test_bit(TASKLET_STATE_RUN, &t->state));
+	wait_on_bit(&t->state, TASKLET_STATE_RUN, TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL_GPL(tasklet_unlock_wait);
 #endif