commit 7dc603c902 ("sched/fair: Fix PELT integrity for new tasks")
fixed two load tracking problems for new task, including detach on
unattached new task problem.
There still left another detach on unattached task problem for the task
which has been woken up by try_to_wake_up() and waiting for actually
being woken up by sched_ttwu_pending().
try_to_wake_up(p)
cpu = select_task_rq(p)
if (task_cpu(p) != cpu)
set_task_cpu(p, cpu)
migrate_task_rq_fair()
remove_entity_load_avg() --> unattached
se->avg.last_update_time = 0;
__set_task_cpu()
ttwu_queue(p, cpu)
ttwu_queue_wakelist()
__ttwu_queue_wakelist()
task_change_group_fair()
detach_task_cfs_rq()
detach_entity_cfs_rq()
detach_entity_load_avg() --> detach on unattached task
set_task_rq()
attach_task_cfs_rq()
attach_entity_cfs_rq()
attach_entity_load_avg()
The reason of this problem is similar, we should check in detach_entity_cfs_rq()
that se->avg.last_update_time != 0, before do detach_entity_load_avg().
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220818124805.601-7-zhouchengming@bytedance.com
When we are migrating task out of the CPU, we can combine detach and
propagation into dequeue_entity() to save the detach_entity_cfs_rq()
in migrate_task_rq_fair().
This optimization is like combining DO_ATTACH in the enqueue_entity()
when migrating task to the CPU. So we don't have to traverse the CFS tree
extra time to do the detach_entity_cfs_rq() -> propagate_entity_cfs_rq(),
which wouldn't be called anymore with this patch's change.
detach_task()
deactivate_task()
dequeue_task_fair()
for_each_sched_entity(se)
dequeue_entity()
update_load_avg() /* (1) */
detach_entity_load_avg()
set_task_cpu()
migrate_task_rq_fair()
detach_entity_cfs_rq() /* (2) */
update_load_avg();
detach_entity_load_avg();
propagate_entity_cfs_rq();
for_each_sched_entity()
update_load_avg()
This patch save the detach_entity_cfs_rq() called in (2) by doing
the detach_entity_load_avg() for a CPU migrating task inside (1)
(the task being the first se in the loop)
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220818124805.601-6-zhouchengming@bytedance.com
When reading the sched_avg related code, I found the comments in
enqueue/dequeue_entity() are not updated with the current code.
We don't add/subtract entity's runnable_avg from cfs_rq->runnable_avg
during enqueue/dequeue_entity(), those are done only for attach/detach.
This patch updates the comments to reflect the current code working.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220818124805.601-5-zhouchengming@bytedance.com
set_task_rq() -> set_task_rq_fair() will try to synchronize the blocked
task's sched_avg when migrate, which is not needed for already detached
task.
task_change_group_fair() will detached the task sched_avg from prev cfs_rq
first, so reset sched_avg last_update_time before set_task_rq() to avoid that.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220818124805.601-4-zhouchengming@bytedance.com
We use cpu_cgrp_subsys->fork() to set task group for the new fair task
in cgroup_post_fork().
Since commit b1e8206582 ("sched: Fix yet more sched_fork() races")
has already set_task_rq() for the new fair task in sched_cgroup_fork(),
so cpu_cgrp_subsys->fork() can be removed.
cgroup_can_fork() --> pin parent's sched_task_group
sched_cgroup_fork()
__set_task_cpu()
set_task_rq()
cgroup_post_fork()
ss->fork() := cpu_cgroup_fork()
sched_change_group(..., TASK_SET_GROUP)
task_set_group_fair()
set_task_rq() --> can be removed
After this patch's change, task_change_group_fair() only need to
care about task cgroup migration, make the code much simplier.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20220818124805.601-3-zhouchengming@bytedance.com
Previously we only maintain task se depth in task_move_group_fair(),
if a !fair task change task group, its se depth will not be updated,
so commit eb7a59b2c8 ("sched/fair: Reset se-depth when task switched to FAIR")
fix the problem by updating se depth in switched_to_fair() too.
Then commit daa59407b5 ("sched/fair: Unify switched_{from,to}_fair()
and task_move_group_fair()") unified these two functions, moved se.depth
setting to attach_task_cfs_rq(), which further into attach_entity_cfs_rq()
with commit df217913e7 ("sched/fair: Factorize attach/detach entity").
This patch move task se depth maintenance from attach_entity_cfs_rq()
to set_task_rq(), which will be called when CPU/cgroup change, so its
depth will always be correct.
This patch is preparation for the next patch.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220818124805.601-2-zhouchengming@bytedance.com
The load_balance_mask and select_rq_mask percpu variables are only used in
kernel/sched/fair.c.
Make them static and move their allocation into init_sched_fair_class().
Replace kzalloc_node() with zalloc_cpumask_var_node() to get rid of the
CONFIG_CPUMASK_OFFSTACK #ifdef and to align with per-cpu cpumask
allocation for RT (local_cpu_mask in init_sched_rt_class()) and DL
class (local_cpu_mask_dl in init_sched_dl_class()).
[ mingo: Tidied up changelog & touched up the code. ]
Signed-off-by: Bing Huang <huangbing@kylinos.cn>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220722213609.3901-1-huangbing775@126.com
After commit 7a82e5f52a ("sched/fair: Merge for each idle cpu loop of ILB"),
_nohz_idle_balance()'s 'idle' parameter is not used anymore, so we can remove it.
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20220803130223.70419-1-jiahao.os@bytedance.com
Currently, the values of some fields are printed right-aligned, causing
the field value to be next to the next field name rather than next to its
own field name. So print each field value left-aligned, to make it more
readable.
Before:
stack: 0 pid: 307 ppid: 2 flags:0x00000008
After:
stack:0 pid:308 ppid:2 flags:0x0000000a
This also makes them print in the same style as the other two fields:
task:demo0 state:R running task
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20220727060819.1085-1-thunder.leizhen@huawei.com
Save a multiplication in dl_task_fits_capacity() by using already
maintained per-sched_dl_entity (i.e. per-task) `dl_runtime/dl_deadline`
(dl_density).
cap_scale(dl_deadline, cap) >= dl_runtime
dl_deadline * cap >> SCHED_CAPACITY_SHIFT >= dl_runtime
cap >= dl_runtime << SCHED_CAPACITY_SHIFT / dl_deadline
cap >= (dl_runtime << BW_SHIFT / dl_deadline) >>
BW_SHIFT - SCHED_CAPACITY_SHIFT
cap >= dl_density >> BW_SHIFT - SCHED_CAPACITY_SHIFT
__sched_setscheduler()->__checkparam_dl() ensures that the 2 corner
cases (if conditions) `runtime == RUNTIME_INF (-1)` and `period == 0`
of to_ratio(deadline, runtime) are not met when setting dl_density in
__sched_setscheduler()-> __setscheduler_params()->__setparam_dl().
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20220729111305.1275158-4-dietmar.eggemann@arm.com
dl_cpuset_cpumask_can_shrink() is used to validate whether there is
still enough CPU capacity for DL tasks in the reduced cpuset.
Currently it still operates on `# remaining CPUs in the cpuset` (1).
Change this to use the already capacity-aware DL admission control
__dl_overflow() for the `cpumask can shrink` test.
dl_b->bw = sched_rt_period << BW_SHIFT / sched_rt_period
dl_b->bw * (1) >= currently allocated bandwidth in root_domain (rd)
Replace (1) w/ `\Sum CPU capacity in rd >> SCHED_CAPACITY_SHIFT`
Adapt __dl_bw_capacity() to take a cpumask instead of a CPU number
argument so that `rd->span` and `cpumask of the reduced cpuset` can
be used here.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20220729111305.1275158-3-dietmar.eggemann@arm.com
Create an inline helper for conditional code to be only executed on
asymmetric CPU capacity systems. This makes these (currently ~10 and
future) conditions a lot more readable.
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20220729111305.1275158-2-dietmar.eggemann@arm.com
core:
- Fix a few inconsistencies between UP and SMP vs. interrupt affinities
- Small updates and cleanups all over the place
drivers:
- New driver for the LoongArch interrupt controller
- New driver for the Renesas RZ/G2L interrupt controller
- Hotpath optimization for SiFive PLIC
- Workaround for broken PLIC edge triggered interrupts
- Simall cleanups and improvements as usual
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmLn5agTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoV2HD/4u0+09Fd8Awt1Knnb4CInmwFihZ/bu
EiS1Air+MEJ/fyFb5sT/Dn8YdUWYA6a3ifpLMGBwrKCcb5pMwPEtI8uqjSmtgsN/
2Z4o3N5v6EgM25CtrHNBrXK0E9Rz5Py49gm5p3K7+h4g63z9JwrM4G0Bvr8+krLS
EV9IZU6kVmGC6gnG/MspkArsLk1rCM0PU0SJ2lEPsWd1fjhVSDfunvy/qnnzXRzz
wjrcAf+a2Kgb1TMnpL6tx9d2Xx8KrKfODZTdOmPHrdv58F0EbJzapJnAVkYZDPtR
LE2kQc2Qhdawx0kgNNNhvu9P6oZtpnK9N7KAhDQdw17sgrRygINjAMSEe2RykYL1
lK+lJOIzfyd2JkEuC/8w1ZezL88S0EaTNawrkxjJ8L3fa7WDbwilCC+1w95QydCv
sQB137OaLKgWetcRsht9PLWFb4ujkWzxoPf2cPPsm81EzCicNtBuNPLReBTcNrWJ
X2VPpbaqRK8t8bnkXRqhahbq7f8c86feoICHfA4c7T4eZUp/Oq6T8aNvf6WPgjae
I2/FO6kxZj3CQqFkhJGhiZRtGZdx6HLCsL84A+2Ktsra+D8+/qecZCnkHYtz0Vo6
aFuGg+Wj+zuc2QfdaWwG8Dh5dijbxgHGHhzbh9znsWzytN9gfoBxuvBejf65i6sC
In63mEkv35ttfA==
=OnhF
-----END PGP SIGNATURE-----
Merge tag 'irq-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"Updates for interrupt core and drivers:
Core:
- Fix a few inconsistencies between UP and SMP vs interrupt
affinities
- Small updates and cleanups all over the place
New drivers:
- LoongArch interrupt controller
- Renesas RZ/G2L interrupt controller
Updates:
- Hotpath optimization for SiFive PLIC
- Workaround for broken PLIC edge triggered interrupts
- Simall cleanups and improvements as usual"
* tag 'irq-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
irqchip/mmp: Declare init functions in common header file
irqchip/mips-gic: Check the return value of ioremap() in gic_of_init()
genirq: Use for_each_action_of_desc in actions_show()
irqchip / ACPI: Introduce ACPI_IRQ_MODEL_LPIC for LoongArch
irqchip: Add LoongArch CPU interrupt controller support
irqchip: Add Loongson Extended I/O interrupt controller support
irqchip/loongson-liointc: Add ACPI init support
irqchip/loongson-pch-msi: Add ACPI init support
irqchip/loongson-pch-pic: Add ACPI init support
irqchip: Add Loongson PCH LPC controller support
LoongArch: Prepare to support multiple pch-pic and pch-msi irqdomain
LoongArch: Use ACPI_GENERIC_GSI for gsi handling
genirq/generic_chip: Export irq_unmap_generic_chip
ACPI: irq: Allow acpi_gsi_to_irq() to have an arch-specific fallback
APCI: irq: Add support for multiple GSI domains
LoongArch: Provisionally add ACPICA data structures
irqdomain: Use hwirq_max instead of revmap_size for NOMAP domains
irqdomain: Report irq number for NOMAP domains
irqchip/gic-v3: Fix comment typo
dt-bindings: interrupt-controller: renesas,rzg2l-irqc: Document RZ/V2L SoC
...
- Fix Intel Alder Lake PEBS memory access latency & data source profiling info bugs.
- Use Intel large-PEBS hardware feature in more circumstances, to reduce
PMI overhead & reduce sampling data.
- Extend the lost-sample profiling output with the PERF_FORMAT_LOST ABI variant,
which tells tooling the exact number of samples lost.
- Add new IBS register bits definitions.
- AMD uncore events: Add PerfMonV2 DF (Data Fabric) enhancements.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmLn5MARHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jAWA/+N48UX35dD0u3k5S2zdYJRHzQkdbivVGc
dOuCB3XTJYneaSI5byQkI4Xo8LUuMbF4q2Zi3/+XhTaqn2zYPP65D6ACL5hU9Shh
F95TnLWbedIaxSJmjMCsWDlwBob8WgtLhokWvyq+ks66BqaDoBKHRtn+2fi0rwZb
MbuN0199Gx/EicWEOeUGBSxoeKbjSix0BApqy+CuXC0DC3+3iwIPk4dbNfHXpHYs
nqxjQKhJnoxdlgjiOY3UuYhdCZl1cuQFIu2Ce1N2nXCAgR2FeQD7ZqtcaA2TnsAO
9BwRfLljavzHhOoz0zALR42kF+eOcnH5K9pIBx7py9Hjdmdsx88fUCovWK34MdG5
KTuqiMWNLIUvoP9WBjl7wUtl2+vcjr9XwgCdneOO+zoNsk44qSRyer1RpEP6D9UM
e9HvdXBVRzhnIhK9NYugeLJ+3nxvFL+OLvc3ZfUrtm04UzeetCBxMlvMv3y021V7
0fInZjhzh4Dz2tJgNlG7AKXkXlsHlyj6/BH9uKc9wVokK+94g5mbspxW8R4gKPr2
l06pYB7ttSpp26sq9vl5ASHO0rniiYAPsQcr7Ko3y72mmp6kfIe/HzYNhCEvgYe2
6JJ8F9kPgRuKr0CwGvUzxFwBC7PJR80zUtZkRCIpV+rgxQcNmK5YXp/KQFIjQqkI
rJfEaDOshl0=
=DqaA
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf events updates from Ingo Molnar:
- Fix Intel Alder Lake PEBS memory access latency & data source
profiling info bugs.
- Use Intel large-PEBS hardware feature in more circumstances, to
reduce PMI overhead & reduce sampling data.
- Extend the lost-sample profiling output with the PERF_FORMAT_LOST ABI
variant, which tells tooling the exact number of samples lost.
- Add new IBS register bits definitions.
- AMD uncore events: Add PerfMonV2 DF (Data Fabric) enhancements.
* tag 'perf-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/ibs: Add new IBS register bits into header
perf/x86/intel: Fix PEBS data source encoding for ADL
perf/x86/intel: Fix PEBS memory access info encoding for ADL
perf/core: Add a new read format to get a number of lost samples
perf/x86/amd/uncore: Add PerfMonV2 RDPMC assignments
perf/x86/amd/uncore: Add PerfMonV2 DF event format
perf/x86/amd/uncore: Detect available DF counters
perf/x86/amd/uncore: Use attr_update for format attributes
perf/x86/amd/uncore: Use dynamic events array
x86/events/intel/ds: Enable large PEBS for PERF_SAMPLE_WEIGHT_TYPE
- lockdep: Fix a handful of the more complex lockdep_init_map_*() primitives
that can lose the lock_type & cause false reports. No such mishap was
observed in the wild.
- jump_label improvements: simplify the cross-arch support of
initial NOP patching by making it arch-specific code (used on MIPS only),
and remove the s390 initial NOP patching that was superfluous.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmLn3jERHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1hzeg/7BTC90XeMANhTiL23iiH7dOYZwqdFeB12
VBqdaPaGC8i+mJzVAdGyPFwCFDww6Ak6P33PcHkemuIO5+DhWis8hfw5krHEOO1k
AyVSMOZuWJ8/g6ZenjgNFozQ8C+3NqURrpdqN55d7jhMazPWbsNLLqUgvSSqo6DY
Ah2O+EKrDfGNCxT6/YaTAmUryctotxafSyFDQxv3RKPfCoIIVv9b3WApYqTOqFIu
VYTPr+aAcMsU20hPMWQI4kbQaoCxFqr3bZiZtAiS/IEunqi+PlLuWjrnCUpLwVTC
+jOCkNJHt682FPKTWelUnCnkOg9KhHRujRst5mi1+2tWAOEvKltxfe05UpsZYC3b
jhzddREMwBt3iYsRn65LxxsN4AMK/C/41zjejHjZpf+Q5kwDsc6Ag3L5VifRFURS
KRwAy9ejoVYwnL7CaVHM2zZtOk4YNxPeXmiwoMJmOufpdmD1LoYbNUbpSDf+goIZ
yPJpxFI5UN8gi8IRo3DMe4K2nqcFBC3wFn8tNSAu+44gqDwGJAJL6MsLpkLSZkk8
3QN9O11UCRTJDkURjoEWPgRRuIu9HZ4GKNhiblDy6gNM/jDE/m5OG4OYfiMhojgc
KlMhsPzypSpeApL55lvZ+AzxH8mtwuUGwm8lnIdZ2kIse1iMwapxdWXWq9wQr8eW
jLWHgyZ6rcg=
=4B89
-----END PGP SIGNATURE-----
Merge tag 'locking-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"This was a fairly quiet cycle for the locking subsystem:
- lockdep: Fix a handful of the more complex lockdep_init_map_*()
primitives that can lose the lock_type & cause false reports. No
such mishap was observed in the wild.
- jump_label improvements: simplify the cross-arch support of initial
NOP patching by making it arch-specific code (used on MIPS only),
and remove the s390 initial NOP patching that was superfluous"
* tag 'locking-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/lockdep: Fix lockdep_init_map_*() confusion
jump_label: make initial NOP patching the special case
jump_label: mips: move module NOP patching into arch code
jump_label: s390: avoid pointless initial NOP patching
Load-balancing improvements:
============================
- Improve NUMA balancing on AMD Zen systems for affine workloads.
- Improve the handling of reduced-capacity CPUs in load-balancing.
- Energy Model improvements: fix & refine all the energy fairness metrics (PELT),
and remove the conservative threshold requiring 6% energy savings to
migrate a task. Doing this improves power efficiency for most workloads,
and also increases the reliability of energy-efficiency scheduling.
- Optimize/tweak select_idle_cpu() to spend (much) less time searching
for an idle CPU on overloaded systems. There's reports of several
milliseconds spent there on large systems with large workloads ...
[ Since the search logic changed, there might be behavioral side effects. ]
- Improve NUMA imbalance behavior. On certain systems
with spare capacity, initial placement of tasks is non-deterministic,
and such an artificial placement imbalance can persist for a long time,
hurting (and sometimes helping) performance.
The fix is to make fork-time task placement consistent with runtime
NUMA balancing placement.
Note that some performance regressions were reported against this,
caused by workloads that are not memory bandwith limited, which benefit
from the artificial locality of the placement bug(s). Mel Gorman's
conclusion, with which we concur, was that consistency is better than
random workload benefits from non-deterministic bugs:
"Given there is no crystal ball and it's a tradeoff, I think it's
better to be consistent and use similar logic at both fork time
and runtime even if it doesn't have universal benefit."
- Improve core scheduling by fixing a bug in sched_core_update_cookie() that
caused unnecessary forced idling.
- Improve wakeup-balancing by allowing same-LLC wakeup of idle CPUs for newly
woken tasks.
- Fix a newidle balancing bug that introduced unnecessary wakeup latencies.
ABI improvements/fixes:
=======================
- Do not check capabilities and do not issue capability check denial messages
when a scheduler syscall doesn't require privileges. (Such as increasing niceness.)
- Add forced-idle accounting to cgroups too.
- Fix/improve the RSEQ ABI to not just silently accept unknown flags.
(No existing tooling is known to have learned to rely on the previous behavior.)
- Depreciate the (unused) RSEQ_CS_FLAG_NO_RESTART_ON_* flags.
Optimizations:
==============
- Optimize & simplify leaf_cfs_rq_list()
- Micro-optimize set_nr_{and_not,if}_polling() via try_cmpxchg().
Misc fixes & cleanups:
======================
- Fix the RSEQ self-tests on RISC-V and Glibc 2.35 systems.
- Fix a full-NOHZ bug that can in some cases result in the tick not being
re-enabled when the last SCHED_RT task is gone from a runqueue but there's
still SCHED_OTHER tasks around.
- Various PREEMPT_RT related fixes.
- Misc cleanups & smaller fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmLn2ywRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iNfxAAhPJMwM4tYCpIM6PhmxKiHl6kkiT2tt42
HhEmiJVLjczLybWaWwmGA2dSFkv1f4+hG7nqdZTm9QYn0Pqat2UTSRcwoKQc+gpB
x85Hwt2IUmnUman52fRl5r1miH9LTdCI6agWaFLQae5ds1XmOugFo52t2ahax+Gn
dB8LxS2fa/GrKj229EhkJSPWAK4Y94asoTProwpKLuKEeXhDkqUNrOWbKhz+wEnA
pVZySpA9uEOdNLVSr1s0VB6mZoh5/z6yQefj5YSNntsG71XWo9jxKCIm5buVdk2U
wjdn6UzoTThOy/5Ygm64eYRexMHG71UamF1JYUdmvDeUJZ5fhG6RD0FECUQNVcJB
Msu2fce6u1AV0giZGYtiooLGSawB/+e6MoDkjTl8guFHi/peve9CezKX1ZgDWPfE
eGn+EbYkUS9RMafXCKuEUBAC1UUqAavGN9sGGN1ufyR4za6ogZplOqAFKtTRTGnT
/Ne3fHTtvv73DLGW9ohO5vSS2Rp7zhAhB6FunhibhxCWlt7W6hA4Ze2vU9hf78Yn
SJDLAJjOEilLaKUkRG/d9uM3FjKJM1tqxuT76+sUbM0MNxdyiKcviQlP1b8oq5Um
xE1KNZUevnr/WXqOTGDKHH/HNPFgwxbwavMiP7dNFn8h/hEk4t9dkf5siDmVHtn4
nzDVOob1LgE=
=xr2b
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Load-balancing improvements:
- Improve NUMA balancing on AMD Zen systems for affine workloads.
- Improve the handling of reduced-capacity CPUs in load-balancing.
- Energy Model improvements: fix & refine all the energy fairness
metrics (PELT), and remove the conservative threshold requiring 6%
energy savings to migrate a task. Doing this improves power
efficiency for most workloads, and also increases the reliability
of energy-efficiency scheduling.
- Optimize/tweak select_idle_cpu() to spend (much) less time
searching for an idle CPU on overloaded systems. There's reports of
several milliseconds spent there on large systems with large
workloads ...
[ Since the search logic changed, there might be behavioral side
effects. ]
- Improve NUMA imbalance behavior. On certain systems with spare
capacity, initial placement of tasks is non-deterministic, and such
an artificial placement imbalance can persist for a long time,
hurting (and sometimes helping) performance.
The fix is to make fork-time task placement consistent with runtime
NUMA balancing placement.
Note that some performance regressions were reported against this,
caused by workloads that are not memory bandwith limited, which
benefit from the artificial locality of the placement bug(s). Mel
Gorman's conclusion, with which we concur, was that consistency is
better than random workload benefits from non-deterministic bugs:
"Given there is no crystal ball and it's a tradeoff, I think
it's better to be consistent and use similar logic at both fork
time and runtime even if it doesn't have universal benefit."
- Improve core scheduling by fixing a bug in
sched_core_update_cookie() that caused unnecessary forced idling.
- Improve wakeup-balancing by allowing same-LLC wakeup of idle CPUs
for newly woken tasks.
- Fix a newidle balancing bug that introduced unnecessary wakeup
latencies.
ABI improvements/fixes:
- Do not check capabilities and do not issue capability check denial
messages when a scheduler syscall doesn't require privileges. (Such
as increasing niceness.)
- Add forced-idle accounting to cgroups too.
- Fix/improve the RSEQ ABI to not just silently accept unknown flags.
(No existing tooling is known to have learned to rely on the
previous behavior.)
- Depreciate the (unused) RSEQ_CS_FLAG_NO_RESTART_ON_* flags.
Optimizations:
- Optimize & simplify leaf_cfs_rq_list()
- Micro-optimize set_nr_{and_not,if}_polling() via try_cmpxchg().
Misc fixes & cleanups:
- Fix the RSEQ self-tests on RISC-V and Glibc 2.35 systems.
- Fix a full-NOHZ bug that can in some cases result in the tick not
being re-enabled when the last SCHED_RT task is gone from a
runqueue but there's still SCHED_OTHER tasks around.
- Various PREEMPT_RT related fixes.
- Misc cleanups & smaller fixes"
* tag 'sched-core-2022-08-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
rseq: Kill process when unknown flags are encountered in ABI structures
rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_* flags
sched/core: Fix the bug that task won't enqueue into core tree when update cookie
nohz/full, sched/rt: Fix missed tick-reenabling bug in dequeue_task_rt()
sched/core: Always flush pending blk_plug
sched/fair: fix case with reduced capacity CPU
sched/core: Use try_cmpxchg in set_nr_{and_not,if}_polling
sched/core: add forced idle accounting for cgroups
sched/fair: Remove the energy margin in feec()
sched/fair: Remove task_util from effective utilization in feec()
sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu()
sched/fair: Rename select_idle_mask to select_rq_mask
sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util()
sched/fair: Decay task PELT values during wakeup migration
sched/fair: Provide u64 read for 32-bits arch helper
sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avg
sched: only perform capability check on privileged operation
sched: Remove unused function group_first_cpu()
sched/fair: Remove redundant word " *"
selftests/rseq: check if libc rseq support is registered
...
rseq_abi()->flags and rseq_abi()->rseq_cs->flags 29 upper bits are
currently unused.
The current behavior when those bits are set is to ignore them. This is
not an ideal behavior, because when future features will start using
those flags, if user-space fails to correctly validate that the kernel
indeed supports those flags (e.g. with a new sys_rseq flags bit) before
using them, it may incorrectly assume that the kernel will handle those
flags way when in fact those will be silently ignored on older kernels.
Validating that unused flags bits are cleared will allow a smoother
transition when those flags will start to be used by allowing
applications to fail early, and obviously, when they attempt to use the
new flags on an older kernel that does not support them.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20220622194617.1155957-2-mathieu.desnoyers@efficios.com
- Check the IBPB feature flag before enabling IBPB in firmware calls
because cloud vendors' fantasy when it comes to creating guest
configurations is unlimited
- Unexport sev_es_ghcb_hv_call() before 5.19 releases now that HyperV
doesn't need it anymore
- Remove dead CONFIG_* items
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLmVtEACgkQEsHwGGHe
VUoPnBAApfqJMYSnevjBqhiO7W/8s1GDkbvzZD/qHwQKIiTSNZWmB1QGaBJLmPWr
6UvsFq3ElxFkg7rovHKYV197cHZlldWNt6BC2mDUESAHZb8HMw38e0IUcxbOJHZq
DnLVxcek3VkDG8THGSoY+NX3lvcvTx+w5C7o2SZnjBxhBYMBEXWP14UvoVAWV+HT
/vEcHi3jkYiNwyTtQFdszIxF5u5qMo2qV24hiTZDYFHBBsEGTRxVRgo4kHBQlQ/t
3AxrW01Ut4zunqKlXG0wXncF1aSgfsb7XplR9bqfWz9eQzFHkZ0DqqfoCXQZRQZo
nYQQT/A/hY2rm/HFBZ329hDm6fnu+u/8FzaBGm3DUp9UWGLqxFcCqH+QtKmpJXhr
wTK/7mB2Baw0lhc110LhDLLFydI8smQwfPf8B9IzR3Ij7j9OYqO8+NFwNR+tMk+J
VWl5aFafzVEQcf7gBGVsu/sRkxc05VtEohOV25J9VHDzlaBCMCvCpoGKfwntpp0h
9xaWUNE9/P1ggbRcxUHVmdnDnoNn087hqUBOO7GOX/cnFvADMjL3h0GqvZinj/wI
8BbpTxAU8i5qodJcsnnzxtzekxzKk6KhcHo/sMULyVSAeDnTfaPIkyfE3b6Pxiam
U1QFTWPqV9371u26dnF0bYsg+UEJasuuth8noybVwej+MJvapts=
=fEYI
-----END PGP SIGNATURE-----
Merge tag 'x86_urgent_for_v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Update the 'mitigations=' kernel param documentation
- Check the IBPB feature flag before enabling IBPB in firmware calls
because cloud vendors' fantasy when it comes to creating guest
configurations is unlimited
- Unexport sev_es_ghcb_hv_call() before 5.19 releases now that HyperV
doesn't need it anymore
- Remove dead CONFIG_* items
* tag 'x86_urgent_for_v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
docs/kernel-parameters: Update descriptions for "mitigations=" param with retbleed
x86/bugs: Do not enable IBPB at firmware entry when IBPB is not available
Revert "x86/sev: Expose sev_es_ghcb_hv_call() for use by HyperV"
x86/configs: Update configs in x86_debug.config
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLmUPkACgkQEsHwGGHe
VUqgow/+Oj8acqImjR1OGW0MGW5F4OBRxPlWYGRBem0PwtysKSOUEuLKFGrfUPP8
9/o/WDK7sKm0A0Ph4++zyuxQVUdww1kWR1BaOzBBJZMhB3dYk511JW2EZc7TPQg8
qnBWOh1WGztaIATImo1JtN7GVlz6mWEq5i7CkyYWOfqqgMMfzS5N548KtFs37k1F
GPwR2fntThsgYlL7+5ekHVBabx3Lf5CvpUkct484LtIrvO9xvBr+R5fzxdkd/j7s
xGVFpt0sMEGjnOatLP+Q41E6n4Vugzjk9FdxOAYLcSl8NPGj/7HUtXB0oLcU7jSn
eFxr2vurueVxpueNieBKJNiSicFsgx+QNsEtERtzLfyosgKtDkWtl5cP6k7qzqVm
9KGAWc5tiQJ5DcIoxf+pKBEXBnf6EKFS7PrknYFTbWPFnbun0nw4OnFLufUgeg9c
qB6afbWUOwKLWYIcJZadmnvmE2ZhaPAv1KPvqeE7E8ln5ERbg2UKY4qV37bvyJFg
N+gVv+acSip4KtGswGUBKFriJ/vvN1dh/PiBqqJC3AHwlz+CxYsOVgpk9tkhlaQ9
1HsQ51hyN/pb688J9SshqZf2BH3qS6Kz4eLa1eXGPEywsRBJfg4lufncn1JbrCg8
CzkUfVPbS31LahMDs5U3IWGSiYSUsy1JDRLZ2zns9ZEMaaZWPKQ=
=SBw2
-----END PGP SIGNATURE-----
Merge tag 'locking_urgent_for_v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Borislav Petkov:
- Avoid rwsem lockups in certain situations when handling the handoff
bit
* tag 'locking_urgent_for_v5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rwsem: Allow slowpath writer to ignore handoff bit if not set by first waiter
With commit d257cc8cb8 ("locking/rwsem: Make handoff bit handling more
consistent"), the writer that sets the handoff bit can be interrupted
out without clearing the bit if the wait queue isn't empty. This disables
reader and writer optimistic lock spinning and stealing.
Now if a non-first writer in the queue is somehow woken up or a new
waiter enters the slowpath, it can't acquire the lock. This is not the
case before commit d257cc8cb8 as the writer that set the handoff bit
will clear it when exiting out via the out_nolock path. This is less
efficient as the busy rwsem stays in an unlock state for a longer time.
In some cases, this new behavior may cause lockups as shown in [1] and
[2].
This patch allows a non-first writer to ignore the handoff bit if it
is not originally set or initiated by the first waiter. This patch is
shown to be effective in fixing the lockup problem reported in [1].
[1] https://lore.kernel.org/lkml/20220617134325.GC30825@techsingularity.net/
[2] https://lore.kernel.org/lkml/3f02975c-1a9d-be20-32cf-f1d8e3dfafcc@oracle.com/
Fixes: d257cc8cb8 ("locking/rwsem: Make handoff bit handling more consistent")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: John Donnelly <john.p.donnelly@oracle.com>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/r/20220622200419.778799-1-longman@redhat.com
Just one commit to suppress a spurious warning added during the 5.19 cycle.
-----BEGIN PGP SIGNATURE-----
iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCYuQfNg4cdGpAa2VybmVs
Lm9yZwAKCRCxYfJx3gVYGdjFAQDAPPlHskr1oC6d2k2nqPNEzEpOq1LWLxRK/hR2
dddxsgD+KV0GMGb43W5Au2lbscze1WNM9jeanpofRoyV+l1gyQA=
=hlX7
-----END PGP SIGNATURE-----
Merge tag 'wq-for-5.19-rc8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
"Just one commit to suppress a spurious warning added during the 5.19
cycle"
* tag 'wq-for-5.19-rc8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Avoid a false warning in unbind_workers()
Doing set_cpus_allowed_ptr() with wq_unbound_cpumask can be possible
fails and trigger the false warning.
Use cpu_possible_mask instead when wq_unbound_cpumask has no active CPUs.
It is very easy to trigger the warning:
Set wq_unbound_cpumask to a small set of CPUs.
Offline all the CPUs of wq_unbound_cpumask.
Offline an extra CPU and trigger the warning.
Fixes: 10a5a651e3 ("workqueue: Restrict kworker in the offline CPU pool running on housekeeping CPUs")
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
If a watch is being added to a queue, it needs to guard against
interference from addition of a new watch, manual removal of a watch and
removal of a watch due to some other queue being destroyed.
KEYCTL_WATCH_KEY guards against this for the same {key,queue} pair by
holding the key->sem writelocked and by holding refs on both the key and
the queue - but that doesn't prevent interaction from other {key,queue}
pairs.
While add_watch_to_object() does take the spinlock on the event queue,
it doesn't take the lock on the source's watch list. The assumption was
that the caller would prevent that (say by taking key->sem) - but that
doesn't prevent interference from the destruction of another queue.
Fix this by locking the watcher list in add_watch_to_object().
Fixes: c73be61ced ("pipe: Add general notification queue support")
Reported-by: syzbot+03d7b43290037d1f87ca@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: keyrings@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since __post_watch_notification() walks wlist->watchers with only the
RCU read lock held, we need to use RCU methods to add to the list (we
already use RCU methods to remove from the list).
Fix add_watch_to_object() to use hlist_add_head_rcu() instead of
hlist_add_head() for that list.
Fixes: c73be61ced ("pipe: Add general notification queue support")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Core code update:
- Non-SMP IRQ affinity fixes, allowing UP kernel to behave similarly
to SMP ones for the purpose of interrupt affinity
- Let irq_set_chip_handler_name_locked() take a const struct irq_chip *
- Tidy-up the NOMAP irqdomain API variant
- Teach action_show() to use for_each_action_of_desc()
- Make irq_chip_request_resources_parent() allow the parent callback
to be optional
- Remove dynamic allocations from populate_parent_alloc_arg()
* New drivers:
- Merge the long awaited IRQ support for the LoongArch architecture,
with the provisional ACPICA update (to be reverted once the official
support lands)
- New Renesas RZ/G2L IRQC driver, equipped with its companion GPIO
driver
* Driver updates
- Optimise the hot path operations for the SiFive PLIC, trading the
locking for per-CPU priority masking masking operations which are
apparently faster
- Work around broken PLIC implementations that deal pretty badly with
edge-triggered interrupts. Flag two implementations as affected.
- Simplify the irq-stm32-exti driver, particularly the table that
remaps the interrupts from exti to the GIC, reducing the memory usage
- Convert the ocelot irq_chip to being immutable
- Check ioremap() return value in the MIPS GIC driver
- Move MMP driver init function declarations into the common .h
- The obligatory typo fixes
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmLhi0EPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDI+wP/2BPABqCwu7JAmue8hHtpGweVkEBNulaS1K+
1v/ElU8E1P8ppn1AVmu0lwgmAWiTtPuVWT21+AUbfOjQQ/MYKkegkH4KLmK93qSi
Dn3MEiYv8WYsEV4yrJ7Il7fuuzr1iHIhIfhg3tMxDAJx47lzZH0o8nVoNFqXD1Ro
WUFab/qTAOxJ/I53R9nrpx/yRf5dVRFUAZznrabYKpc/CiD/X+RLcHkbiybbRERu
n3xwEtGMA2faCUgifKhsXoNUaW9mZLaufoF/F3J3Dyt7jNB/TTmdnxqWo6e4/rtd
+Ut0MlH0W7bUdHGiVl1E90hDQ3yuBykUpKlTfMoYWOxeTqAx0bPYjGIuh6ajrAIy
+fXWcK89KimOGB+aLC0cR5YrzvShHnH1G2qlrQg3pAXporigAXfZvzhnouRBsVKO
RfnQHNsHSQHXTEu1u2HjMt7AXtmy/SoJENuPPUrtLfojg8b3aupwOvPLVx7w1Ok0
5TKZ2yhOHdskfr3lCPisVPKK0KZ+QZhDdBkd319JkxR5/iA/tzMeMTzKslruhd2U
Ug6hYhKNE2kKkBBBMcEVCHAuuq94DnU/q6l458UTSkkBmvq5cMMSz5Fs0kMwGFRc
q/DncpKQnrPKGrwiilj1uGgOWO2vec8KfMJUYtKzSM/QELbKUvF7yzZeIjUQxiDO
KqlWNczX
=E5fZ
-----END PGP SIGNATURE-----
Merge tag 'irqchip-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core
Pull irqchip/genirq updates from Marc Zyngier:
* Core code update:
- Non-SMP IRQ affinity fixes, allowing UP kernel to behave similarly
to SMP ones for the purpose of interrupt affinity
- Let irq_set_chip_handler_name_locked() take a const struct irq_chip *
- Tidy-up the NOMAP irqdomain API variant
- Teach action_show() to use for_each_action_of_desc()
- Make irq_chip_request_resources_parent() allow the parent callback
to be optional
- Remove dynamic allocations from populate_parent_alloc_arg()
* New drivers:
- Merge the long awaited IRQ support for the LoongArch architecture,
with the provisional ACPICA update (to be reverted once the official
support lands)
- New Renesas RZ/G2L IRQC driver, equipped with its companion GPIO
driver
* Driver updates
- Optimise the hot path operations for the SiFive PLIC, trading the
locking for per-CPU priority masking masking operations which are
apparently faster
- Work around broken PLIC implementations that deal pretty badly with
edge-triggered interrupts. Flag two implementations as affected.
- Simplify the irq-stm32-exti driver, particularly the table that
remaps the interrupts from exti to the GIC, reducing the memory usage
- Convert the ocelot irq_chip to being immutable
- Check ioremap() return value in the MIPS GIC driver
- Move MMP driver init function declarations into the common .h
- The obligatory typo fixes
Link: https://lore.kernel.org/all/20220727192356.1860546-1-maz@kernel.org
* irq/misc-5.20:
: .
: Misc IRQ changes for 5.20:
:
: - Let irq_set_chip_handler_name_locked() take a const struct irq_chip *
:
: - Convert the ocelot irq_chip to being immutable (depends on the above)
:
: - Tidy-up the NOMAP irqdomain API variant
:
: - Teach action_show() to use for_each_action_of_desc()
:
: - Check ioremap() return value in the MIPS GIC driver
:
: - Move MMP driver init function declarations into the common .h
:
: - The obligatory typo fixes
: .
irqchip/mmp: Declare init functions in common header file
irqchip/mips-gic: Check the return value of ioremap() in gic_of_init()
genirq: Use for_each_action_of_desc in actions_show()
irqdomain: Use hwirq_max instead of revmap_size for NOMAP domains
irqdomain: Report irq number for NOMAP domains
irqchip/gic-v3: Fix comment typo
pinctrl: ocelot: Make irq_chip immutable
genirq: Allow irq_set_chip_handler_name_locked() to take a const irq_chip
Signed-off-by: Marc Zyngier <maz@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLdDPMACgkQEsHwGGHe
VUqmvRAAjIKW411+4FH0QRS5itqugSSwx6gAqX8muM4dn5B/aY0P3I14ghxrxnTt
QG9XuevCK1i4OpxKcZLALC5fRqLNKr20WAdO+sNJzJYz1krB264bHOy46UY3MuwL
wv2/nTLiR0MQisLfjE8Cdot0bYcTnIyKqSDvdrfBxqmoKo553A8uRTMZl3iPYU+W
e3NhmG0PPzWzSz4y/Autk1GQMHOKcvvcPdsAUI+S2FwiQt/TIZ/Px2152NSV5Q4Y
TIYfN5ylNw4BZxkq9tM3NMrZnrhT4TRihlYDf7PFf3WHDgh5vQmtOUZvLpH/0ZVO
KUGCH2BPpfTVL4WBxfB2ADJWXoudEVb5r00JdI6TI9yYUXUE726BOOs3TwH+xvhP
nGcLGErJcvFMYABMvJ7tLQpcC5561MNnqfRBO3svcVkNRKQb7r7UGqUpoevkpXLw
63G+HxzDFbs0BOwaOr8hjUnhu78hKVjHXr6IbBTjda7P5WNQgTE0a9oD1JiLAJVa
RLupgq0X0FlTQip+EtbOhdGPui1HTDzYbGRoXkOxFBND4Zce9DEIkF1exsnITQat
hsvCdUjhqOX5aOlrKTSVYAi4utYm5GcOU84x4andvg7z9vYDJpqpDXHFtXjF/za9
TIj3W4PXbNHaPsMD7Ph4RF98HDyrrVvwFce4wfx5u9Xrix/lVHE=
=jolV
-----END PGP SIGNATURE-----
Merge tag 'sched_urgent_for_v5.19_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Borislav Petkov:
"A single fix to correct a wrong BUG_ON() condition for deboosted
tasks"
* tag 'sched_urgent_for_v5.19_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Fix BUG_ON condition for deboosted tasks
This pull request contains a pair of commits that fix 282d8998e9 ("srcu:
Prevent expedited GPs and blocking readers from consuming CPU"), which
was itself a fix to an SRCU expedited grace-period problem that could
prevent kernel live patching (KLP) from completing. That SRCU fix for
KLP introduced large (as in minutes) boot-time delays to embedded Linux
kernels running on qemu/KVM. These delays were due to the emulation of
certain MMIO operations controlling memory layout, which were emulated
with one expedited grace period per access. Common configurations
required thousands of boot-time MMIO accesses, and thus thousands of
boot-time expedited SRCU grace periods.
In these configurations, the occasional sleeps that allowed KLP to proceed
caused excessive boot delays. These commits preserve enough sleeps to
permit KLP to proceed, but few enough that the virtual embedded kernels
still boot reasonably quickly.
This represents a regression introduced in the v5.19 merge window,
and the bug is causing significant inconvenience, hence this pull request.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmLZ6LoTHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jNHgD/4tb8Un6vZlrEaYbyA/ztUITX/2DisS
kiqbQz1BH8V3B3PxSo4ldEiw+z3fC3SMyIPymuu9bhwm6SFdjEsarFkIqySxkYnX
jnuk0JbWxs4Kk64rIkHHzAxzvM2Iw1EjSzjP1M+DC7iymSJpsgp+0zFJJtcJ8Y87
67hbQRQYk+1T7ZT+vq77NiyAAFEzSd8UydgBVxlsOOdkXQ91NYTyB8D6ldUJAnLU
opwCEpgpu74Sp4Te5q6f9uAt8xZmXsyrm8zJgzTz0KSgivcpt4GmIoyEFYUQczj0
Hewr6+qM9AWfvfQxNvRCS25yeox18kbdp1qdp9rl0BZMtYN2Zsk1Ec4c79s7NBLc
G3TIvJkGLHuZO1dO4BhLkYczgRYlaPxOR/0GKNn4m69/TbVmseUL1WeZS0pswB0q
cH1AKKEg9KdPoaX0hTLoOrlv/vwbgjhKKuoqEv7yEUhJJdACy50rmnhWhSxeuQDb
aIITVKkjkwpDtRX5QTdG1f5uIMoGz9BbUDv7VeodB0mrYHluXEfyNTwlqcISKAgm
T9kLmsdfvMrQ4fLR5S3i3dwnL3b52OB8h5NyfW3YRkXEnA7//ef/XpPiW2HY8BMT
7QwPqOoUSr/IraAcI8j0QxRpioUk1oaNi+UJ3FSHni8re6rZ0kaxatRCT20h6Djq
C9RVLaevw3bGXQ==
=ndhB
-----END PGP SIGNATURE-----
Merge tag 'rcu-urgent.2022.07.21a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU fix from Paul McKenney:
"This contains a pair of commits that fix 282d8998e9 ("srcu: Prevent
expedited GPs and blocking readers from consuming CPU"), which was
itself a fix to an SRCU expedited grace-period problem that could
prevent kernel live patching (KLP) from completing.
That SRCU fix for KLP introduced large (as in minutes) boot-time
delays to embedded Linux kernels running on qemu/KVM. These delays
were due to the emulation of certain MMIO operations controlling
memory layout, which were emulated with one expedited grace period per
access. Common configurations required thousands of boot-time MMIO
accesses, and thus thousands of boot-time expedited SRCU grace
periods.
In these configurations, the occasional sleeps that allowed KLP to
proceed caused excessive boot delays. These commits preserve enough
sleeps to permit KLP to proceed, but few enough that the virtual
embedded kernels still boot reasonably quickly.
This represents a regression introduced in the v5.19 merge window, and
the bug is causing significant inconvenience"
* tag 'rcu-urgent.2022.07.21a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
srcu: Make expedited RCU grace periods block even less frequently
srcu: Block less aggressively for expedited grace periods
Sedat Dilek noticed that I had an extraneous semicolon at the end of a
line in the previous patch.
It's harmless, but unintentional, and while compilers just treat it as
an extra empty statement, for all I know some other tooling might warn
about it. So clean it up before other people notice too ;)
Fixes: 353f7988dd ("watchqueue: make sure to serialize 'wqueue->defunct' properly")
Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
In function sched_core_update_cookie(), a task will enqueue into the
core tree only when it enqueued before, that is, if an uncookied task
is cookied, it will not enqueue into the core tree until it enqueue
again, which will result in unnecessary force idle.
Here follows the scenario:
CPU x and CPU y are a pair of SMT siblings.
1. Start task a running on CPU x without sleeping, and task b and
task c running on CPU y without sleeping.
2. We create a cookie and share it to task a and task b, and then
we create another cookie and share it to task c.
3. Simpling core_forceidle_sum of task a and b from /proc/PID/sched
And we will find out that core_forceidle_sum of task a takes 30%
time of the sampling period, which shouldn't happen as task a and b
have the same cookie.
Then we migrate task a to CPU x', migrate task b and c to CPU y', where
CPU x' and CPU y' are a pair of SMT siblings, and sampling again, we
will found out that core_forceidle_sum of task a and b are almost zero.
To solve this problem, we enqueue the task into the core tree if it's
on rq.
Fixes: 6e33cad0af49("sched: Trivial core scheduling cookie management")
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1656403045-100840-2-git-send-email-CruzZhao@linux.alibaba.com
dequeue_task_rt() only decrements 'rt_rq->rt_nr_running' after having
called sched_update_tick_dependency() preventing it from re-enabling the
tick on systems that no longer have pending SCHED_RT tasks but have
multiple runnable SCHED_OTHER tasks:
dequeue_task_rt()
dequeue_rt_entity()
dequeue_rt_stack()
dequeue_top_rt_rq()
sub_nr_running() // decrements rq->nr_running
sched_update_tick_dependency()
sched_can_stop_tick() // checks rq->rt.rt_nr_running,
...
__dequeue_rt_entity()
dec_rt_tasks() // decrements rq->rt.rt_nr_running
...
Every other scheduler class performs the operation in the opposite
order, and sched_update_tick_dependency() expects the values to be
updated as such. So avoid the misbehaviour by inverting the order in
which the above operations are performed in the RT scheduler.
Fixes: 76d92ac305 ("sched: Migrate sched to use new tick dependency mask model")
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20220628092259.330171-1-nsaenzju@redhat.com
Tasks the are being deboosted from SCHED_DEADLINE might enter
enqueue_task_dl() one last time and hit an erroneous BUG_ON condition:
since they are not boosted anymore, the if (is_dl_boosted()) branch is
not taken, but the else if (!dl_prio) is and inside this one we
BUG_ON(!is_dl_boosted), which is of course false (BUG_ON triggered)
otherwise we had entered the if branch above. Long story short, the
current condition doesn't make sense and always leads to triggering of a
BUG.
Fix this by only checking enqueue flags, properly: ENQUEUE_REPLENISH has
to be present, but additional flags are not a problem.
Fixes: 64be6f1f5f ("sched/deadline: Don't replenish from a !SCHED_DEADLINE entity")
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20220714151908.533052-1-juri.lelli@redhat.com
When the pipe is closed, we mark the associated watchqueue defunct by
calling watch_queue_clear(). However, while that is protected by the
watchqueue lock, new watchqueue entries aren't actually added under that
lock at all: they use the pipe->rd_wait.lock instead, and looking up
that pipe happens without any locking.
The watchqueue code uses the RCU read-side section to make sure that the
wqueue entry itself hasn't disappeared, but that does not protect the
pipe_info in any way.
So make sure to actually hold the wqueue lock when posting watch events,
properly serializing against the pipe being torn down.
Reported-by: Noam Rathaus <noamr@ssd-disclosure.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* irq/loongarch:
: .
: Merge the long awaited IRQ support for the LoongArch architecture.
:
: From the cover letter:
:
: "Currently, LoongArch based processors (e.g. Loongson-3A5000)
: can only work together with LS7A chipsets. The irq chips in
: LoongArch computers include CPUINTC (CPU Core Interrupt
: Controller), LIOINTC (Legacy I/O Interrupt Controller),
: EIOINTC (Extended I/O Interrupt Controller), PCH-PIC (Main
: Interrupt Controller in LS7A chipset), PCH-LPC (LPC Interrupt
: Controller in LS7A chipset) and PCH-MSI (MSI Interrupt Controller)."
:
: Note that this comes with non-official, arch private ACPICA
: definitions until the official ACPICA update is realeased.
: .
irqchip / ACPI: Introduce ACPI_IRQ_MODEL_LPIC for LoongArch
irqchip: Add LoongArch CPU interrupt controller support
irqchip: Add Loongson Extended I/O interrupt controller support
irqchip/loongson-liointc: Add ACPI init support
irqchip/loongson-pch-msi: Add ACPI init support
irqchip/loongson-pch-pic: Add ACPI init support
irqchip: Add Loongson PCH LPC controller support
LoongArch: Prepare to support multiple pch-pic and pch-msi irqdomain
LoongArch: Use ACPI_GENERIC_GSI for gsi handling
genirq/generic_chip: Export irq_unmap_generic_chip
ACPI: irq: Allow acpi_gsi_to_irq() to have an arch-specific fallback
APCI: irq: Add support for multiple GSI domains
LoongArch: Provisionally add ACPICA data structures
Signed-off-by: Marc Zyngier <maz@kernel.org>
Refactor action_show() to use for_each_action_of_desc instead
of a similar open-coded loop.
Signed-off-by: Paran Lee <p4ranlee@gmail.com>
[maz: reword commit message]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220710112614.19410-1-p4ranlee@gmail.com
Some irq controllers have to re-implement a private version for
irq_generic_chip_ops, because they have a different xlate to translate
hwirq. Export irq_unmap_generic_chip to allow reusing in drivers.
Signed-off-by: Jianmin Lv <lvjianmin@loongson.cn>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/1658314292-35346-5-git-send-email-lvjianmin@loongson.cn
The purpose of commit 282d8998e9 ("srcu: Prevent expedited GPs
and blocking readers from consuming CPU") was to prevent a long
series of never-blocking expedited SRCU grace periods from blocking
kernel-live-patching (KLP) progress. Although it was successful, it also
resulted in excessive boot times on certain embedded workloads running
under qemu with the "-bios QEMU_EFI.fd" command line. Here "excessive"
means increasing the boot time up into the three-to-four minute range.
This increase in boot time was due to the more than 6000 back-to-back
invocations of synchronize_rcu_expedited() within the KVM host OS, which
in turn resulted from qemu's emulation of a long series of MMIO accesses.
Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace
periods") did not significantly help this particular use case.
Zhangfei Gao and Shameerali Kolothum Thodi did experiments varying the
value of SRCU_MAX_NODELAY_PHASE with HZ=250 and with various values
of non-sleeping per phase counts on a system with preemption enabled,
and observed the following boot times:
+──────────────────────────+────────────────+
| SRCU_MAX_NODELAY_PHASE | Boot time (s) |
+──────────────────────────+────────────────+
| 100 | 30.053 |
| 150 | 25.151 |
| 200 | 20.704 |
| 250 | 15.748 |
| 500 | 11.401 |
| 1000 | 11.443 |
| 10000 | 11.258 |
| 1000000 | 11.154 |
+──────────────────────────+────────────────+
Analysis on the experiment results show additional improvements with
CPU-bound delays approaching one jiffy in duration. This improvement was
also seen when number of per-phase iterations were scaled to one jiffy.
This commit therefore scales per-grace-period phase number of non-sleeping
polls so that non-sleeping polls extend for about one jiffy. In addition,
the delay-calculation call to srcu_get_delay() in srcu_gp_end() is
replaced with a simple check for an expedited grace period. This change
schedules callback invocation immediately after expedited grace periods
complete, which results in greatly improved boot times. Testing done
by Marc and Zhangfei confirms that this change recovers most of the
performance degradation in boottime; for CONFIG_HZ_250 configuration,
specifically, boot times improve from 3m50s to 41s on Marc's setup;
and from 2m40s to ~9.7s on Zhangfei's setup.
In addition to the changes to default per phase delays, this
change adds 3 new kernel parameters - srcutree.srcu_max_nodelay,
srcutree.srcu_max_nodelay_phase, and srcutree.srcu_retry_check_delay.
This allows users to configure the srcu grace period scanning delays in
order to more quickly react to additional use cases.
Fixes: 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods")
Fixes: 282d8998e9 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU")
Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reported-by: yueluck <yueluck@163.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Tested-by: Marc Zyngier <maz@kernel.org>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Commit 282d8998e9 ("srcu: Prevent expedited GPs and blocking readers
from consuming CPU") fixed a problem where a long-running expedited SRCU
grace period could block kernel live patching. It did so by giving up
on expediting once a given SRCU expedited grace period grew too old.
Unfortunately, this added excessive delays to boots of virtual embedded
systems specifying "-bios QEMU_EFI.fd" to qemu. This commit therefore
makes the transition away from expediting less aggressive, increasing
the per-grace-period phase number of non-sleeping polls of readers from
one to three and increasing the required grace-period age from one jiffy
(actually from zero to one jiffies) to two jiffies (actually from one
to two jiffies).
Fixes: 282d8998e9 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU")
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reported-by: chenxiang (M)" <chenxiang66@hisilicon.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/
NOMAP irq domains use the revmap_size field to indicate the maximum
hwirq number the domain accepts. This is a bit confusing as
revmap_size is usually used to indicate the size of the revmap array,
which a NOMAP domain doesn't have.
Instead, use the hwirq_max field which has the correct semantics, and
keep revmap_size to 0 for a NOMAP domain.
Signed-off-by: Xu Qiang <xuqiang36@huawei.com>
[maz: commit message]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220719063641.56541-3-xuqiang36@huawei.com
When using a NOMAP domain, __irq_resolve_mapping() doesn't store
the Linux IRQ number at the address optionally provided by the caller.
While this isn't a huge deal (the returned value is guaranteed
to the hwirq that was passed as a parameter), let's honour the letter
of the API by writing the expected value.
Fixes: d22558dd0a (“irqdomain: Introduce irq_resolve_mapping()”)
Signed-off-by: Xu Qiang <xuqiang36@huawei.com>
[maz: commit message]
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20220719063641.56541-2-xuqiang36@huawei.com
loops due to insufficient locking
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLTu80ACgkQEsHwGGHe
VUrophAApPj8K9M6+JLeVKNocQMA+1XhWL/HRVmabI+1TpxO4/663wcbbUI04Z5e
51dGvvCBK413duoDUAn8tYAPjQStTrwqAS/toHcaSj+dDPHzZDd3M/Gn68SRy08d
if26OjsXIGTZoHsCYJx0y01m9XHY4ZhVTtonsc3jZCmb/b8/feSBZcMtw+tASDAw
8m/P9rHfzVlfBYmZnyf2NH24NTVcHgoQUGobDo16ve1CTvH8d4jEr+YPsNLTYN+P
4cUslnvRG4HhC/u8namO8CbQVuXicyJBVdVBtfUJ0+IKojie7zCkVUOIPv+mWgQ7
r1XE2MPSeFQRa0IptiA0vIXQCgs9BRj6cBzgo2f3Y0QjU0GGKLTcIKrILv95aej7
X12+uNLKfnkYU4vuyG4o4AnXh047YxgfWLSQ569c/hHKuw8klTQkh0PbJEs6Epn0
21dU+9/p66ZPTCXXjEDDNsMHeVY00+lkdEOu9YzNzMUfR5Fo+zbAN7X9jiDAQDqc
D9IdDeEmhdmrEKNOkankMTBF1tG1XiU5zWerREeMHRMKpJhxC5X1BGIDpuEq1PJD
xa7uAPvc0O6WmNfVvXaJ2GFPzx8oq9inlocNk/0I2ZJxgkGFqKCYUZQI0AdtzPAj
dHx66z09uXMQN+ecXwf5pF1QS/R6BEajOaUhBEFPUZ21pkEl12c=
=/ETy
-----END PGP SIGNATURE-----
Merge tag 'perf_urgent_for_v5.19_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:
- A single data race fix on the perf event cleanup path to avoid
endless loops due to insufficient locking
* tag 'perf_urgent_for_v5.19_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Fix data race between perf_event_set_output() and perf_mmap_close()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmLRfc8ACgkQUqAMR0iA
lPIoGg/+M1WzHrSD4R9Per6WijKUxh3iL227uUd4QgsXKyA+2B/LMNOx6cGY24UW
xo1hCKcvn0Q2xSKC96fuMPWgax7tjrnCGY3Jii095Q3pCIVCjknYi9tVq9GWlnkK
FmwsxQMdX8llQPz4STttRISAq1E/RFOu4ImDvsBhO/45pW1f6lX+ITWixuMuqcRU
X1ILQZ6gxuO9KDOKxfv7Go5owDSaWqYK7skjfIFlfDUy0o2p4moqndwj4OQWdsAU
UOJvEeUc/ExvGW//xxkkuekGEqlsTpFj7LJeYl5jwT1FxNhVRVcrM1ds1Q3NApg4
9pyVdzQBgf+ZhBLPn1MqMEitSVz36A0lt41kUMdZ2g5pgHTPpqsgUQrCiqmUTJUo
mM/7QvYDw4qFaPfxRSNWI4Nsy/dOevTcIJQhJC/nMKVGMnBv1C9xK9uzQuooK7BF
zQXZeuktYjjhc115yYtFh22u1IEkRcttHd6aIqNAkplSVB+CmrRZuhmfNmJomQgD
Rqn58fcHUvQYMtS9H14W2cKgpifG0uN1Qjq0hZ81bT8cSjNiVJQklifDtsEj+Oor
sK7mLxmDdYhwcGHGz6Pt6iMLZbzUxgGcIMGUIIcYRakafttKwS/Wq6yIACB/zzkE
LMxiSASOJDX7bh0qZNoOAekz3YUbhIr9PIIs9/OS22U2mL2LXcA=
=vRnn
-----END PGP SIGNATURE-----
Merge tag 'printk-for-5.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk fix from Petr Mladek:
- Make pr_flush() fast when consoles are suspended.
* tag 'printk-for-5.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk: do not wait for consoles when suspended
The console_stop() and console_start() functions call pr_flush().
When suspending, these functions are called by the serial subsystem
while the serial port is suspended. In this scenario, if there are
any pending messages, a call to pr_flush() will always result in a
timeout because the serial port cannot make forward progress. This
causes longer suspend and resume times.
Add a check in pr_flush() so that it will immediately timeout if
the consoles are suspended.
Fixes: 3b604ca812 ("printk: add pr_flush()")
Reported-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20220715061042.373640-2-john.ogness@linutronix.de
"numa_stat" should not be included in the scope of CONFIG_HUGETLB_PAGE, if
CONFIG_HUGETLB_PAGE is not configured even if CONFIG_NUMA is configured,
"numa_stat" is missed form /proc. Move it out of CONFIG_HUGETLB_PAGE to
fix it.
Fixes: 4518085e12 ("mm, sysctl: make NUMA stats configurable")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Current release - regressions:
- wifi: rtw88: fix write to const table of channel parameters
Current release - new code bugs:
- mac80211: add gfp_t parameter to
ieeee80211_obss_color_collision_notify
- mlx5:
- TC, allow offload from uplink to other PF's VF
- Lag, decouple FDB selection and shared FDB
- Lag, correct get the port select mode str
- bnxt_en: fix and simplify XDP transmit path
- r8152: fix accessing unset transport header
Previous releases - regressions:
- conntrack: fix crash due to confirmed bit load reordering
(after atomic -> refcount conversion)
- stmmac: dwc-qos: disable split header for Tegra194
Previous releases - always broken:
- mlx5e: ring the TX doorbell on DMA errors
- bpf: make sure mac_header was set before using it
- mac80211: do not wake queues on a vif that is being stopped
- mac80211: fix queue selection for mesh/OCB interfaces
- ip: fix dflt addr selection for connected nexthop
- seg6: fix skb checksums for SRH encapsulation/insertion
- xdp: fix spurious packet loss in generic XDP TX path
- bunch of sysctl data race fixes
- nf_log: incorrect offset to network header
Misc:
- bpf: add flags arg to bpf_dynptr_read and bpf_dynptr_write APIs
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmLQXuAACgkQMUZtbf5S
Irv3sBAAxoD5A0Q5zRLmfTvbXth8fVfWmqvDfxJvOcChr97Q/JyCTZrmSIqhIz85
6ADxF45PuOivpBU8dA3MF9gCtvlWcU6SJpRVZOP0v+FfZBESGdskG9OWXlS50mht
IF64LlEzfjvD8Mylf2xiuuuaDcDzuF9s2KXCBSh3qFDXP9VYPaSMjA22+YwApkvT
29EKSujBIod0ScIeP6rA7nZKtxNloGp+tGNeHqxP+LrALq5pQlwA43wTyvcgvfME
QgGsqUcn4UzaxJ6YIFNNwx+KRJI7JCdgxNupehaExdnvZJNHDuxSZKXwkCKFOhB6
vOQDDbfDCtTaFfw0elpF18hayUtDyl9ezAR1DlxZWwyPv46gHFlH/PreXLf4Zvvh
R8dAP5YLQjtNri3Ae8gdiQYzct0WXKjiauNdjF60Hh1dXe6j01Vbqh92J96Zr14U
uxDRWzKi1pyfrAULY4BB7sRLXc6IllcUFEnMmKYhYl7afV8VB0OjQ83VKjxW4az8
gcczXejgW6rNcV128vLYHICUCawoiIlA29efM17vGG7Q65O/vhqOxO8Moi1hiQN+
2GwMWxCQ3lIXz41oQ2TNt3ekDYuSFhj8T/qyQEOckp+QW91nbseJBIhyU7MF0WE9
e5sETK8CJMzQwF/zkJMAuohvc+IelGdhRayHVGBYWGwVN1CCqiU=
=TFnI
-----END PGP SIGNATURE-----
Merge tag 'net-5.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter, bpf and wireless.
Still no major regressions, the release continues to be calm. An
uptick of fixes this time around due to trivial data race fixes and
patches flowing down from subtrees.
There has been a few driver fixes (particularly a few fixes for false
positives due to 66e4c8d950 which went into -next in May!) that make
me worry the wide testing is not exactly fully through.
So "calm" but not "let's just cut the final ASAP" vibes over here.
Current release - regressions:
- wifi: rtw88: fix write to const table of channel parameters
Current release - new code bugs:
- mac80211: add gfp_t arg to ieeee80211_obss_color_collision_notify
- mlx5:
- TC, allow offload from uplink to other PF's VF
- Lag, decouple FDB selection and shared FDB
- Lag, correct get the port select mode str
- bnxt_en: fix and simplify XDP transmit path
- r8152: fix accessing unset transport header
Previous releases - regressions:
- conntrack: fix crash due to confirmed bit load reordering (after
atomic -> refcount conversion)
- stmmac: dwc-qos: disable split header for Tegra194
Previous releases - always broken:
- mlx5e: ring the TX doorbell on DMA errors
- bpf: make sure mac_header was set before using it
- mac80211: do not wake queues on a vif that is being stopped
- mac80211: fix queue selection for mesh/OCB interfaces
- ip: fix dflt addr selection for connected nexthop
- seg6: fix skb checksums for SRH encapsulation/insertion
- xdp: fix spurious packet loss in generic XDP TX path
- bunch of sysctl data race fixes
- nf_log: incorrect offset to network header
Misc:
- bpf: add flags arg to bpf_dynptr_read and bpf_dynptr_write APIs"
* tag 'net-5.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (87 commits)
nfp: flower: configure tunnel neighbour on cmsg rx
net/tls: Check for errors in tls_device_init
MAINTAINERS: Add an additional maintainer to the AMD XGBE driver
xen/netback: avoid entering xenvif_rx_next_skb() with an empty rx queue
selftests/net: test nexthop without gw
ip: fix dflt addr selection for connected nexthop
net: atlantic: remove aq_nic_deinit() when resume
net: atlantic: remove deep parameter on suspend/resume functions
sfc: fix kernel panic when creating VF
seg6: bpf: fix skb checksum in bpf_push_seg6_encap()
seg6: fix skb checksum in SRv6 End.B6 and End.B6.Encaps behaviors
seg6: fix skb checksum evaluation in SRH encapsulation/insertion
sfc: fix use after free when disabling sriov
net: sunhme: output link status with a single print.
r8152: fix accessing unset transport header
net: stmmac: fix leaks in probe
net: ftgmac100: Hold reference returned by of_get_child_by_name()
nexthop: Fix data-races around nexthop_compat_mode.
ipv4: Fix data-races around sysctl_ip_dynaddr.
tcp: Fix a data-race around sysctl_tcp_ecn_fallback.
...