linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-23 04:31:50 +00:00

Author	SHA1	Message	Date
Tejun Heo	978b8409ea	workqueue: Factor out work_grab_pending() from __cancel_work_sync() The planned disable/enable support will need the same logic. Let's factor it out. No functional changes. v2: Update function comment to include @irq_flags. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-02-20 19:36:14 -10:00
Tejun Heo	e9a8e01f9b	workqueue: Clean up enum work_bits and related constants The bits of work->data are used for a few different purposes. How the bits are used is determined by enum work_bits. The planned disable/enable support will add another use, so let's clean it up a bit in preparation. - Let WORK_STRUCT_*_BIT's values be determined by enum definition order. - Deliminate different bit sections the same way using SHIFT and BITS values. - Rename __WORK_OFFQ_CANCELING to WORK_OFFQ_CANCELING_BIT for consistency. - Introduce WORK_STRUCT_PWQ_SHIFT and replace WORK_STRUCT_FLAG_MASK and WORK_STRUCT_WQ_DATA_MASK with WQ_STRUCT_PWQ_MASK for clarity. - Improve documentation. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-02-20 19:36:14 -10:00
Tejun Heo	c5f5b9422a	workqueue: Introduce work_cancel_flags The cancel path used bool @is_dwork to distinguish canceling a regular work and a delayed one. The planned disable/enable support will need passing around another flag in the code path. As passing them around with bools will be confusing, let's introduce named flags to pass around in the cancel path. WORK_CANCEL_DELAYED replaces @is_dwork. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-02-20 19:36:14 -10:00
Tejun Heo	c26e2f2e2f	workqueue: Use variable name irq_flags for saving local irq flags Using the generic term `flags` for irq flags is conventional but can be confusing as there's quite a bit of code dealing with work flags which involves some subtleties. Let's use a more explicit name `irq_flags` for local irq flags. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-02-20 19:36:14 -10:00
Tejun Heo	cdc6e4b329	workqueue: Reorganize flush and cancel[_sync] functions They are currently a bit disorganized with flush and cancel functions mixed. Reoranize them so that flush functions come first, cancel next and cancel_sync last. This way, we won't have to add prototypes for internal functions for the planned disable/enable support. This is pure code reorganization. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-02-20 19:36:14 -10:00
Tejun Heo	c5140688d1	workqueue: Rename __cancel_work_timer() to __cancel_timer_sync() __cancel_work_timer() is used to implement cancel_work_sync() and cancel_delayed_work_sync(), similarly to how __cancel_work() is used to implement cancel_work() and cancel_delayed_work(). ie. The _timer part of the name is a complete misnomer. The difference from __cancel_work() is the fact that it syncs against work item execution not whether it handles timers or not. Let's rename it to less confusing __cancel_work_sync(). No functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-02-20 19:36:14 -10:00
Tejun Heo	d355001fa9	workqueue: Use rcu_read_lock_any_held() instead of rcu_read_lock_held() The different flavors of RCU read critical sections have been unified. Let's update the locking assertion macros accordingly to avoid requiring unnecessary explicit rcu_read_[un]lock() calls. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-02-20 19:36:13 -10:00
Tejun Heo	c7a40c49af	workqueue: Cosmetic changes Reorder some global declarations and adjust comments and whitespaces for clarity and consistency. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-02-20 19:36:13 -10:00
Tejun Heo	fd0a68a233	workqueue, irq_work: Build fix for !CONFIG_IRQ_WORK `2f34d7337d` ("workqueue: Fix queue_work_on() with BH workqueues") added irq_work usage to workqueue; however, it turns out irq_work is actually optional and the change breaks build on configuration which doesn't have CONFIG_IRQ_WORK enabled. Fix build by making workqueue use irq_work only when CONFIG_SMP and enabling CONFIG_IRQ_WORK when CONFIG_SMP is set. It's reasonable to argue that it may be better to just always enable it. However, this still saves a small bit of memory for tiny UP configs and also the least amount of change, so, for now, let's keep it conditional. Verified to do the right thing for x86_64 allnoconfig and defconfig, and aarch64 allnoconfig, allnoconfig + prink disable (SMP but nothing selects IRQ_WORK) and a modified aarch64 Kconfig where !SMP and nothing selects IRQ_WORK. v2: `depends on SMP` leads to Kconfig warnings when CONFIG_IRQ_WORK is selected by something else when !CONFIG_SMP. Use `def_bool y if SMP` instead. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Tested-by: Anders Roxell <anders.roxell@linaro.org> Fixes: `2f34d7337d` ("workqueue: Fix queue_work_on() with BH workqueues") Cc: Stephen Rothwell <sfr@canb.auug.org.au>	2024-02-16 06:33:43 -10:00
Tejun Heo	2f34d7337d	workqueue: Fix queue_work_on() with BH workqueues When queue_work_on() is used to queue a BH work item on a remote CPU, the work item is queued on that CPU but kick_pool() raises softirq on the local CPU. This leads to stalls as the work item won't be executed until something else on the remote CPU schedules a BH work item or tasklet locally. Fix it by bouncing raising softirq to the target CPU using per-cpu irq_work. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `4cb1ef6460` ("workqueue: Implement BH workqueues to eventually replace tasklets")	2024-02-14 08:33:55 -10:00
Tejun Heo	bf52b1ac6a	async: Use a dedicated unbound workqueue with raised min_active Async can schedule a number of interdependent work items. However, since `5797b1c189` ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues"), unbound workqueues have separate min_active which sets the number of interdependent work items that can be handled. This default value is 8 which isn't sufficient for async and can lead to stalls during resume from suspend in some cases. Let's use a dedicated unbound workqueue with raised min_active. Link: http://lkml.kernel.org/r/708a65cc-79ec-44a6-8454-a93d0f3114c3@samsung.com Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-09 11:13:59 -10:00
Tejun Heo	8f172181f2	workqueue: Implement workqueue_set_min_active() Since `5797b1c189` ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues"), unbound workqueues have separate min_active which sets the number of interdependent work items that can be handled. This value is currently initialized to WQ_DFL_MIN_ACTIVE which is 8. This isn't high enough for some users, let's add an interface to adjust the setting. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-09 11:13:59 -10:00
Waiman Long	516d3dc99f	workqueue: Fix kernel-doc comment of unplug_oldest_pwq() Fix the kernel-doc comment of the unplug_oldest_pwq() function to enable proper processing and formatting of the embedded ASCII diagram. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-09 11:04:13 -10:00
Waiman Long	49584bb8dd	workqueue: Bind unbound workqueue rescuer to wq_unbound_cpumask Commit `85f0ab43f9` ("kernel/workqueue: Bind rescuer to unbound cpumask for WQ_UNBOUND") modified init_rescuer() to bind rescuer of an unbound workqueue to the cpumask in wq->unbound_attrs. However unbound_attrs->cpumask's of all workqueues are initialized to cpu_possible_mask and will only be changed if it has the WQ_SYSFS flag to expose a cpumask sysfs file to be written by users. So this patch doesn't achieve what it is intended to do. If an unbound workqueue is created after wq_unbound_cpumask is modified and there is no more unbound cpumask update after that, the unbound rescuer will be bound to all CPUs unless the workqueue is created with the WQ_SYSFS flag and a user explicitly modified its cpumask sysfs file. Fix this problem by binding directly to wq_unbound_cpumask in init_rescuer(). Fixes: `85f0ab43f9` ("kernel/workqueue: Bind rescuer to unbound cpumask for WQ_UNBOUND") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-08 09:23:38 -10:00
Juri Lelli	d64f2fa064	kernel/workqueue: Let rescuers follow unbound wq cpumask changes When workqueue cpumask changes are committed the associated rescuer (if one exists) affinity is not touched and this might be a problem down the line for isolated setups. Make sure rescuers affinity is updated every time a workqueue cpumask changes, so that rescuers can't break isolation. [longman: set_cpus_allowed_ptr() will block until the designated task is enqueued on an allowed CPU, no wake_up_process() needed. Also use the unbound_effective_cpumask() helper as suggested by Tejun.] Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-08 09:23:32 -10:00
Waiman Long	4c065dbce1	workqueue: Enable unbound cpumask update on ordered workqueues Ordered workqueues does not currently follow changes made to the global unbound cpumask because per-pool workqueue changes may break the ordering guarantee. IOW, a work function in an ordered workqueue may run on an isolated CPU. This patch enables ordered workqueues to follow changes made to the global unbound cpumask by temporaily plug or suspend the newly allocated pool_workqueue from executing newly queued work items until the old pwq has been properly drained. For ordered workqueues, there should only be one pwq that is unplugged, the rests should be plugged. This enables ordered workqueues to follow the unbound cpumask changes like other unbound workqueues at the expense of some delay in execution of work functions during the transition period. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-08 09:22:49 -10:00
Waiman Long	26fb7e3dda	workqueue: Link pwq's into wq->pwqs from oldest to newest Add a new pwq into the tail of wq->pwqs so that pwq iteration will start from the oldest pwq to the newest. This ordering will facilitate the inclusion of ordered workqueues in a wq_unbound_cpumask update. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-08 09:22:30 -10:00
Tejun Heo	40911d4457	Merge branch 'for-6.8-fixes' into for-6.9 The for-6.8-fixes commit ae9cc8956944 ("Revert "workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()") also fixes build for Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-05 15:49:47 -10:00
Tejun Heo	aac8a59537	Revert "workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()" This reverts commit `ca10d851b9`. The commit allowed workqueue_apply_unbound_cpumask() to clear __WQ_ORDERED on now removed implicitly ordered workqueues. This was incorrect in that system-wide config change shouldn't break ordering properties of all workqueues. The reason why apply_workqueue_attrs() path was allowed to do so was because it was targeting the specific workqueue - either the workqueue had WQ_SYSFS set or the workqueue user specifically tried to change max_active, both of which indicate that the workqueue doesn't need to be ordered. The implicitly ordered workqueue promotion was removed by the previous commit `3bc1e711c2` ("workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered"). However, it didn't update this path and broke build. Let's revert the commit which was incorrect in the first place which also fixes build. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: `3bc1e711c2` ("workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered") Fixes: `ca10d851b9` ("workqueue: Override implicit ordered attribute in workqueue_apply_unbound_cpumask()") Cc: stable@vger.kernel.org # v6.6+ Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-05 15:49:06 -10:00
Tejun Heo	3bc1e711c2	workqueue: Don't implicitly make UNBOUND workqueues w/ @max_active==1 ordered `5c0338c687` ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered") automoatically promoted UNBOUND workqueues w/ @max_active==1 to ordered workqueues because UNBOUND workqueues w/ @max_active==1 used to be the way to create ordered workqueues and the new NUMA support broke it. These problems can be subtle and the fact that they can only trigger on NUMA machines made them even more difficult to debug. However, overloading the UNBOUND allocation interface this way creates other issues. It's difficult to tell whether a given workqueue actually needs to be ordered and users that legitimately want a min concurrency level wq unexpectedly gets an ordered one instead. With planned UNBOUND workqueue udpates to improve execution locality and more prevalence of chiplet designs which can benefit from such improvements, this isn't a state we wanna be in forever. There aren't that many UNBOUND w/ @max_active==1 users in the tree and the preceding patches audited all and converted them to alloc_ordered_workqueue() as appropriate. This patch removes the implicit promotion of UNBOUND w/ @max_active==1 workqueues to ordered ones. v2: v1 patch incorrectly dropped !list_empty(&wq->pwqs) condition in apply_workqueue_attrs_locked() which spuriously triggers WARNING and fails workqueue creation. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/oe-lkp/202304251050.45a5df1f-oliver.sang@intel.com	2024-02-05 14:19:10 -10:00
Waiman Long	8eb17dc1a6	workqueue: Skip __WQ_DESTROYING workqueues when updating global unbound cpumask Skip updating workqueues with __WQ_DESTROYING bit set when updating global unbound cpumask to avoid unnecessary work and other complications. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-05 07:52:22 -10:00
Wang Jinchao	96068b6030	workqueue: fix a typo in comment There should be three, fix it. Signed-off-by: Wang Jinchao <wangjinchao@xfusion.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-05 07:48:25 -10:00
Tejun Heo	4f19b8e01e	Revert "workqueue: make wq_subsys const" This reverts commit `d412ace111`. This leads to build failures as it depends on a driver-core commit `32f78abe59` ("driver core: bus: constantify subsys_register() calls"). Let's drop it from wq tree and route it through driver-core tree. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202402051505.kM9Rr3CJ-lkp@intel.com/	2024-02-05 07:19:54 -10:00
Tejun Heo	4cb1ef6460	workqueue: Implement BH workqueues to eventually replace tasklets The only generic interface to execute asynchronously in the BH context is tasklet; however, it's marked deprecated and has some design flaws such as the execution code accessing the tasklet item after the execution is complete which can lead to subtle use-after-free in certain usage scenarios and less-developed flush and cancel mechanisms. This patch implements BH workqueues which share the same semantics and features of regular workqueues but execute their work items in the softirq context. As there is always only one BH execution context per CPU, none of the concurrency management mechanisms applies and a BH workqueue can be thought of as a convenience wrapper around softirq. Except for the inability to sleep while executing and lack of max_active adjustments, BH workqueues and work items should behave the same as regular workqueues and work items. Currently, the execution is hooked to tasklet[_hi]. However, the goal is to convert all tasklet users over to BH workqueues. Once the conversion is complete, tasklet can be removed and BH workqueues can directly take over the tasklet softirqs. system_bh[_highpri]_wq are added. As queue-wide flushing doesn't exist in tasklet, all existing tasklet users should be able to use the system BH workqueues without creating their own workqueues. v3: - Add missing interrupt.h include. v2: - Instead of using tasklets, hook directly into its softirq action functions - tasklet[_hi]_action(). This is slightly cheaper and closer to the eventual code structure we want to arrive at. Suggested by Lai. - Lai also pointed out several places which need NULL worker->task handling or can use clarification. Updated. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/CAHk-=wjDW53w4-YcSmgKC5RruiRLHmJ1sXeYdp_ZgVoBw=5byA@mail.gmail.com Tested-by: Allen Pais <allen.lkml@gmail.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-02-04 11:28:06 -10:00
Tejun Heo	2fcdb1b444	workqueue: Factor out init_cpu_worker_pool() Factor out init_cpu_worker_pool() from workqueue_init_early(). This is pure reorganization in preparation of BH workqueue support. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Allen Pais <allen.lkml@gmail.com>	2024-02-04 11:28:06 -10:00
Tejun Heo	c35aea39d1	workqueue: Update lock debugging code These changes are in preparation of BH workqueue which will execute work items from BH context. - Update lock and RCU depth checks in process_one_work() so that it remembers and checks against the starting depths and prints out the depth changes. - Factor out lockdep annotations in the flush paths into touch_{wq\|work}_lockdep_map(). The work->lockdep_map touching is moved from __flush_work() to its callee - start_flush_work(). This brings it closer to the wq counterpart and will allow testing the associated wq's flags which will be needed to support BH workqueues. This is not expected to cause any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Allen Pais <allen.lkml@gmail.com>	2024-02-04 11:28:06 -10:00
Ricardo B. Marliere	d412ace111	workqueue: make wq_subsys const Now that the driver core can properly handle constant struct bus_type, move the wq_subsys variable to be a constant structure as well, placing it into read-only memory which can not be modified at runtime. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-and-reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-02-04 11:23:25 -10:00
Tejun Heo	c70e1779b7	workqueue: Fix pwq->nr_in_flight corruption in try_to_grab_pending() `dd6c3c5441` ("workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling") relocated pwq_dec_nr_in_flight() after set_work_pool_and_keep_pending(). However, the latter destroys information contained in work->data that's needed by pwq_dec_nr_in_flight() including the flush color. With flush color destroyed, flush_workqueue() can stall easily when mixed with cancel_work*() usages. This is easily triggered by running xfstests generic/001 test on xfs: INFO: task umount:6305 blocked for more than 122 seconds. ... task:umount state:D stack:13008 pid:6305 tgid:6305 ppid:6301 flags:0x00004000 Call Trace: <TASK> __schedule+0x2f6/0xa20 schedule+0x36/0xb0 schedule_timeout+0x20b/0x280 wait_for_completion+0x8a/0x140 __flush_workqueue+0x11a/0x3b0 xfs_inodegc_flush+0x24/0xf0 xfs_unmountfs+0x14/0x180 xfs_fs_put_super+0x3d/0x90 generic_shutdown_super+0x7c/0x160 kill_block_super+0x1b/0x40 xfs_kill_sb+0x12/0x30 deactivate_locked_super+0x35/0x90 deactivate_super+0x42/0x50 cleanup_mnt+0x109/0x170 __cleanup_mnt+0x12/0x20 task_work_run+0x60/0x90 syscall_exit_to_user_mode+0x146/0x150 do_syscall_64+0x5d/0x110 entry_SYSCALL_64_after_hwframe+0x6c/0x74 Fix it by stashing work_data before calling set_work_pool_and_keep_pending() and using the stashed value for pwq_dec_nr_in_flight(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Chandan Babu R <chandanbabu@kernel.org> Link: http://lkml.kernel.org/r/87o7cxeehy.fsf@debian-BULLSEYE-live-builder-AMD64 Fixes: `dd6c3c5441` ("workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling")	2024-02-04 11:14:49 -10:00
Tejun Heo	c5f8cd6c62	workqueue: Avoid premature init of wq->node_nr_active[].max System workqueues are allocated early during boot from workqueue_init_early(). While allocating unbound workqueues, wq_update_node_max_active() is invoked from apply_workqueue_attrs() and accesses NUMA topology to initialize wq->node_nr_active[].max. However, topology information may not be set up at this point. wq_update_node_max_active() is explicitly invoked from workqueue_init_topology() later when topology information is known to be available. This doesn't seem to crash anything but it's doing useless work with dubious data. Let's skip the premature and duplicate node_max_active updates by initializing the field to WQ_DFL_MIN_ACTIVE on allocation and making wq_update_node_max_active() noop until workqueue_init_topology(). Signed-off-by: Tejun Heo <tj@kernel.org> --- kernel/workqueue.c \| 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 9221a4c57ae1..a65081ec6780 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -386,6 +386,8 @@ static const char wq_affn_names[WQ_AFFN_NR_TYPES] = { [WQ_AFFN_SYSTEM] = "system", }; +static bool wq_topo_initialized = false; + / * Per-cpu work items which run for longer than the following threshold are * automatically considered CPU intensive and excluded from concurrency @@ -1510,6 +1512,9 @@ static void wq_update_node_max_active(struct workqueue_struct wq, int off_cpu) lockdep_assert_held(&wq->mutex); + if (!wq_topo_initialized) + return; + if (!cpumask_test_cpu(off_cpu, effective)) off_cpu = -1; @@ -4356,6 +4361,7 @@ static void free_node_nr_active(struct wq_node_nr_active nna_ar) static void init_node_nr_active(struct wq_node_nr_active nna) { + nna->max = WQ_DFL_MIN_ACTIVE; atomic_set(&nna->nr, 0); raw_spin_lock_init(&nna->lock); INIT_LIST_HEAD(&nna->pending_pwqs); @@ -7400,6 +7406,8 @@ void __init workqueue_init_topology(void) init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache); init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa); + wq_topo_initialized = true; + mutex_lock(&wq_pool_mutex); /*	2024-01-30 19:17:00 -10:00
Tejun Heo	15930da42f	workqueue: Don't call cpumask_test_cpu() with -1 CPU in wq_update_node_max_active() For wq_update_node_max_active(), @off_cpu of -1 indicates that no CPU is going down. The function was incorrectly calling cpumask_test_cpu() with -1 CPU leading to oopses like the following on some archs: Unable to handle kernel paging request at virtual address ffff0002100296e0 .. pc : wq_update_node_max_active+0x50/0x1fc lr : wq_update_node_max_active+0x1f0/0x1fc ... Call trace: wq_update_node_max_active+0x50/0x1fc apply_wqattrs_commit+0xf0/0x114 apply_workqueue_attrs_locked+0x58/0xa0 alloc_workqueue+0x5ac/0x774 workqueue_init_early+0x460/0x540 start_kernel+0x258/0x684 __primary_switched+0xb8/0xc0 Code: 9100a273 35000d01 53067f00 d0016dc1 (f8607a60) ---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Attempted to kill the idle task! ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]--- Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Reported-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Link: http://lkml.kernel.org/r/91eacde0-df99-4d5c-a980-91046f66e612@samsung.com Fixes: `5797b1c189` ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues")	2024-01-30 18:55:55 -10:00
Leonardo Bras	aae17ebb53	workqueue: Avoid using isolated cpus' timers on queue_delayed_work When __queue_delayed_work() is called, it chooses a cpu for handling the timer interrupt. As of today, it will pick either the cpu passed as parameter or the last cpu used for this. This is not good if a system does use CPU isolation, because it can take away some valuable cpu time to: 1 - deal with the timer interrupt, 2 - schedule-out the desired task, 3 - queue work on a random workqueue, and 4 - schedule the desired task back to the cpu. So to fix this, during __queue_delayed_work(), if cpu isolation is in place, pick a random non-isolated cpu to handle the timer interrupt. As an optimization, if the current cpu is not isolated, use it instead of looking for another candidate. Signed-off-by: Leonardo Bras <leobras@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-29 15:21:37 -10:00
Tejun Heo	5797b1c189	workqueue: Implement system-wide nr_active enforcement for unbound workqueues A pool_workqueue (pwq) represents the connection between a workqueue and a worker_pool. One of the roles that a pwq plays is enforcement of the max_active concurrency limit. Before `636b927eba` ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues"), there was one pwq per each CPU for per-cpu workqueues and per each NUMA node for unbound workqueues, which was a natural result of per-cpu workqueues being served by per-cpu pools and unbound by per-NUMA pools. In terms of max_active enforcement, this was, while not perfect, workable. For per-cpu workqueues, it was fine. For unbound, it wasn't great in that NUMA machines would get max_active that's multiplied by the number of nodes but didn't cause huge problems because NUMA machines are relatively rare and the node count is usually pretty low. However, cache layouts are more complex now and sharing a worker pool across a whole node didn't really work well for unbound workqueues. Thus, a series of commits culminating on `8639ecebc9` ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") implemented more flexible affinity mechanism for unbound workqueues which enables using e.g. last-level-cache aligned pools. In the process, `636b927eba` ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") made unbound workqueues use per-cpu pwqs like per-cpu workqueues. While the change was necessary to enable more flexible affinity scopes, this came with the side effect of blowing up the effective max_active for unbound workqueues. Before, the effective max_active for unbound workqueues was multiplied by the number of nodes. After, by the number of CPUs. `636b927eba` ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") claims that this should generally be okay. It is okay for users which self-regulates concurrency level which are the vast majority; however, there are enough use cases which actually depend on max_active to prevent the level of concurrency from going bonkers including several IO handling workqueues that can issue a work item for each in-flight IO. With targeted benchmarks, the misbehavior can easily be exposed as reported in http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3. Unfortunately, there is no way to express what these use cases need using per-cpu max_active. A CPU may issue most of in-flight IOs, so we don't want to set max_active too low but as soon as we increase max_active a bit, we can end up with unreasonable number of in-flight work items when many CPUs issue IOs at the same time. ie. The acceptable lowest max_active is higher than the acceptable highest max_active. Ideally, max_active for an unbound workqueue should be system-wide so that the users can regulate the total level of concurrency regardless of node and cache layout. The reasons workqueue hasn't implemented that yet are: - One max_active enforcement decouples from pool boundaires, chaining execution after a work item finishes requires inter-pool operations which would require lock dancing, which is nasty. - Sharing a single nr_active count across the whole system can be pretty expensive on NUMA machines. - Per-pwq enforcement had been more or less okay while we were using per-node pools. It looks like we no longer can avoid decoupling max_active enforcement from pool boundaries. This patch implements system-wide nr_active mechanism with the following design characteristics: - To avoid sharing a single counter across multiple nodes, the configured max_active is split across nodes according to the proportion of each workqueue's online effective CPUs per node. e.g. A node with twice more online effective CPUs will get twice higher portion of max_active. - Workqueue used to be able to process a chain of interdependent work items which is as long as max_active. We can't do this anymore as max_active is distributed across the nodes. Instead, a new parameter min_active is introduced which determines the minimum level of concurrency within a node regardless of how max_active distribution comes out to be. It is set to the smaller of max_active and WQ_DFL_MIN_ACTIVE which is 8. This can lead to higher effective max_weight than configured and also deadlocks if a workqueue was depending on being able to handle chains of interdependent work items that are longer than 8. I believe these should be fine given that the number of CPUs in each NUMA node is usually higher than 8 and work item chain longer than 8 is pretty unlikely. However, if these assumptions turn out to be wrong, we'll need to add an interface to adjust min_active. - Each unbound wq has an array of struct wq_node_nr_active which tracks per-node nr_active. When its pwq wants to run a work item, it has to obtain the matching node's nr_active. If over the node's max_active, the pwq is queued on wq_node_nr_active->pending_pwqs. As work items finish, the completion path round-robins the pending pwqs activating the first inactive work item of each, which involves some pool lock dancing and kicking other pools. It's not the simplest code but doesn't look too bad. v4: - wq_adjust_max_active() updated to invoke wq_update_node_max_active(). - wq_adjust_max_active() is now protected by wq->mutex instead of wq_pool_mutex. v3: - wq_node_max_active() used to calculate per-node max_active on the fly based on system-wide CPU online states. Lai pointed out that this can lead to skewed distributions for workqueues with restricted cpumasks. Update the max_active distribution to use per-workqueue effective online CPU counts instead of system-wide and cache the calculation results in node_nr_active->max. v2: - wq->min/max_active now uses WRITE/READ_ONCE() as suggested by Lai. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Naohiro Aota <Naohiro.Aota@wdc.com> Link: http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3 Fixes: `636b927eba` ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-01-29 08:11:25 -10:00
Tejun Heo	91ccc6e723	workqueue: Introduce struct wq_node_nr_active Currently, for both percpu and unbound workqueues, max_active applies per-cpu, which is a recent change for unbound workqueues. The change for unbound workqueues was a significant departure from the previous behavior of per-node application. It made some use cases create undesirable number of concurrent work items and left no good way of fixing them. To address the problem, workqueue is implementing a NUMA node segmented global nr_active mechanism, which will be explained further in the next patch. As a preparation, this patch introduces struct wq_node_nr_active. It's a data structured allocated for each workqueue and NUMA node pair and currently only tracks the workqueue's number of active work items on the node. This is split out from the next patch to make it easier to understand and review. Note that there is an extra wq_node_nr_active allocated for the invalid node nr_node_ids which is used to track nr_active for pools which don't have NUMA node associated such as the default fallback system-wide pool. This doesn't cause any behavior changes visible to userland yet. The next patch will expand to implement the control mechanism on top. v4: - Fixed out-of-bound access when freeing per-cpu workqueues. v3: - Use flexible array for wq->node_nr_active as suggested by Lai. v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai. - Lai pointed out that pwq_tryinc_nr_active() incorrectly dropped pwq->max_active check. Restored. As the next patch replaces the max_active enforcement mechanism, this doesn't change the end result. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-01-29 08:11:24 -10:00
Tejun Heo	dd6c3c5441	workqueue: Move pwq_dec_nr_in_flight() to the end of work item handling The planned shared nr_active handling for unbound workqueues will make pwq_dec_nr_active() sometimes drop the pool lock temporarily to acquire other pool locks, which is necessary as retirement of an nr_active count from one pool may need kick off an inactive work item in another pool. This patch moves pwq_dec_nr_in_flight() call in try_to_grab_pending() to the end of work item handling so that work item state changes stay atomic. process_one_work() which is the other user of pwq_dec_nr_in_flight() already calls it at the end of work item handling. Comments are added to both call sites and pwq_dec_nr_in_flight(). This shouldn't cause any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-01-29 08:11:24 -10:00
Tejun Heo	9f66cff212	workqueue: RCU protect wq->dfl_pwq and implement accessors for it wq->cpu_pwq is RCU protected but wq->dfl_pwq isn't. This is okay because currently wq->dfl_pwq is used only accessed to install it into wq->cpu_pwq which doesn't require RCU access. However, we want to be able to access wq->dfl_pwq under RCU in the future to access its __pod_cpumask and the code can be made easier to read by making the two pwq fields behave in the same way. - Make wq->dfl_pwq RCU protected. - Add unbound_pwq_slot() and unbound_pwq() which can access both ->dfl_pwq and ->cpu_pwq. The former returns the double pointer that can be used access and update the pwqs. The latter performs locking check and dereferences the double pointer. - pwq accesses and updates are converted to use unbound_pwq[_slot](). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-01-29 08:11:24 -10:00
Tejun Heo	c5404d4e6d	workqueue: Make wq_adjust_max_active() round-robin pwqs while activating wq_adjust_max_active() needs to activate work items after max_active is increased. Previously, it did that by visiting each pwq once activating all that could be activated. While this makes sense with per-pwq nr_active, nr_active will be shared across multiple pwqs for unbound wqs. Then, we'd want to round-robin through pwqs to be fairer. In preparation, this patch makes wq_adjust_max_active() round-robin pwqs while activating. While the activation ordering changes, this shouldn't cause user-noticeable behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-01-29 08:11:24 -10:00
Tejun Heo	1c270b79ce	workqueue: Move nr_active handling into helpers __queue_work(), pwq_dec_nr_in_flight() and wq_adjust_max_active() were open-coding nr_active handling, which is fine given that the operations are trivial. However, the planned unbound nr_active update will make them more complicated, so let's move them into helpers. - pwq_tryinc_nr_active() is added. It increments nr_active if under max_active limit and return a boolean indicating whether inc was successful. Note that the function is structured to accommodate future changes. __queue_work() is updated to use the new helper. - pwq_activate_first_inactive() is updated to use pwq_tryinc_nr_active() and thus no longer assumes that nr_active is under max_active and returns a boolean to indicate whether a work item has been activated. - wq_adjust_max_active() no longer tests directly whether a work item can be activated. Instead, it's updated to use the return value of pwq_activate_first_inactive() to tell whether a work item has been activated. - nr_active decrement and activating the first inactive work item is factored into pwq_dec_nr_active(). v3: - WARN_ON_ONCE(!WORK_STRUCT_INACTIVE) added to __pwq_activate_work() as now we're calling the function unconditionally from pwq_activate_first_inactive(). v2: - wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-01-29 08:11:24 -10:00
Tejun Heo	4c6380305d	workqueue: Replace pwq_activate_inactive_work() with [__]pwq_activate_work() To prepare for unbound nr_active handling improvements, move work activation part of pwq_activate_inactive_work() into __pwq_activate_work() and add pwq_activate_work() which tests WORK_STRUCT_INACTIVE and updates nr_active. pwq_activate_first_inactive() and try_to_grab_pending() are updated to use pwq_activate_work(). The latter conversion is functionally identical. For the former, this conversion adds an unnecessary WORK_STRUCT_INACTIVE testing. This is temporary and will be removed by the next patch. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-01-29 08:11:24 -10:00
Tejun Heo	afa87ce853	workqueue: Factor out pwq_is_empty() "!pwq->nr_active && list_empty(&pwq->inactive_works)" test is repeated multiple times. Let's factor it out into pwq_is_empty(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-01-29 08:11:24 -10:00
Tejun Heo	a045a272d8	workqueue: Move pwq->max_active to wq->max_active max_active is a workqueue-wide setting and the configured value is stored in wq->saved_max_active; however, the effective value was stored in pwq->max_active. While this is harmless, it makes max_active update process more complicated and gets in the way of the planned max_active semantic updates for unbound workqueues. This patches moves pwq->max_active to wq->max_active. This simplifies the code and makes freezing and noop max_active updates cheaper too. No user-visible behavior change is intended. As wq->max_active is updated while holding wq mutex but read without any locking, it now uses WRITE/READ_ONCE(). A new locking locking rule WO is added for it. v2: wq->max_active now uses WRITE/READ_ONCE() as suggested by Lai. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>	2024-01-29 08:11:24 -10:00
Tejun Heo	e563d0a7cd	workqueue: Break up enum definitions and give names to the types workqueue is collecting different sorts of enums into a single unnamed enum type which can increase confusion around enum width. Also, unnamed enums can't be accessed from BPF. Let's break up enum definitions according to their purposes and give them type names. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-26 11:55:50 -10:00
Tejun Heo	6a229b0e2f	workqueue: Drop unnecessary kick_pool() in create_worker() After creating a new worker, create_worker() is calling kick_pool() to wake up the new worker task. However, as kick_pool() doesn't do anything if there is no work pending, it also calls wake_up_process() explicitly. There's no reason to call kick_pool() at all. wake_up_process() is enough by itself. Drop the unnecessary kick_pool() call. Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-26 11:55:46 -10:00
Audra Mitchell	8318d6a636	workqueue: Shorten events_freezable_power_efficient name Since we have set the WQ_NAME_LEN to 32, decrease the name of events_freezable_power_efficient so that it does not trip the name length warning when the workqueue is created. Signed-off-by: Audra Mitchell <audra@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-25 09:11:40 -10:00
Marcelo Tosatti	7bd20b6b87	workqueue: mark power efficient workqueue as unbounded if nohz_full enabled A customer using nohz_full has experienced the following interruption: oslat-1004510 [018] timer_cancel: timer=0xffff90a7ca663cf8 oslat-1004510 [018] timer_expire_entry: timer=0xffff90a7ca663cf8 function=delayed_work_timer_fn now=4709188240 baseclk=4709188240 oslat-1004510 [018] workqueue_queue_work: work struct=0xffff90a7ca663cd8 function=fb_flashcursor workqueue=events_power_efficient req_cpu=8192 cpu=18 oslat-1004510 [018] workqueue_activate_work: work struct 0xffff90a7ca663cd8 oslat-1004510 [018] sched_wakeup: kworker/18:1:326 [120] CPU:018 oslat-1004510 [018] timer_expire_exit: timer=0xffff90a7ca663cf8 oslat-1004510 [018] irq_work_entry: vector=246 oslat-1004510 [018] irq_work_exit: vector=246 oslat-1004510 [018] tick_stop: success=0 dependency=SCHED oslat-1004510 [018] hrtimer_start: hrtimer=0xffff90a70009cb00 function=tick_sched_timer/0x0 ... oslat-1004510 [018] softirq_exit: vec=1 [action=TIMER] oslat-1004510 [018] softirq_entry: vec=7 [action=SCHED] oslat-1004510 [018] softirq_exit: vec=7 [action=SCHED] oslat-1004510 [018] tick_stop: success=0 dependency=SCHED oslat-1004510 [018] sched_switch: oslat:1004510 [120] R ==> kworker/18:1:326 [120] kworker/18:1-326 [018] workqueue_execute_start: work struct 0xffff90a7ca663cd8: function fb_flashcursor kworker/18:1-326 [018] workqueue_queue_work: work struct=0xffff9078f119eed0 function=drm_fb_helper_damage_work workqueue=events req_cpu=8192 cpu=18 kworker/18:1-326 [018] workqueue_activate_work: work struct 0xffff9078f119eed0 kworker/18:1-326 [018] timer_start: timer=0xffff90a7ca663cf8 function=delayed_work_timer_fn ... Set wq_power_efficient to true, in case nohz_full is enabled. This makes the power efficient workqueue be unbounded, which allows workqueue items there to be moved to HK CPUs. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-19 13:55:47 -10:00
Xuewen Yan	1a65a6d17c	workqueue: Add rcu lock check at the end of work item execution Currently the workqueue just checks the atomic and locking states after work execution ends. However, sometimes, a work item may not unlock rcu after acquiring rcu_read_lock(). And as a result, it would cause rcu stall, but the rcu stall warning can not dump the work func, because the work has finished. In order to quickly discover those works that do not call rcu_read_unlock() after rcu_read_lock(), add the rcu lock check. Use rcu_preempt_depth() to check the work's rcu status. Normally, this value is 0. If this value is bigger than 0, it means the work are still holding rcu lock. If so, print err info and the work func. tj: Reworded the description for clarity. Minor formatting tweak. Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-16 10:20:44 -10:00
Juri Lelli	85f0ab43f9	kernel/workqueue: Bind rescuer to unbound cpumask for WQ_UNBOUND At the time they are created unbound workqueues rescuers currently use cpu_possible_mask as their affinity, but this can be too wide in case a workqueue unbound mask has been set as a subset of cpu_possible_mask. Make new rescuers use their associated workqueue unbound cpumask from the start. Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-16 08:47:30 -10:00
Audra Mitchell	31c8900728	workqueue.c: Increase workqueue name length Currently we limit the size of the workqueue name to 24 characters due to commit `ecf6881ff3` ("workqueue: make workqueue->name[] fixed len") Increase the size to 32 characters and print a warning in the event the requested name is larger than the limit of 32 characters. Signed-off-by: Audra Mitchell <audra@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2024-01-16 08:31:24 -10:00
Linus Torvalds	23a80d462c	RCU pull request for v6.8 This pull request contains the following branches: doc.2023.12.13a: Documentation and comment updates. torture.2023.11.23a: RCU torture, locktorture updates that include cleanups; nolibc init build support for mips, ppc and rv64; testing of mid stall duration scenario and fixing fqs task creation conditions. fixes.2023.12.13a: Misc fixes, most notably restricting usage of RCU CPU stall notifiers, to confine their usage primarily to debug kernels. rcu-tasks.2023.12.12b: RCU tasks minor fixes. srcu.2023.12.13a: lockdep annotation fix for NMI-safe accesses, callback advancing/acceleration cleanup and documentation improvements. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSi2tPIQIc2VEtjarIAHS7/6Z0wpQUCZYUS0AAKCRAAHS7/6Z0w pRXgAQD+k8oqjvKL6la61ppWm5Y7NLjdj/IbV+cOd42jKnM6PAEAyavNhX0n7zGx o9cDlvIDxJfHnFrOTc5WLH9yEs3IiQQ= =8rdu -----END PGP SIGNATURE----- Merge tag 'rcu.release.v6.8' of https://github.com/neeraju/linux Pull RCU updates from Neeraj Upadhyay: - Documentation and comment updates - RCU torture, locktorture updates that include cleanups; nolibc init build support for mips, ppc and rv64; testing of mid stall duration scenario and fixing fqs task creation conditions - Misc fixes, most notably restricting usage of RCU CPU stall notifiers, to confine their usage primarily to debug kernels - RCU tasks minor fixes - lockdep annotation fix for NMI-safe accesses, callback advancing/acceleration cleanup and documentation improvements * tag 'rcu.release.v6.8' of https://github.com/neeraju/linux: rcu: Force quiescent states only for ongoing grace period doc: Clarify historical disclaimers in memory-barriers.txt doc: Mention address and data dependencies in rcu_dereference.rst doc: Clarify RCU Tasks reader/updater checklist rculist.h: docs: Fix wrong function summary Documentation: RCU: Remove repeated word in comments srcu: Use try-lock lockdep annotation for NMI-safe access. srcu: Explain why callbacks invocations can't run concurrently srcu: No need to advance/accelerate if no callback enqueued srcu: Remove superfluous callbacks advancing from srcu_gp_start() rcu: Remove unused macros from rcupdate.h rcu: Restrict access to RCU CPU stall notifiers rcu-tasks: Mark RCU Tasks accesses to current->rcu_tasks_idle_cpu rcutorture: Add fqs_holdoff check before fqs_task is created rcutorture: Add mid-sized stall to TREE07 rcutorture: add nolibc init support for mips, ppc and rv64 locktorture: Increase Hamming distance between call_rcu_chain and rcu_call_chains	2024-01-12 16:35:58 -08:00
Linus Torvalds	893e2f9eac	dma-mapping updates for Linux 6.8 - reduce area lock contention for non-primary IO TLB pools (Petr Tesarik) - don't store redundant offsets in the dma_ranges stuctures (Robin Murphy) - clear dev->dma_mem when freeing per-device pools (Joakim Zhang) -----BEGIN PGP SIGNATURE----- iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmWcGxQLHGhjaEBsc3Qu ZGUACgkQD55TZVIEUYNf6hAAi9wP1ehnIqHCczCxpideyJnE76r+LgjInjudQUqE cnpl7E+dO/e/7Trk+L7hIrzi5uz8m9e+DZgL9wUY4h5mvJ+8ELet3Ec62UMVL1g0 cWSYOtlMdUZn9Oy+qy2TTCa//1HyzcWQdplVwcOqD7zCLO4PavUR18+Vw5eDUpBR TE8EB+7P8ta8XSFnsryZS4zI1AhTTjZfh8ZgPdp+niBh7XAqOFNn3WiGK4qvA9o6 nIjIV6ydBjZYkyYPeDsqszqmZG64mEeGUZhLWmjAyg1/c8so7uFviNfJ05od34js aWpmFxrM9Mm4BaBiU3FsSQkMBGCGaD/H2UXjIl1Qayt+pzUfaP7+8UWWo/T7Mj35 RFKe9xzlPY8rqOszdBBvy6lCWguHXw4d4IFoqOz+YoUaxlV+RAbFOCHtW2BNvtPe b1YCr/FKNQ8NxsJWnbcehDtClY461pqBbaDrio3K7eTJgG10biAoWBfPhV+5VEer aB14krQcn7v1vXjfLu2huSrPt1ZjXuWVfXA3nO3Mt3VWxZWat82gLkFyt5N6ZfQ4 juaDMX3Vzlz3VPf4MHFC+yFRx55b/9X26lC1BlSoo4tAknoo746Lvy/PasZarILC sGPt+2BVlPQ466zkjky4GtoNof1TMNuPF0Xr/mNCEjxCYUGQcTIdmOvA9y8mu/V+ 7M4= =v5F+ -----END PGP SIGNATURE----- Merge tag 'dma-mapping-6.8-2024-01-08' of git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping updates from Christoph Hellwig: - reduce area lock contention for non-primary IO TLB pools (Petr Tesarik) - don't store redundant offsets in the dma_ranges stuctures (Robin Murphy) - clear dev->dma_mem when freeing per-device pools (Joakim Zhang) * tag 'dma-mapping-6.8-2024-01-08' of git://git.infradead.org/users/hch/dma-mapping: dma-mapping: clear dev->dma_mem to NULL after freeing it swiotlb: reduce area lock contention for non-primary IO TLB pools dma-mapping: don't store redundant offsets	2024-01-11 13:46:50 -08:00
Linus Torvalds	3e7aeb78ab	Networking changes for 6.8. Core & protocols ---------------- - Analyze and reorganize core networking structs (socks, netdev, netns, mibs) to optimize cacheline consumption and set up build time warnings to safeguard against future header changes. This improves TCP performances with many concurrent connections up to 40%. - Add page-pool netlink-based introspection, exposing the memory usage and recycling stats. This helps indentify bad PP users and possible leaks. - Refine TCP/DCCP source port selection to no longer favor even source port at connect() time when IP_LOCAL_PORT_RANGE is set. This lowers the time taken by connect() for hosts having many active connections to the same destination. - Refactor the TCP bind conflict code, shrinking related socket structs. - Refactor TCP SYN-Cookie handling, as a preparation step to allow arbitrary SYN-Cookie processing via eBPF. - Tune optmem_max for 0-copy usage, increasing the default value to 128KB and namespecifying it. - Allow coalescing for cloned skbs coming from page pools, improving RX performances with some common configurations. - Reduce extension header parsing overhead at GRO time. - Add bridge MDB bulk deletion support, allowing user-space to request the deletion of matching entries. - Reorder nftables struct members, to keep data accessed by the datapath first. - Introduce TC block ports tracking and use. This allows supporting multicast-like behavior at the TC layer. - Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and classifiers (RSVP and tcindex). - More data-race annotations. - Extend the diag interface to dump TCP bound-only sockets. - Conditional notification of events for TC qdisc class and actions. - Support for WPAN dynamic associations with nearby devices, to form a sub-network using a specific PAN ID. - Implement SMCv2.1 virtual ISM device support. - Add support for Batman-avd mulicast packet type. BPF --- - Tons of verifier improvements: - BPF register bounds logic and range support along with a large test suite - log improvements - complete precision tracking support for register spills - track aligned STACK_ZERO cases as imprecise spilled registers. It improves the verifier "instructions processed" metric from single digit to 50-60% for some programs - support for user's global BPF subprogram arguments with few commonly requested annotations for a better developer experience - support tracking of BPF_JNE which helps cases when the compiler transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the like - several fixes - Add initial TX metadata implementation for AF_XDP with support in mlx5 and stmmac drivers. Two types of offloads are supported right now, that is, TX timestamp and TX checksum offload. - Fix kCFI bugs in BPF all forms of indirect calls from BPF into kernel and from kernel into BPF work with CFI enabled. This allows BPF to work with CONFIG_FINEIBT=y. - Change BPF verifier logic to validate global subprograms lazily instead of unconditionally before the main program, so they can be guarded using BPF CO-RE techniques. - Support uid/gid options when mounting bpffs. - Add a new kfunc which acquires the associated cgroup of a task within a specific cgroup v1 hierarchy where the latter is identified by its id. - Extend verifier to allow bpf_refcount_acquire() of a map value field obtained via direct load which is a use-case needed in sched_ext. - Add BPF link_info support for uprobe multi link along with bpftool integration for the latter. - Support for VLAN tag in XDP hints. - Remove deprecated bpfilter kernel leftovers given the project is developed in user-space (https://github.com/facebook/bpfilter). Misc ---- - Support for parellel TC self-tests execution. - Increase MPTCP self-tests coverage. - Updated the bridge documentation, including several so-far undocumented features. - Convert all the net self-tests to run in unique netns, to avoid random failures due to conflict and allow concurrent runs. - Add TCP-AO self-tests. - Add kunit tests for both cfg80211 and mac80211. - Autogenerate Netlink families documentation from YAML spec. - Add yml-gen support for fixed headers and recursive nests, the tool can now generate user-space code for all genetlink families for which we have specs. - A bunch of additional module descriptions fixes. - Catch incorrect freeing of pages belonging to a page pool. Driver API ---------- - Rust abstractions for network PHY drivers; do not cover yet the full C API, but already allow implementing functional PHY drivers in rust. - Introduce queue and NAPI support in the netdev Netlink interface, allowing complete access to the device <> NAPIs <> queues relationship. - Introduce notifications filtering for devlink to allow control application scale to thousands of instances. - Improve PHY validation, requesting rate matching information for each ethtool link mode supported by both the PHY and host. - Add support for ethtool symmetric-xor RSS hash. - ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD platform. - Expose pin fractional frequency offset value over new DPLL generic netlink attribute. - Convert older drivers to platform remove callback returning void. - Add support for PHY package MMD read/write. New hardware / drivers ---------------------- - Ethernet: - Octeon CN10K devices - Broadcom 5760X P7 - Qualcomm SM8550 SoC - Texas Instrument DP83TG720S PHY - Bluetooth: - IMC Networks Bluetooth radio Removed ------- - WiFi: - libertas 16-bit PCMCIA support - Atmel at76c50x drivers - HostAP ISA/PCMCIA style 802.11b driver - zd1201 802.11b USB dongles - Orinoco ISA/PCMCIA 802.11b driver - Aviator/Raytheon driver - Planet WL3501 driver - RNDIS USB 802.11b driver Drivers ------- - Ethernet high-speed NICs: - Intel (100G, ice, idpf): - allow one by one port representors creation and removal - add temperature and clock information reporting - add get/set for ethtool's header split ringparam - add again FW logging - adds support switchdev hardware packet mirroring - iavf: implement symmetric-xor RSS hash - igc: add support for concurrent physical and free-running timers - i40e: increase the allowable descriptors - nVidia/Mellanox: - Preparation for Socket-Direct multi-dev netdev. That will allow in future releases combining multiple PFs devices attached to different NUMA nodes under the same netdev - Broadcom (bnxt): - TX completion handling improvements - add basic ntuple filter support - reduce MSIX vectors usage for MQPRIO offload - add VXLAN support, USO offload and TX coalesce completion for P7 - Marvell Octeon EP: - xmit-more support - add PF-VF mailbox support and use it for FW notifications for VFs - Wangxun (ngbe/txgbe): - implement ethtool functions to operate pause param, ring param, coalesce channel number and msglevel - Netronome/Corigine (nfp): - add flow-steering support - support UDP segmentation offload - Ethernet NICs embedded, slower, virtual: - Xilinx AXI: remove duplicate DMA code adopting the dma engine driver - stmmac: add support for HW-accelerated VLAN stripping - TI AM654x sw: add mqprio, frame preemption & coalescing - gve: add support for non-4k page sizes. - virtio-net: support dynamic coalescing moderation - nVidia/Mellanox Ethernet datacenter switches: - allow firmware upgrade without a reboot - more flexible support for bridge flooding via the compressed FID flooding mode - Ethernet embedded switches: - Microchip: - fine-tune flow control and speed configurations in KSZ8xxx - KSZ88X3: enable setting rmii reference - Renesas: - add jumbo frames support - Marvell: - 88E6xxx: add "eth-mac" and "rmon" stats support - Ethernet PHYs: - aquantia: add firmware load support - at803x: refactor the driver to simplify adding support for more chip variants - NXP C45 TJA11xx: Add MACsec offload support - Wifi: - MediaTek (mt76): - NVMEM EEPROM improvements - mt7996 Extremely High Throughput (EHT) improvements - mt7996 Wireless Ethernet Dispatcher (WED) support - mt7996 36-bit DMA support - Qualcomm (ath12k): - support for a single MSI vector - WCN7850: support AP mode - Intel (iwlwifi): - new debugfs file fw_dbg_clear - allow concurrent P2P operation on DFS channels - Bluetooth: - QCA2066: support HFP offload - ISO: more broadcast-related improvements - NXP: better recovery in case receiver/transmitter get out of sync Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmWdamsSHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOkGC4P/2xjLzdw22ckSssuE9ORbGko9SNjnqHk PQh1E+26BHiCg5KB8VvzMsL78E79MRNXEattSW+1g7dhCvln3oi+Vd0WkdRkgt35 98Iv18zLbbwFAJeyKvmLAPAkQkMLtVj19QILBBRrugF+egEZgVSE3JBcTAiKv2ZQ HzkabA171Ri6LpCcEEtY5XuaKvimGnGzF8YMFf8rX0wtqd2p5kbY9aMe47WAGxvU Vf9548XvH+A5yVH2/4/gujtUOpA/RHuhuCMb+oo0cZ+VCC1x9MGzoXzj6r87OTkf k2W1whNzcGoin92f+9Lk1JYMuiGKBH4QVaDdNXJnYFSJWPTE7RvRsPzYTSD4/GzK yEZbzSJXpy/2vDQm16NoAxl7evRs8Sorzkw4LQRviZHI/5SAkK2ZQiCK5CO8QSYy C1LELcV5kn6Foe24xWnrWLjAGug9oJnYoGPMU5gvPmFJMvUMXqm5rmbBgUWL5Rxw q1M6gVzabCyWUy6z2G2vaqW2ZntNVvCkdsLtIX0XZkcTzNoP0MA+TuhyGz4wbiuo PeyQp/mbGnDgCYggqKIA0YWrTVxkhFrKN520cbO8qXBQytV9oFbM/0/+C0/r/5WX pL1JVzLrh6l5ME7EIQfha8UOF9j8q4ueSwb40P3AR2NaZiDABM0zfUZ6+sx+91WF ucqPEcZB5cRE =1bW6 -----END PGP SIGNATURE----- Merge tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "The most interesting thing is probably the networking structs reorganization and a significant amount of changes is around self-tests. Core & protocols: - Analyze and reorganize core networking structs (socks, netdev, netns, mibs) to optimize cacheline consumption and set up build time warnings to safeguard against future header changes This improves TCP performances with many concurrent connections up to 40% - Add page-pool netlink-based introspection, exposing the memory usage and recycling stats. This helps indentify bad PP users and possible leaks - Refine TCP/DCCP source port selection to no longer favor even source port at connect() time when IP_LOCAL_PORT_RANGE is set. This lowers the time taken by connect() for hosts having many active connections to the same destination - Refactor the TCP bind conflict code, shrinking related socket structs - Refactor TCP SYN-Cookie handling, as a preparation step to allow arbitrary SYN-Cookie processing via eBPF - Tune optmem_max for 0-copy usage, increasing the default value to 128KB and namespecifying it - Allow coalescing for cloned skbs coming from page pools, improving RX performances with some common configurations - Reduce extension header parsing overhead at GRO time - Add bridge MDB bulk deletion support, allowing user-space to request the deletion of matching entries - Reorder nftables struct members, to keep data accessed by the datapath first - Introduce TC block ports tracking and use. This allows supporting multicast-like behavior at the TC layer - Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and classifiers (RSVP and tcindex) - More data-race annotations - Extend the diag interface to dump TCP bound-only sockets - Conditional notification of events for TC qdisc class and actions - Support for WPAN dynamic associations with nearby devices, to form a sub-network using a specific PAN ID - Implement SMCv2.1 virtual ISM device support - Add support for Batman-avd mulicast packet type BPF: - Tons of verifier improvements: - BPF register bounds logic and range support along with a large test suite - log improvements - complete precision tracking support for register spills - track aligned STACK_ZERO cases as imprecise spilled registers. This improves the verifier "instructions processed" metric from single digit to 50-60% for some programs - support for user's global BPF subprogram arguments with few commonly requested annotations for a better developer experience - support tracking of BPF_JNE which helps cases when the compiler transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the like - several fixes - Add initial TX metadata implementation for AF_XDP with support in mlx5 and stmmac drivers. Two types of offloads are supported right now, that is, TX timestamp and TX checksum offload - Fix kCFI bugs in BPF all forms of indirect calls from BPF into kernel and from kernel into BPF work with CFI enabled. This allows BPF to work with CONFIG_FINEIBT=y - Change BPF verifier logic to validate global subprograms lazily instead of unconditionally before the main program, so they can be guarded using BPF CO-RE techniques - Support uid/gid options when mounting bpffs - Add a new kfunc which acquires the associated cgroup of a task within a specific cgroup v1 hierarchy where the latter is identified by its id - Extend verifier to allow bpf_refcount_acquire() of a map value field obtained via direct load which is a use-case needed in sched_ext - Add BPF link_info support for uprobe multi link along with bpftool integration for the latter - Support for VLAN tag in XDP hints - Remove deprecated bpfilter kernel leftovers given the project is developed in user-space (https://github.com/facebook/bpfilter) Misc: - Support for parellel TC self-tests execution - Increase MPTCP self-tests coverage - Updated the bridge documentation, including several so-far undocumented features - Convert all the net self-tests to run in unique netns, to avoid random failures due to conflict and allow concurrent runs - Add TCP-AO self-tests - Add kunit tests for both cfg80211 and mac80211 - Autogenerate Netlink families documentation from YAML spec - Add yml-gen support for fixed headers and recursive nests, the tool can now generate user-space code for all genetlink families for which we have specs - A bunch of additional module descriptions fixes - Catch incorrect freeing of pages belonging to a page pool Driver API: - Rust abstractions for network PHY drivers; do not cover yet the full C API, but already allow implementing functional PHY drivers in rust - Introduce queue and NAPI support in the netdev Netlink interface, allowing complete access to the device <> NAPIs <> queues relationship - Introduce notifications filtering for devlink to allow control application scale to thousands of instances - Improve PHY validation, requesting rate matching information for each ethtool link mode supported by both the PHY and host - Add support for ethtool symmetric-xor RSS hash - ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD platform - Expose pin fractional frequency offset value over new DPLL generic netlink attribute - Convert older drivers to platform remove callback returning void - Add support for PHY package MMD read/write New hardware / drivers: - Ethernet: - Octeon CN10K devices - Broadcom 5760X P7 - Qualcomm SM8550 SoC - Texas Instrument DP83TG720S PHY - Bluetooth: - IMC Networks Bluetooth radio Removed: - WiFi: - libertas 16-bit PCMCIA support - Atmel at76c50x drivers - HostAP ISA/PCMCIA style 802.11b driver - zd1201 802.11b USB dongles - Orinoco ISA/PCMCIA 802.11b driver - Aviator/Raytheon driver - Planet WL3501 driver - RNDIS USB 802.11b driver Driver updates: - Ethernet high-speed NICs: - Intel (100G, ice, idpf): - allow one by one port representors creation and removal - add temperature and clock information reporting - add get/set for ethtool's header split ringparam - add again FW logging - adds support switchdev hardware packet mirroring - iavf: implement symmetric-xor RSS hash - igc: add support for concurrent physical and free-running timers - i40e: increase the allowable descriptors - nVidia/Mellanox: - Preparation for Socket-Direct multi-dev netdev. That will allow in future releases combining multiple PFs devices attached to different NUMA nodes under the same netdev - Broadcom (bnxt): - TX completion handling improvements - add basic ntuple filter support - reduce MSIX vectors usage for MQPRIO offload - add VXLAN support, USO offload and TX coalesce completion for P7 - Marvell Octeon EP: - xmit-more support - add PF-VF mailbox support and use it for FW notifications for VFs - Wangxun (ngbe/txgbe): - implement ethtool functions to operate pause param, ring param, coalesce channel number and msglevel - Netronome/Corigine (nfp): - add flow-steering support - support UDP segmentation offload - Ethernet NICs embedded, slower, virtual: - Xilinx AXI: remove duplicate DMA code adopting the dma engine driver - stmmac: add support for HW-accelerated VLAN stripping - TI AM654x sw: add mqprio, frame preemption & coalescing - gve: add support for non-4k page sizes. - virtio-net: support dynamic coalescing moderation - nVidia/Mellanox Ethernet datacenter switches: - allow firmware upgrade without a reboot - more flexible support for bridge flooding via the compressed FID flooding mode - Ethernet embedded switches: - Microchip: - fine-tune flow control and speed configurations in KSZ8xxx - KSZ88X3: enable setting rmii reference - Renesas: - add jumbo frames support - Marvell: - 88E6xxx: add "eth-mac" and "rmon" stats support - Ethernet PHYs: - aquantia: add firmware load support - at803x: refactor the driver to simplify adding support for more chip variants - NXP C45 TJA11xx: Add MACsec offload support - Wifi: - MediaTek (mt76): - NVMEM EEPROM improvements - mt7996 Extremely High Throughput (EHT) improvements - mt7996 Wireless Ethernet Dispatcher (WED) support - mt7996 36-bit DMA support - Qualcomm (ath12k): - support for a single MSI vector - WCN7850: support AP mode - Intel (iwlwifi): - new debugfs file fw_dbg_clear - allow concurrent P2P operation on DFS channels - Bluetooth: - QCA2066: support HFP offload - ISO: more broadcast-related improvements - NXP: better recovery in case receiver/transmitter get out of sync" * tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1714 commits) lan78xx: remove redundant statement in lan78xx_get_eee lan743x: remove redundant statement in lan743x_ethtool_get_eee bnxt_en: Fix RCU locking for ntuple filters in bnxt_rx_flow_steer() bnxt_en: Fix RCU locking for ntuple filters in bnxt_srxclsrldel() bnxt_en: Remove unneeded variable in bnxt_hwrm_clear_vnic_filter() tcp: Revert no longer abort SYN_SENT when receiving some ICMP Revert "mlx5 updates 2023-12-20" Revert "net: stmmac: Enable Per DMA Channel interrupt" ipvlan: Remove usage of the deprecated ida_simple_xx() API ipvlan: Fix a typo in a comment net/sched: Remove ipt action tests net: stmmac: Use interrupt mode INTM=1 for per channel irq net: stmmac: Add support for TX/RX channel interrupt net: stmmac: Make MSI interrupt routine generic dt-bindings: net: snps,dwmac: per channel irq net: phy: at803x: make read_status more generic net: phy: at803x: add support for cdt cross short test for qca808x net: phy: at803x: refactor qca808x cable test get status function net: phy: at803x: generalize cdt fault length function net: ethernet: cortina: Drop TSO support ...	2024-01-11 10:07:29 -08:00

1 2 3 4 5 ...

43707 Commits