From e3dddcfd3dd8b483c9ccaa06733688bb63bb7c9d Mon Sep 17 00:00:00 2001
From: Chen Ridong
Date: Tue, 8 Oct 2024 11:24:57 +0000
Subject: [PATCH 1/3] workqueue: doc: Add a note that saturating the system_wq
 is not permitted

If something is expected to generate a large number of concurrent
works, it should utilize its own dedicated workqueue rather than a
system wq, because it may saturate the system_wq and potentially block
others' works, e.g. the cgroup release work. Let's document this as a
note.

Signed-off-by: Chen Ridong
Signed-off-by: Tejun Heo
---
 Documentation/core-api/workqueue.rst | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 16f861c9791e..2b813f80ce39 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -357,6 +357,11 @@ Guidelines
   difference in execution characteristics between using a dedicated wq
   and a system wq.
 
+  Note: If something may generate more than @max_active outstanding
+  work items (do stress test your producers), it may saturate a system
+  wq and potentially lead to deadlock. It should utilize its own
+  dedicated workqueue rather than the system wq.
+
 * Unless work items are expected to consume a huge amount of CPU
   cycles, using a bound wq is usually beneficial due to the increased
   level of locality in wq operations and work item execution.
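[Editorial illustration, not part of the series: the note above recommends a
dedicated workqueue for producers that may exceed @max_active. A minimal
sketch of that pattern using the stock workqueue API follows; the frob_*
names and module boilerplate are hypothetical, and max_active is left at 0
to take the default.]

	#include <linux/module.h>
	#include <linux/slab.h>
	#include <linux/workqueue.h>

	static struct workqueue_struct *frob_wq;	/* dedicated wq, not system_wq */

	struct frob_work {
		struct work_struct work;
		int payload;
	};

	static void frob_work_fn(struct work_struct *work)
	{
		struct frob_work *fw = container_of(work, struct frob_work, work);

		/* ... potentially slow processing of fw->payload ... */
		kfree(fw);
	}

	/* Called by the (hypothetical) producer: queue on the dedicated wq
	 * instead of schedule_work(), which would use system_wq. */
	static void frob_submit(int payload)
	{
		struct frob_work *fw = kzalloc(sizeof(*fw), GFP_KERNEL);

		if (!fw)
			return;
		fw->payload = payload;
		INIT_WORK(&fw->work, frob_work_fn);
		queue_work(frob_wq, &fw->work);
	}

	static int __init frob_init(void)
	{
		/* max_active = 0 selects the default limit (WQ_DFL_ACTIVE). */
		frob_wq = alloc_workqueue("frob_wq", WQ_UNBOUND, 0);
		if (!frob_wq)
			return -ENOMEM;
		return 0;
	}

	static void __exit frob_exit(void)
	{
		destroy_workqueue(frob_wq);	/* drains remaining work items */
	}

	module_init(frob_init);
	module_exit(frob_exit);
	MODULE_LICENSE("GPL");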
From 581434654e01ec79dd02c21448ac84e2ce2d1a64 Mon Sep 17 00:00:00 2001
From: Chen Ridong
Date: Tue, 8 Oct 2024 11:24:58 +0000
Subject: [PATCH 2/3] workqueue: Adjust WQ_MAX_ACTIVE from 512 to 2048

WQ_MAX_ACTIVE is currently set to 512, a limit established approximately
15 years ago. With the significant increase in machine sizes and
capabilities since then, the previous default of 256 concurrent work
items is no longer sufficient. Therefore, increase WQ_MAX_ACTIVE to
2048; WQ_DFL_ACTIVE, defined as half of it, becomes 1024.

Signed-off-by: Chen Ridong
Signed-off-by: Tejun Heo
---
 Documentation/core-api/workqueue.rst | 4 ++--
 include/linux/workqueue.h            | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 2b813f80ce39..e295835fc116 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -245,8 +245,8 @@ CPU which can be assigned to the work items of a wq. For example, with
 at the same time per CPU. This is always a per-CPU attribute, even for
 unbound workqueues.
 
-The maximum limit for ``@max_active`` is 512 and the default value used
-when 0 is specified is 256. These values are chosen sufficiently high
+The maximum limit for ``@max_active`` is 2048 and the default value used
+when 0 is specified is 1024. These values are chosen sufficiently high
 such that they are not the limiting factor while providing protection
 in runaway cases.
 
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 59c2695e12e7..b0dc957c3e56 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -412,7 +412,7 @@ enum wq_flags {
 };
 
 enum wq_consts {
-	WQ_MAX_ACTIVE		= 512,	  /* I like 512, better ideas? */
+	WQ_MAX_ACTIVE		= 2048,	  /* I like 2048, better ideas? */
 	WQ_UNBOUND_MAX_ACTIVE	= WQ_MAX_ACTIVE,
 	WQ_DFL_ACTIVE		= WQ_MAX_ACTIVE / 2,
 

From 85f0d8e39affb7b88401b1e0542230a7af985b96 Mon Sep 17 00:00:00 2001
From: Wangyang Guo
Date: Fri, 15 Nov 2024 13:49:36 +0800
Subject: [PATCH 3/3] workqueue: Reduce expensive locks for unbound workqueue

For an unbound workqueue, pwqs usually map to just a few pools. Most of
the time, pwqs are linked sequentially to the wq->pwqs list by CPU
index, and consecutive CPUs usually share the same workqueue attributes
(e.g. they belong to the same NUMA node), so pwqs backed by the same
pool cluster together in the pwq list.

Only do lock/unlock in flush_workqueue_prep_pwqs() if the pool has
changed. This reduces the number of expensive lock operations.

The performance data shows this change boosts FIO by 65x in some cases
when multiple concurrent threads write to xfs mount points with fsync.

FIO Benchmark Details
- FIO version: v3.35
- FIO Options: ioengine=libaio,iodepth=64,norandommap=1,rw=write,
  size=128M,bs=4k,fsync=1
- FIO Job Configs: 64 jobs in total writing to 4 mount points (ramdisks
  formatted as xfs file system).
- Kernel Codebase: v6.12-rc5
- Test Platform: Xeon 8380 (2 sockets)

Reviewed-by: Tim Chen
Signed-off-by: Wangyang Guo
Reviewed-by: Lai Jiangshan
Signed-off-by: Tejun Heo
---
 kernel/workqueue.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 9949ffad8df0..8b07576814a5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3833,16 +3833,28 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,
 {
 	bool wait = false;
 	struct pool_workqueue *pwq;
+	struct worker_pool *current_pool = NULL;
 
 	if (flush_color >= 0) {
 		WARN_ON_ONCE(atomic_read(&wq->nr_pwqs_to_flush));
 		atomic_set(&wq->nr_pwqs_to_flush, 1);
 	}
 
+	/*
+	 * For unbound workqueue, pwqs will map to only a few pools.
+	 * Most of the time, pwqs within the same pool will be linked
+	 * sequentially to wq->pwqs by cpu index. So in the majority
+	 * of pwq iters, the pool is the same, only doing lock/unlock
+	 * if the pool has changed. This can largely reduce expensive
+	 * lock operations.
+	 */
 	for_each_pwq(pwq, wq) {
-		struct worker_pool *pool = pwq->pool;
-
-		raw_spin_lock_irq(&pool->lock);
+		if (current_pool != pwq->pool) {
+			if (likely(current_pool))
+				raw_spin_unlock_irq(&current_pool->lock);
+			current_pool = pwq->pool;
+			raw_spin_lock_irq(&current_pool->lock);
+		}
 
 		if (flush_color >= 0) {
 			WARN_ON_ONCE(pwq->flush_color != -1);
@@ -3859,9 +3871,11 @@ static bool flush_workqueue_prep_pwqs(struct workqueue_struct *wq,
 			pwq->work_color = work_color;
 		}
 
-		raw_spin_unlock_irq(&pool->lock);
 	}
 
+	if (current_pool)
+		raw_spin_unlock_irq(&current_pool->lock);
+
 	if (flush_color >= 0 &&
 	    atomic_dec_and_test(&wq->nr_pwqs_to_flush))
 		complete(&wq->first_flusher->done);
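[Editorial illustration, not part of the patch: the batching trick above
(hold a lock across a run of consecutive elements that share the same owner,
and cycle it only when the owner changes) generalizes beyond workqueues.
Below is a minimal userspace sketch of the same idea, assuming pthread
mutexes and pools whose locks the caller has already initialized; the
struct and function names are invented for the example.]

	#include <pthread.h>
	#include <stddef.h>

	struct pool {
		pthread_mutex_t lock;	/* e.g. PTHREAD_MUTEX_INITIALIZER */
	};

	struct pwq {
		struct pool *pool;
		int work_color;
	};

	/*
	 * Visit every pwq under its pool lock. Because pwqs that share a
	 * pool sit next to each other in the array, the lock is cycled
	 * once per run of same-pool entries rather than once per entry.
	 */
	static void visit_all(struct pwq *pwqs, size_t n, int color)
	{
		struct pool *current_pool = NULL;

		for (size_t i = 0; i < n; i++) {
			if (current_pool != pwqs[i].pool) {
				if (current_pool)
					pthread_mutex_unlock(&current_pool->lock);
				current_pool = pwqs[i].pool;
				pthread_mutex_lock(&current_pool->lock);
			}
			pwqs[i].work_color = color;	/* per-entry work, under the lock */
		}
		if (current_pool)
			pthread_mutex_unlock(&current_pool->lock);
	}

The invariant matches the patch: throughout the loop body exactly one pool
lock is held, so each entry is updated with the same protection as the
lock-per-entry version, just with far fewer lock/unlock round trips.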