Add three more affinity scopes - WQ_AFFN_CPU, SMT and CACHE - and make CACHE
the default. The code changes to actually add the additional scopes are
trivial.
Also add a "workqueue.default_affinity_scope" module parameter to override
the default scope and an "affinity_scope" sysfs file to configure it per
workqueue. wq_dump.py and documentation are updated accordingly.
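For reference, these knobs would typically surface at the following paths
(assumed from the usual module-parameter and WQ_SYSFS conventions rather than
spelled out in this message; the sysfs file only exists for workqueues
registered with WQ_SYSFS):

	/sys/module/workqueue/parameters/default_affinity_scope
	/sys/devices/virtual/workqueue/<workqueue>/affinity_scope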
This enables significant flexibility in configuring how unbound workqueues
behave. If affinity scope is set to "cpu", it'll behave close to a per-cpu
workqueue. On the other hand, "system" removes all locality boundaries.
Many modern machines have multiple L3 caches, often while being mostly
uniform in terms of memory access. Thus, workqueue's previous behavior of
spreading work items in each NUMA node had negative performance implications
from unnecessarily crossing L3 boundaries between issue and execution.
However, picking a finer grained affinity scope also has a downside in that
an issuer in one group can't utilize CPUs in other groups.
While dependent on the specifics of the workload, there's usually a
noticeable penalty in crossing L3 boundaries, so let's default to CACHE. This
issue will be further addressed and documented with examples in future patches.
Signed-off-by: Tejun Heo <tj@kernel.org>
While wq_pod_type[] can now group CPUs in any arbitrary way, WQ_AFFN_NUMA
init is hard coded into workqueue_init_topology(). This patch modularizes
the init path by introducing init_pod_type() which takes a callback to
determine whether two CPUs should share a pod as an argument.
init_pod_type() first scans the CPU combinations testing for sharing to
assign consecutive pod IDs and initialize pod_type->cpu_pod[]. Once
->cpu_pod[] is determined, ->pod_cpus[] and ->pod_node[] are initialized
accordingly. WQ_AFFN_NUMA is now initialized by calling init_pod_type() with
cpus_share_numa(), which tests whether two CPUs belong to the same NUMA node.
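A minimal sketch of the callback shape and the resulting init call (function
names follow the description above; the exact bodies and the wq_pod_types[]
array indexing are assumptions):

	static bool cpus_share_numa(int cpu0, int cpu1)
	{
		/* two CPUs share a pod iff they are on the same NUMA node */
		return cpu_to_node(cpu0) == cpu_to_node(cpu1);
	}

	/* during topology init */
	init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa);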
This patch may change the pod ID assigned to each NUMA node but that
shouldn't cause any behavior changes as the NUMA node to use for allocations
is tracked separately in pod_type->pod_node[]. This makes adding new
affinity types pretty easy.
Signed-off-by: Tejun Heo <tj@kernel.org>
Lack of visibility has always been a pain point for workqueues. While the
recently added wq_monitor.py improved the situation, it's still difficult to
understand what worker pools are active in the system, how workqueues map to
them and why. The lack of visibility into how workqueues are configured is
going to become more noticeable as workqueue improves locality awareness and
provides more mechanisms to customize locality related behaviors.
Now that the basic framework for more flexible locality support is in place,
this is a good time to improve the situation. This patch adds
tools/workqueue/wq_dump.py which prints out the topology configuration,
worker pools and how workqueues are mapped to pools. Read the command's help
message for more details.
Signed-off-by: Tejun Heo <tj@kernel.org>
While renamed to pod, the code still assumes that the pods are defined by
NUMA boundaries. Let's generalize it:
* workqueue_attrs->affn_scope is added. Each enum represents the type of
boundaries that define the pods. There are currently two scopes -
WQ_AFFN_NUMA and WQ_AFFN_SYSTEM. The former is the same behavior as before
- one pod per NUMA node. The latter defines one global pod across the
whole system.
* struct wq_pod_type is added which describes how pods are configured for
  each affinity scope (see the sketch after this list). For each pod, it
  lists the member CPUs and the preferred NUMA node for memory allocations.
  The reverse mapping from CPU to pod is also available.
* wq_pod_enabled is dropped. Pod is now always enabled. The previously
disabled behavior is now implemented through WQ_AFFN_SYSTEM.
* get_unbound_pool() wants to determine the NUMA node to allocate memory
from for the new pool. The variables are renamed from node to pod but the
logic still assumes they're one and the same. Clearly distinguish them -
walk the WQ_AFFN_NUMA pods to find the matching pod and then use the pod's
NUMA node.
* wq_calc_pod_cpumask() was taking @pod but assumed that it was the NUMA
node. Take @cpu instead and determine the cpumask to use from the pod_type
matching @attrs.
* apply_wqattrs_prepare() is updated to return ERR_PTR() on error instead of
NULL so that it can indicate -EINVAL on invalid affinity scopes.
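As a sketch, the new structure has roughly the following shape (field names
are inferred from the description above and should be treated as illustrative):

	struct wq_pod_type {
		int		nr_pods;	/* number of pods */
		cpumask_var_t	*pod_cpus;	/* pod -> member CPUs */
		int		*pod_node;	/* pod -> preferred NUMA node */
		int		*cpu_pod;	/* cpu -> pod (reverse mapping) */
	};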
This patch allows CPUs to be grouped into pods however desired per type.
While this patch causes some internal behavior changes, nothing material
should change for workqueue users.
v2: Trigger WARN_ON_ONCE() in wqattrs_pod_type() if affn_scope is
WQ_AFFN_NR_TYPES which indicates that the function is called with a
worker_pool's attrs instead of a workqueue's.
Signed-off-by: Tejun Heo <tj@kernel.org>
workqueue_attrs can be used for both workqueues and worker_pools. However,
some fields, currently only ->ordered, only apply to workqueues and should
be cleared to the default / invalid values.
Currently, an unbound workqueue explicitly clears attrs->ordered in
get_unbound_pool() after copying the source workqueue attrs, while per-cpu
workqueues rely on the fact that zeroing on allocation gives us the desired
default value for pool->attrs->ordered.
This is fragile. Let's add wqattrs_clear_for_pool() which clears
attrs->ordered and is called from both init_worker_pool() and
get_unbound_pool(). This will ease adding more workqueue-only attrs fields.
In get_unbound_pool(), pool->node initialization is moved upwards for
readability. This shouldn't cause any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
For an unbound pool, multiple cpumasks are involved.
U: The user-specified cpumask (may be filtered with cpu_possible_mask).
A: The actual cpumask filtered by wq_unbound_cpumask. If the filtering
leaves no CPU, wq_unbound_cpumask is used.
P: Per-pod subsets of #A.
wq->attrs stores #U, wq->dfl_pwq->pool->attrs->cpumask #A, and
wq->cpu_pwq[CPU]->pool->attrs->cpumask #P.
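As an illustrative example: with #U spanning CPUs 0-7, wq_unbound_cpumask
restricted to CPUs 0-3 and two pods {0,1} and {2,3}, #A becomes CPUs 0-3 and
the per-pod #P cpumasks become {0,1} and {2,3} respectively.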
wq_update_pod() is called to update per-pod pwq's during CPU hotplug. To
calculate the new #P for each workqueue, it needs to call
wq_calc_pod_cpumask() with @attrs that contains #A. Currently,
wq_update_pod() achieves this by calling wq_calc_pod_cpumask() with
wq->dfl_pwq->pool->attrs.
This is rather fragile because we're calling wq_calc_pod_cpumask() with
@attrs of a worker_pool rather than the workqueue's actual attrs when what
we want to calculate is the workqueue's cpumask on the pod. While this works
fine currently, future changes will add fields which are used differently
between workqueues and worker_pools and this subtlety will bite us.
This patch factors out the #U -> #A calculation from apply_wqattrs_prepare()
into wqattrs_actualize_cpumask() and updates wq_update_pod() to copy
wq->unbound_attrs and use the new helper to obtain #A freshly instead of
abusing wq->dfl_pwq->pool->attrs.
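A minimal sketch of what the factored-out helper amounts to (assuming it
masks the cpumask in place and falls back to the unbound cpumask when the
intersection is empty, per the #A definition above):

	static void wqattrs_actualize_cpumask(struct workqueue_attrs *attrs,
					      const cpumask_t *unbound_cpumask)
	{
		/* #U & wq_unbound_cpumask; fall back if the result is empty */
		cpumask_and(attrs->cpumask, attrs->cpumask, unbound_cpumask);
		if (unlikely(cpumask_empty(attrs->cpumask)))
			cpumask_copy(attrs->cpumask, unbound_cpumask);
	}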
This shouldn't cause any behavior changes in the current code.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reference: http://lkml.kernel.org/r/30625cdd-4d61-594b-8db9-6816b017dde3@amd.com
During boot, to initialize unbound CPU pods, wq_pod_init() was called from
workqueue_init(). This is early enough for NUMA nodes to be set up but
before SMP is brought up and CPU topology information is populated.
Workqueue is in the process of improving CPU locality for unbound workqueues
and will need access to topology information during pod init. This adds a
new init function workqueue_init_topology() which is called after CPU
topology information is available and replaces wq_pod_init().
As unbound CPU pods are now initialized after workqueues are activated, we
need to revisit the workqueues to apply the pod configuration. Workqueues
which are created before workqueue_init_topology() are set up so that they
always use the default worker pool. After pods are set up in
workqueue_init_topology(), wq_update_pod() is called on all existing
workqueues to update the pool associations accordingly.
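Conceptually, the revisit amounts to something like the following sketch (the
wq_update_pod() argument list follows the earlier hotplug-path change and is
illustrative here):

	/* in workqueue_init_topology(), after the pods are set up */
	mutex_lock(&wq_pool_mutex);
	list_for_each_entry(wq, &workqueues, list)
		for_each_online_cpu(cpu)
			wq_update_pod(wq, cpu, cpu, true);
	mutex_unlock(&wq_pool_mutex);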
Note that wq_update_pod_attrs_buf allocation is moved to
workqueue_init_early(). This isn't necessary right now but enables further
generalization of pod handling in the future.
This patch changes the initialization sequence but the end result should be
the same.
Signed-off-by: Tejun Heo <tj@kernel.org>
wq_pod_init() is called from workqueue_init() and responsible for
initializing unbound CPU pods according to NUMA node. Workqueue is in the
process of improving affinity awareness and wants to use other topology
information to initialize unbound CPU pods; however, unlike NUMA nodes,
other topology information isn't yet available in workqueue_init().
The next patch will introduce a later stage init function for workqueue
which will be responsible for initializing unbound CPU pods. Relocate
wq_pod_init() below workqueue_init() where the new init function is going to
be located so that the diff can show the content differences.
Just a relocation. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Workqueue is in the process of improving CPU affinity awareness. It will
become more flexible and won't be tied to NUMA node boundaries. This patch
renames all NUMA related names in workqueue.c to use "pod" instead.
While "pod" isn't a very common term, it short and captures the grouping of
CPUs well enough. These names are only going to be used within workqueue
implementation proper, so the specific naming doesn't matter that much.
* wq_numa_possible_cpumask -> wq_pod_cpus
* wq_numa_enabled -> wq_pod_enabled
* wq_update_unbound_numa_attrs_buf -> wq_update_pod_attrs_buf
* workqueue_select_cpu_near -> select_numa_node_cpu
This rename is different from others. The function is only used by
queue_work_node() and specifically tries to find a CPU in the specified
NUMA node. As workqueue affinity will become more flexible and untied from
NUMA, this function's name should specifically describe that it's for
NUMA.
* wq_calc_node_cpumask -> wq_calc_pod_cpumask
* wq_update_unbound_numa -> wq_update_pod
* wq_numa_init -> wq_pod_init
* node -> pod in local variables
Only renames. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
With the recent removal of NUMA related module param and sysfs knob,
workqueue_attrs->no_numa is now only used to implement ordered workqueues.
Let's rename the field so that it's less confusing especially with the
planned CPU affinity awareness improvements.
Just a rename. No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
A pwq (pool_workqueue) represents an association between a workqueue and a
worker_pool. When a work item is queued, the workqueue selects the pwq to
use, which in turn determines the pool, and queues the work item to the pool
through the pwq. pwq is also what implements the maximum concurrency limit -
@max_active.
As a per-cpu workqueue should be associated with a different worker_pool on
each CPU, it always had per-cpu pwq's that are accessed through wq->cpu_pwq.
However, unbound workqueues were sharing a pwq within each NUMA node by
default. The sharing has several downsides:
* Because @max_active is per-pwq, the meaning of @max_active changes
depending on the machine configuration and whether workqueue NUMA locality
support is enabled.
* Makes per-cpu and unbound code deviate.
* Gets in the way of making workqueue CPU locality awareness more flexible.
This patch makes unbound workqueues use per-cpu pwq's the same way per-cpu
workqueues do by making the following changes:
* wq->numa_pwq_tbl[] is removed and unbound workqueues now use wq->cpu_pwq
  just like per-cpu workqueues (see the sketch after this list). wq->cpu_pwq
  is now RCU protected for unbound workqueues.
* numa_pwq_tbl_install() is renamed to install_unbound_pwq() and installs
the specified pwq to the target CPU's wq->cpu_pwq.
* apply_wqattrs_prepare() now always allocates a separate pwq for each CPU
unless the workqueue is ordered. If ordered, all CPUs use wq->dfl_pwq.
This makes the return value of wq_calc_node_cpumask() unnecessary. It now
returns void.
* @max_active now means the same thing for both per-cpu and unbound
workqueues. WQ_UNBOUND_MAX_ACTIVE now equals WQ_MAX_ACTIVE and
documentation is updated accordingly. WQ_UNBOUND_MAX_ACTIVE is no longer
used in workqueue implementation and will be removed later.
* All unbound pwq operations which used to be per-numa-node are now per-cpu.
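As a rough sketch of the end state (illustrative, assuming wq->cpu_pwq is an
RCU-protected per-cpu pointer as described above), both per-cpu and unbound
workqueues then resolve their pwq the same way:

	/* in __queue_work(), once the target CPU is known */
	pwq = rcu_dereference(*per_cpu_ptr(wq->cpu_pwq, cpu));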
For most unbound workqueue users, this shouldn't cause noticeable changes.
Work item issue and completion will be a bit faster, flush_workqueue() would
become a bit more expensive, and the total concurrency limit would likely
become higher. All @max_active==1 use cases are currently being audited for
conversion into alloc_ordered_workqueue() and they shouldn't be affected once
the audit and conversion are complete.
One area where the behavior change may be more noticeable is
workqueue_congested() as the reported congestion state is now per CPU
instead of NUMA node. There are only two users of this interface -
drivers/infiniband/hw/hfi1 and net/smc. Maintainers of both subsystems are
cc'd. Inputs on the behavior change would be very much appreciated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Karsten Graul <kgraul@linux.ibm.com>
Cc: Wenjia Zhang <wenjia@linux.ibm.com>
Cc: Jan Karcher <jaka@linux.ibm.com>
When a CPU went online or offline, wq_update_unbound_numa() was called only
on the CPU which was going up or down. This works fine because all CPUs on
the same NUMA node share the same pool_workqueue slot - one CPU updating it
updates it for everyone in the node.
However, future changes will make each CPU use a separate pool_workqueue
even when they're sharing the same worker_pool, which requires updating
pool_workqueue's for all CPUs which may be sharing the same pool_workqueue
on hotplug.
To accommodate the planned changes, this patch updates
workqueue_on/offline_cpu() so that they call wq_update_unbound_numa() for
all CPUs sharing the same NUMA node as the CPU going up or down. In the
current code, the second+ calls would be noops and there shouldn't be any
behavior changes.
* As wq_update_unbound_numa() is now called on multiple CPUs per each
hotplug event, @cpu is renamed to @hotplug_cpu and another @cpu argument
is added. The former indicates the CPU being hot[un]plugged and the latter
the CPU whose pool_workqueue is being updated.
* In wq_update_unbound_numa(), cpu_off is renamed to off_cpu for consistency
with the new @hotplug_cpu.
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently, all per-cpu pwq's (pool_workqueue's) are allocated directly
through a per-cpu allocation and thus, unlike unbound workqueues, not
reference counted. This difference in lifetime management between the two
types is a bit confusing.
Unbound workqueues are currently accessed through wq->numa_pwq_tbl[] which
isn't suitable for the planned CPU locality related improvements. The plan
is to unify pwq handling across per-cpu and unbound workqueues so that
they're always accessed through wq->cpu_pwq.
In preparation, this patch makes per-cpu pwq's allocated, reference
counted and released the same way as unbound pwq's. wq->cpu_pwq now holds
pointers to pwq's instead of containing them directly.
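In other words, the per-cpu table now holds pointers rather than embedded
structures, roughly (a sketch):

	/* was: alloc_percpu(struct pool_workqueue) */
	wq->cpu_pwq = alloc_percpu(struct pool_workqueue *);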
pwq_unbound_release_workfn() is renamed to pwq_release_workfn() as it's now
also used for per-cpu work items.
Signed-off-by: Tejun Heo <tj@kernel.org>
pool_workqueue release path is currently bounced to system_wq; however, this
is a bit tricky because this bouncing occurs while holding a pool lock and
thus risks causing an A-A deadlock. This is currently addressed by the
fact that only unbound workqueues use this bouncing path and system_wq is a
per-cpu workqueue.
While this works, it's brittle and requires a work-around like setting the
lockdep subclass for the lock of unbound pools. Besides, future changes will
use the bouncing path for per-cpu workqueues too making the current approach
unusable.
Let's just use a dedicated kthread_worker to untangle the dependency. This
is just one more kthread for all workqueues and makes the pwq release logic
simpler and more robust.
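A minimal sketch of the shape of the change (identifier names here are
assumptions where not mentioned above):

	/* a dedicated worker so that pwq release never depends on a workqueue */
	static struct kthread_worker *pwq_release_worker;

	/* set up during workqueue init */
	pwq_release_worker = kthread_create_worker(0, "pool_workqueue_release");

	/* on the release path, instead of bouncing to system_wq */
	kthread_queue_work(pwq_release_worker, &pwq->release_work);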
Signed-off-by: Tejun Heo <tj@kernel.org>
Unbound workqueue CPU affinity is going to receive an overhaul and the NUMA
specific knobs won't make sense anymore. Remove them. Also, the pool_ids
knob was used for debugging and not really meaningful given that there is no
visibility into the pools associated with those IDs. Remove it too. A future
patch will improve overall visibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Collect first_idle_worker(), worker_enter/leave_idle(),
find_worker_executing_work(), move_linked_works() and wake_up_worker() into
one place. These functions will later be used to implement higher level
worker management logic.
No functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
wq->cpu_pwqs is a percpu variable carrying one pointer to a pool_workqueue.
The field name being plural is unusual and confusing. Rename it to singular.
This patch doesn't cause any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
insert_work() always tried to wake up a worker; however, the only time it
needs to try to wake up a worker is when a new active work item is queued.
When a work item goes on the inactive list or a flush work item is queued,
there's no reason to try to wake up a worker.
This patch moves the worker wakeup logic out of insert_work() and places it
in the active new work item queueing path in __queue_work().
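Sketched out, the active queueing path in __queue_work() ends up along these
lines (illustrative; helper and field names beyond insert_work() are
assumptions based on the existing code):

	if (likely(pwq->nr_active < pwq->max_active)) {
		pwq->nr_active++;
		insert_work(pwq, work, &pool->worklist, work_flags);
		/* only a new active work item warrants a wakeup attempt */
		if (__need_more_worker(pool))
			wake_up_worker(pool);
	} else {
		work_flags |= WORK_STRUCT_INACTIVE;
		insert_work(pwq, work, &pwq->inactive_works, work_flags);
	}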
While at it:
* __queue_work() is dereferencing pwq->pool repeatedly. Add local variable
pool.
* Every caller of insert_work() calls debug_work_activate(). Consolidate the
invocations into insert_work().
* In __queue_work() pool->watchdog_ts update is relocated slightly. This is
to better accommodate future changes.
This makes wakeups more precise and will help the planned change to assign
work items to workers before waking them up. No behavior changes intended.
v2: WARN_ON_ONCE(pool != last_pool) added in __queue_work() to clarify as
suggested by Lai.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
* Drop the trivial optimization in worker_thread() where it bypasses calling
process_scheduled_works() if the first work item isn't linked. This is a
mostly pointless micro optimization and gets in the way of improving the
work processing path.
* Consolidate pool->watchdog_ts updates in the two callers into
process_scheduled_works().
Signed-off-by: Tejun Heo <tj@kernel.org>
worker->flags used to be accessed from scheduler hooks without grabbing
pool->lock for concurrency management. This is no longer true since
6d25be5782 ("sched/core, workqueues: Distangle worker accounting from rq
lock"). Also, it's unclear why worker_pool->flags was using the "X" rule.
All relevant users are accessing it under the pool lock.
Let's drop the special "X" rule and use the "L" rule for these flag fields
instead. While at it, replace the CONTEXT comment with
lockdep_assert_held().
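For instance, the flag setter then reads along these lines (a sketch of the
locking-rule change only; the concurrency-management bookkeeping is omitted):

	static void worker_set_flags(struct worker *worker, unsigned int flags)
	{
		struct worker_pool *pool = worker->pool;

		/* "L" rule: require pool->lock instead of "must be the worker itself" */
		lockdep_assert_held(&pool->lock);

		worker->flags |= flags;
	}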
This allows worker_set/clr_flags() to be used from context which isn't the
worker itself. This will be used later to implement assigning work items to
workers before waking them up so that workqueue can have better control over
which worker executes which work item on which CPU.
The only actual changes are sanity checks. There shouldn't be any visible
behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
The unbound workqueue execution locality improvement patchset is about to be
applied, which will cause merge conflicts with changes in for-6.5-fixes.
Let's avoid future merge conflict by pulling in for-6.5-fixes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Use LIST_HEAD() to initialize cull_list instead of open-coding it.
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
wq_cpu_intensive_thresh_us is used to detect CPU-hogging per-cpu work items.
Once detected, they're excluded from concurrency management to prevent them
from blocking other per-cpu work items. If CONFIG_WQ_CPU_INTENSIVE_REPORT is
enabled, repeat offenders are also reported so that the code can be updated.
The default threshold is 10ms which is long enough to do a fair bit of work
on modern CPUs while short enough to usually go unnoticed. This
unfortunately leads to a lot of, arguably spurious, detections on very slow
CPUs. Using the same threshold across CPUs whose performance levels may be
orders of magnitude apart doesn't make a whole lot of sense.
This patch scales wq_cpu_intensive_thresh_us up to 1 second when BogoMIPS is
below 4000. This is obviously very inaccurate but it doesn't have to be
accurate to be useful. The mechanism is still useful when the threshold is
fully scaled up and the benefits of reports are usually shared with everyone
regardless of who's reporting, so as long as there is a sufficient number of
fast machines reporting, we don't lose much.
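As a sketch of the intended arithmetic (illustrative; the BogoMIPS value is
derived from loops_per_jiffy in the usual way and the exact code may differ):

	unsigned long thresh = 10 * USEC_PER_MSEC;	/* 10ms default */
	unsigned long bogo = max_t(unsigned long, loops_per_jiffy / 500000 * HZ, 1);

	/* scale the threshold up on slow CPUs, capping at 1 second */
	if (bogo < 4000)
		thresh = min_t(unsigned long, thresh * 4000 / bogo, USEC_PER_SEC);

	wq_cpu_intensive_thresh_us = thresh;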
Some (or is it all?) ARM CPUs systematically report significantly lower
BogoMIPS. While this doesn't break anything, given how widespread ARM CPUs
are, it's at least a missed opportunity and it probably would be a good idea
to teach workqueue about it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
There exists no parameter called "cpu_intensive_threshold_us".
The actual parameter name is "cpu_intensive_thresh_us".
Fixes: 6363845005 ("workqueue: Report work funcs that trigger automatic CPU_INTENSIVE mechanism")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
The motivation for doing this is to improve boot times for devices when we
want to prevent our workqueue works from running on some specific CPUs,
e.g. CPUs that are busy with interrupts.
Signed-off-by: tiozhang <tiozhang@didiglobal.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Based on commit c4f135d643 ("workqueue: Wrap flush_workqueue() using
a macro"), all in-tree users stopped flushing system-wide workqueues.
Therefore, start emitting a runtime message so that all out-of-tree users
will understand that they need to update their code.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
Merge tag 'v6.5-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
"Fix a couple of regressions in af_alg and incorrect return values in
crypto/asymmetric_keys/public_key"
* tag 'v6.5-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: algif_hash - Fix race between MORE and non-MORE sends
KEYS: asymmetric: Fix error codes
crypto: af_alg - Fix merging of written data into spliced pages
We just sorted the entries and fields last release, so just out of a
perverse sense of curiosity, I decided to see if we can keep things
ordered for even just one release.
The answer is "No. No we cannot".
I suggest that all kernel developers will need weekly training sessions,
involving a lot of Big Bird and Sesame Street. And at the yearly
maintainer summit, we will all sing the alphabet song together.
I doubt I will keep doing this. At some point "perverse sense of
curiosity" turns into just a cold dark place filled with sadness and
despair.
Repeats: 80e62bc848 ("MAINTAINERS: re-sort all entries and fields")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'dma-mapping-6.5-2023-07-09' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
- swiotlb area sizing fixes (Petr Tesarik)
* tag 'dma-mapping-6.5-2023-07-09' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: reduce the number of areas to match actual memory pool size
swiotlb: always set the number of areas before allocating the pool
Merge tag 'x86_urgent_for_v6.5_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu fix from Borislav Petkov:
- Do FPU AP initialization on Xen PV too which got missed by the recent
boot reordering work
* tag 'x86_urgent_for_v6.5_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/xen: Fix secondary processors' FPU initialization
On shutdown or kexec, the kernel tries to park the non-boot CPUs with an
INIT IPI. But the same code path is also used by the crash utility. If the
CPU which panics is not the boot CPU then it sends an INIT IPI to the boot
CPU which resets the machine. Prevent this by validating that the CPU which
runs the stop mechanism is the boot CPU. If not, leave the other CPUs in
HLT.
Merge tag 'x86-core-2023-07-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Thomas Gleixner:
"A single fix for the mechanism to park CPUs with an INIT IPI.
On shutdown or kexec, the kernel tries to park the non-boot CPUs with
an INIT IPI. But the same code path is also used by the crash utility.
If the CPU which panics is not the boot CPU then it sends an INIT IPI
to the boot CPU which resets the machine.
Prevent this by validating that the CPU which runs the stop mechanism
is the boot CPU. If not, leave the other CPUs in HLT"
* tag 'x86-core-2023-07-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/smp: Don't send INIT to boot CPU
* Fix an uninitialized variable warning.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Merge tag 'xfs-6.5-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fix from Darrick Wong:
"Nothing exciting here, just getting rid of a gcc warning that I got
tired of seeing when I turn on gcov"
* tag 'xfs-6.5-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix uninit warning in xfs_growfs_data
Merge tag '6.5-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Pull more smb client updates from Steve French:
- fix potential use after free in unmount
- minor cleanup
- add worker to cleanup stale directory leases
* tag '6.5-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Add a laundromat thread for cached directories
smb: client: remove redundant pointer 'server'
cifs: fix session state transition to avoid use-after-free issue
Lockdep is certainly right to complain about
(&vma->vm_lock->lock){++++}-{3:3}, at: vma_start_write+0x2d/0x3f
but task is already holding lock:
(&mapping->i_mmap_rwsem){+.+.}-{3:3}, at: mmap_region+0x4dc/0x6db
Invert those to the usual ordering.
Fixes: 33313a747e ("mm: lock newly mapped VMA which can be modified after it becomes visible")
Cc: stable@vger.kernel.org
Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'mm-hotfixes-stable-2023-07-08-10-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"16 hotfixes. Six are cc:stable and the remainder address post-6.4
issues"
The merge undoes the disabling of the CONFIG_PER_VMA_LOCK feature, since
it was all hopefully fixed in mainline.
* tag 'mm-hotfixes-stable-2023-07-08-10-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
lib: dhry: fix sleeping allocations inside non-preemptable section
kasan, slub: fix HW_TAGS zeroing with slub_debug
kasan: fix type cast in memory_is_poisoned_n
mailmap: add entries for Heiko Stuebner
mailmap: update manpage link
bootmem: remove the vmemmap pages from kmemleak in free_bootmem_page
MAINTAINERS: add linux-next info
mailmap: add Markus Schneider-Pargmann
writeback: account the number of pages written back
mm: call arch_swap_restore() from do_swap_page()
squashfs: fix cache race with migration
mm/hugetlb.c: fix a bug within a BUG(): inconsistent pte comparison
docs: update ocfs2-devel mailing list address
MAINTAINERS: update ocfs2-devel mailing list address
mm: disable CONFIG_PER_VMA_LOCK until its fixed
fork: lock VMAs of the parent process when forking
When forking a child process, the parent write-protects anonymous pages
and COW-shares them with the child being forked using copy_present_pte().
We must not take any concurrent page faults on the source vma's as they
are being processed, as we expect both the vma and the pte's behind it
to be stable. For example, the anon_vma_fork() expects the parent's
vma->anon_vma to not change during the vma copy.
A concurrent page fault on a page newly marked read-only by the page
copy might trigger wp_page_copy() and an anon_vma_prepare(vma) on the
source vma, defeating the anon_vma_clone() that wasn't done because the
parent vma originally didn't have an anon_vma, but we now might end up
copying a pte entry for a page that has one.
Before the per-vma lock based changes, the mmap_lock guaranteed
exclusion with concurrent page faults. But now we need to do a
vma_start_write() to make sure no concurrent faults happen on this vma
while it is being processed.
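In practice this boils down to something like the following in the dup_mmap()
loop (a sketch; `mpnt` is assumed to be the loop's source-vma variable there):

	/* for each source vma, before its pages/ptes are copied */
	vma_start_write(mpnt);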
This fix can potentially regress some fork-heavy workloads. Kernel build
time did not show a noticeable regression on a 56-core machine, while a
stress test mapping 10000 VMAs and forking 5000 times in a tight loop shows
a ~5% regression. If such fork time regression is unacceptable,
disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
optimizations are possible if this regression proves to be problematic.
Suggested-by: David Hildenbrand <david@redhat.com>
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/
Reported-by: Jacob Young <jacobly.alt@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Fixes: 0bff0aaea0 ("x86/mm: try VMA lock-based page fault handling first")
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mmap_region adds a newly created VMA into the VMA tree and might modify it
afterwards before dropping the mmap_lock. This poses a problem for page
faults handled under per-VMA locks because they don't take the mmap_lock
and can stumble on this VMA while it's still being modified. Currently
this does not pose a problem since post-addition modifications are done
only for file-backed VMAs, which are not handled under per-VMA lock.
However, once support for handling file-backed page faults with per-VMA
locks is added, this will become a race.
Fix this by write-locking the VMA before inserting it into the VMA tree.
Other places where a new VMA is added into VMA tree do not modify it
after the insertion, so do not need the same locking.
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With recent changes necessitating mmap_lock to be held for write while
expanding a stack, per-VMA locks should follow the same rules and be
write-locked to prevent page faults into the VMA being expanded. Add
the necessary locking.
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A few late arriving patches that missed the initial pull request.
It's mostly bug fixes (the dt-bindings is a fix for the initial pull).
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull more SCSI updates from James Bottomley:
"A few late arriving patches that missed the initial pull request. It's
mostly bug fixes (the dt-bindings is a fix for the initial pull)"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: core: Remove unused function declaration
scsi: target: docs: Remove tcm_mod_builder.py
scsi: target: iblock: Quiet bool conversion warning with pr_preempt use
scsi: dt-bindings: ufs: qcom: Fix ICE phandle
scsi: core: Simplify scsi_cdl_check_cmd()
scsi: isci: Fix comment typo
scsi: smartpqi: Replace one-element arrays with flexible-array members
scsi: target: tcmu: Replace strlcpy() with strscpy()
scsi: ncr53c8xx: Replace strlcpy() with strscpy()
scsi: lpfc: Fix lpfc_name struct packing
* xiic patch should have been in part 1 but slipped through
* mpc patch fixes a build regression from part 1
* nomadik is a fix which needed a rebase after part 1
Merge tag 'i2c-for-6.5-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull more i2c updates from Wolfram Sang:
- xiic patch should have been in the original pull but slipped through
- mpc patch fixes a build regression
- nomadik cleanup
* tag 'i2c-for-6.5-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: mpc: Drop unused variable
i2c: nomadik: Remove a useless call in the remove function
i2c: xiic: Don't try to handle more interrupt events after error
The debugfs_create_dir function returns ERR_PTR in case of error, and the
only correct way to check whether an error occurred is the 'IS_ERR' inline
function. This patch will replace the null-comparison with IS_ERR.
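A minimal before/after sketch (the directory name is illustrative):

	struct dentry *dir = debugfs_create_dir("example", NULL);

	if (IS_ERR(dir))	/* previously: if (!dir) */
		return PTR_ERR(dir);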
Signed-off-by: Anup Sharma <anupnewsmail@gmail.com>
Suggested-by: Ivan Orlov <ivan.orlov0322@gmail.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
Build:
- Allow generating vmlinux.h from BTF using `make GEN_VMLINUX_H=1`
and skip if the vmlinux has no BTF.
- Replace deprecated clang -target xxx option by --target=xxx.
perf record:
- Print event attributes with well known type and config symbols in the
debug output like below:
# perf record -e cycles,cpu-clock -C0 -vv true
<SNIP>
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0 (PERF_COUNT_HW_CPU_CYCLES)
{ sample_period, sample_freq } 4000
sample_type IP|TID|TIME|CPU|PERIOD|IDENTIFIER
read_format ID
disabled 1
inherit 1
freq 1
sample_id_all 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8 = 5
------------------------------------------------------------
perf_event_attr:
type 1 (PERF_TYPE_SOFTWARE)
size 136
config 0 (PERF_COUNT_SW_CPU_CLOCK)
{ sample_period, sample_freq } 4000
sample_type IP|TID|TIME|CPU|PERIOD|IDENTIFIER
read_format ID
disabled 1
inherit 1
freq 1
sample_id_all 1
exclude_guest 1
- Update AMD IBS event error message since it now supports per-process
profiling but no privilege filters.
$ sudo perf record -e ibs_op//k -C 0
Error:
AMD IBS doesn't support privilege filtering. Try again without
the privilege modifiers (like 'k') at the end.
perf lock contention:
- Support CSV style output using -x option
$ sudo perf lock con -ab -x, sleep 1
# output: contended, total wait, max wait, avg wait, type, caller
19, 194232, 21415, 10222, spinlock, process_one_work+0x1f0
15, 162748, 23843, 10849, rwsem:R, do_user_addr_fault+0x40e
4, 86740, 23415, 21685, rwlock:R, ep_poll_callback+0x2d
1, 84281, 84281, 84281, mutex, iwl_mvm_async_handlers_wk+0x135
8, 67608, 27404, 8451, spinlock, __queue_work+0x174
3, 58616, 31125, 19538, rwsem:W, do_mprotect_pkey+0xff
3, 52953, 21172, 17651, rwlock:W, do_epoll_wait+0x248
2, 30324, 19704, 15162, rwsem:R, do_madvise+0x3ad
1, 24619, 24619, 24619, spinlock, rcu_core+0xd4
- Add --output option to save the data to a file so it is not interfered
with by other debug messages.
Test:
- Fix event parsing test on ARM where there's no raw PMU and
PERF_PMU_CAP_EXTENDED_HW_TYPE isn't supported.
- Update the lock contention test case for CSV output.
- Fix a segfault in the daemon command test.
Vendor events (JSON):
- Add has_event() to check if the given event is available on system
at runtime. On Intel machines, some transaction events may not be
present when TSC extensions are disabled.
- Update Intel event metrics.
Misc:
- Sort symbols by name using an external array of pointers instead of
an rbtree node in the symbol. This will save 16 or 24 bytes
per symbol whether or not the sorting is actually requested.
- Fix unwinding DWARF callstacks using libdw when --symfs option is
used.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Merge tag 'perf-tools-for-v6.5-2-2023-07-06' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next
Pull more perf tools updates from Namhyung Kim:
"These are remaining changes and fixes for this cycle.
Build:
- Allow generating vmlinux.h from BTF using `make GEN_VMLINUX_H=1`
and skip if the vmlinux has no BTF.
- Replace deprecated clang -target xxx option by --target=xxx.
perf record:
- Print event attributes with well known type and config symbols in
the debug output like below:
# perf record -e cycles,cpu-clock -C0 -vv true
<SNIP>
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0 (PERF_COUNT_HW_CPU_CYCLES)
{ sample_period, sample_freq } 4000
sample_type IP|TID|TIME|CPU|PERIOD|IDENTIFIER
read_format ID
disabled 1
inherit 1
freq 1
sample_id_all 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8 = 5
------------------------------------------------------------
perf_event_attr:
type 1 (PERF_TYPE_SOFTWARE)
size 136
config 0 (PERF_COUNT_SW_CPU_CLOCK)
{ sample_period, sample_freq } 4000
sample_type IP|TID|TIME|CPU|PERIOD|IDENTIFIER
read_format ID
disabled 1
inherit 1
freq 1
sample_id_all 1
exclude_guest 1
- Update AMD IBS event error message since it now supports per-process
profiling but no privilege filters.
$ sudo perf record -e ibs_op//k -C 0
Error:
AMD IBS doesn't support privilege filtering. Try again without
the privilege modifiers (like 'k') at the end.
perf lock contention:
- Support CSV style output using -x option
$ sudo perf lock con -ab -x, sleep 1
# output: contended, total wait, max wait, avg wait, type, caller
19, 194232, 21415, 10222, spinlock, process_one_work+0x1f0
15, 162748, 23843, 10849, rwsem:R, do_user_addr_fault+0x40e
4, 86740, 23415, 21685, rwlock:R, ep_poll_callback+0x2d
1, 84281, 84281, 84281, mutex, iwl_mvm_async_handlers_wk+0x135
8, 67608, 27404, 8451, spinlock, __queue_work+0x174
3, 58616, 31125, 19538, rwsem:W, do_mprotect_pkey+0xff
3, 52953, 21172, 17651, rwlock:W, do_epoll_wait+0x248
2, 30324, 19704, 15162, rwsem:R, do_madvise+0x3ad
1, 24619, 24619, 24619, spinlock, rcu_core+0xd4
- Add --output option to save the data to a file so it is not interfered
with by other debug messages.
Test:
- Fix event parsing test on ARM where there's no raw PMU and
PERF_PMU_CAP_EXTENDED_HW_TYPE isn't supported.
- Update the lock contention test case for CSV output.
- Fix a segfault in the daemon command test.
Vendor events (JSON):
- Add has_event() to check if the given event is available on system
at runtime. On Intel machines, some transaction events may not be
present when TSC extensions are disabled.
- Update Intel event metrics.
Misc:
- Sort symbols by name using an external array of pointers instead of
an rbtree node in the symbol. This will save 16 or 24 bytes
per symbol whether or not the sorting is actually requested.
- Fix unwinding DWARF callstacks using libdw when --symfs option is
used"
* tag 'perf-tools-for-v6.5-2-2023-07-06' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next: (38 commits)
perf test: Fix event parsing test when PERF_PMU_CAP_EXTENDED_HW_TYPE isn't supported.
perf test: Fix event parsing test on Arm
perf evsel amd: Fix IBS error message
perf: unwind: Fix symfs with libdw
perf symbol: Fix uninitialized return value in symbols__find_by_name()
perf test: Test perf lock contention CSV output
perf lock contention: Add --output option
perf lock contention: Add -x option for CSV style output
perf lock: Remove stale comments
perf vendor events intel: Update tigerlake to 1.13
perf vendor events intel: Update skylakex to 1.31
perf vendor events intel: Update skylake to 57
perf vendor events intel: Update sapphirerapids to 1.14
perf vendor events intel: Update icelakex to 1.21
perf vendor events intel: Update icelake to 1.19
perf vendor events intel: Update cascadelakex to 1.19
perf vendor events intel: Update meteorlake to 1.03
perf vendor events intel: Add rocketlake events/metrics
perf vendor metrics intel: Make transaction metrics conditional
perf jevents: Support for has_event function
...
Fixes for different bitmap pieces:
- lib/test_bitmap: increment failure counter properly
The tests that don't use the expect_eq() macro to determine that a test
has failed must increment failed_tests explicitly.
- lib/bitmap: drop optimization of bitmap_{from,to}_arr64
bitmap_{from,to}_arr64() optimization is overly optimistic on 32-bit LE
architectures when it's wired to bitmap_copy_clear_tail().
- nodemask: Drop duplicate check in for_each_node_mask()
As the return value type of first_node() became unsigned, the node >= 0
became unnecessary.
- cpumask: fix function description kernel-doc notation
- MAINTAINERS: Add bits.h to the BITMAP API record
- MAINTAINERS: Add bitfield.h to the BITMAP API record
Add linux/bits.h and linux/bitfield.h for visibility
Merge tag 'bitmap-6.5-rc1' of https://github.com/norov/linux
Pull bitmap updates from Yury Norov:
"Fixes for different bitmap pieces:
- lib/test_bitmap: increment failure counter properly
The tests that don't use the expect_eq() macro to determine that a test
has failed must increment failed_tests explicitly.
- lib/bitmap: drop optimization of bitmap_{from,to}_arr64
bitmap_{from,to}_arr64() optimization is overly optimistic
on 32-bit LE architectures when it's wired to
bitmap_copy_clear_tail().
- nodemask: Drop duplicate check in for_each_node_mask()
As the return value type of first_node() became unsigned, the node
>= 0 became unnecessary.
- cpumask: fix function description kernel-doc notation
- MAINTAINERS: Add bits.h and bitfield.h to the BITMAP API record
Add linux/bits.h and linux/bitfield.h for visibility"
* tag 'bitmap-6.5-rc1' of https://github.com/norov/linux:
MAINTAINERS: Add bitfield.h to the BITMAP API record
MAINTAINERS: Add bits.h to the BITMAP API record
cpumask: fix function description kernel-doc notation
nodemask: Drop duplicate check in for_each_node_mask()
lib/bitmap: drop optimization of bitmap_{from,to}_arr64
lib/test_bitmap: increment failure counter properly
The Smatch static checker reports the following warnings:
lib/dhry_run.c:38 dhry_benchmark() warn: sleeping in atomic context
lib/dhry_run.c:43 dhry_benchmark() warn: sleeping in atomic context
Indeed, dhry() does sleeping allocations inside the non-preemptable
section delimited by get_cpu()/put_cpu().
Fix this by using atomic allocations instead.
Add error handling, as these atomic allocations may fail.
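A sketch of the resulting pattern (variable names are illustrative):

	/* inside the non-preemptable get_cpu()/put_cpu() section */
	buf = kzalloc(size, GFP_ATOMIC);	/* was a sleeping GFP_KERNEL allocation */
	if (!buf)
		return -ENOMEM;			/* atomic allocations can fail */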
Link: https://lkml.kernel.org/r/bac6d517818a7cd8efe217c1ad649fffab9cc371.1688568764.git.geert+renesas@glider.be
Fixes: 13684e966d ("lib: dhry: fix unstable smp_processor_id(_) usage")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/0469eb3a-02eb-4b41-b189-de20b931fa56@moroto.mountain
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>