kmap_atomic() disables preemption and pagefaults for historical reasons.
The conversion to kmap_local(), which only disables migration, cannot be
done wholesale because quite some call sites need to be updated to
accommodate with the changed semantics.
On PREEMPT_RT enabled kernels the kmap_atomic() semantics are problematic
due to the implicit disabling of preemption which makes it impossible to
acquire 'sleeping' spinlocks within the kmap atomic sections.
PREEMPT_RT replaces the preempt_disable() with a migrate_disable() for
more than a decade. It could be argued that this is a justification to do
this unconditionally, but PREEMPT_RT covers only a limited number of
architectures and it disables some functionality which limits the coverage
further.
Limit the replacement to PREEMPT_RT for now.
Link: https://lkml.kernel.org/r/20210810091116.pocdmaatdcogvdso@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, the "auto-movable" online policy does not allow for hotplugged
KERNEL (ZONE_NORMAL) memory to increase the amount of MOVABLE memory we
can have, primarily, because there is no coordiantion across memory
devices and we don't want to create zone-imbalances accidentially when
unplugging memory.
However, within a single memory device it's different. Let's allow for
KERNEL memory within a dynamic memory group to allow for more MOVABLE
within the same memory group. The only thing we have to take care of is
that the managing driver avoids zone imbalances by unplugging MOVABLE
memory first, otherwise there can be corner cases where unplug of memory
could result in (accidential) zone imbalances.
virtio-mem is the only user of dynamic memory groups and recently added
support for prioritizing unplug of ZONE_MOVABLE over ZONE_NORMAL, so we
don't need a new toggle to enable it for dynamic memory groups.
We limit this handling to dynamic memory groups, because:
* We want to keep the runtime overhead for collecting stats when
onlining a single memory block small. We tend to have only a handful of
dynamic memory groups, but we can have quite some static memory groups
(e.g., 256 DIMMs).
* It doesn't make too much sense for static memory groups, as we try
onlining all applicable memory blocks either completely to ZONE_MOVABLE
or not. In ordinary operation, we won't have a mixture of zones within
a static memory group.
When adding memory to a dynamic memory group, we'll first online memory to
ZONE_MOVABLE as long as early KERNEL memory allows for it. Then, we'll
online the next unit(s) to ZONE_NORMAL, until we can online the next
unit(s) to ZONE_MOVABLE.
For a simple virtio-mem device with a MOVABLE:KERNEL ratio of 3:1, it will
result in a layout like:
[M][M][M][M][M][M][M][M][N][M][M][M][N][M][M][M]...
^ movable memory due to early kernel memory
^ allows for more movable memory ...
^-----^ ... here
^ allows for more movable memory ...
^-----^ ... here
While the created layout is sub-optimal when it comes to contiguous zones,
it gives us the maximum flexibility when dynamically growing/shrinking a
device; we can grow small VMs really big in small steps, and still shrink
reliably to e.g., 1/4 of the maximum VM size in this example, removing
full memory blocks along with meta data more reliably.
Mark dynamic memory groups in the xarray such that we can efficiently
iterate over them when collecting stats. In usual setups, we have one
virtio-mem device per NUMA node, and usually only a small number of NUMA
nodes.
Note: for now, there seems to be no compelling reason to make this
behavior configurable.
Link: https://lkml.kernel.org/r/20210806124715.17090-10-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In our "auto-movable" memory onlining policy, we want to make decisions
across memory blocks of a single memory device. Examples of memory
devices include ACPI memory devices (in the simplest case a single DIMM)
and virtio-mem. For now, we don't have a connection between a single
memory block device and the real memory device. Each memory device
consists of 1..X memory block devices.
Let's logically group memory blocks belonging to the same memory device in
"memory groups". Memory groups can span multiple physical ranges and a
memory group itself does not contain any information regarding physical
ranges, only properties (e.g., "max_pages") necessary for improved memory
onlining.
Introduce two memory group types:
1) Static memory group: E.g., a single ACPI memory device, consisting
of 1..X memory resources. A memory group consists of 1..Y memory
blocks. The whole group is added/removed in one go. If any part
cannot get offlined, the whole group cannot be removed.
2) Dynamic memory group: E.g., a single virtio-mem device. Memory is
dynamically added/removed in a fixed granularity, called a "unit",
consisting of 1..X memory blocks. A unit is added/removed in one go.
If any part of a unit cannot get offlined, the whole unit cannot be
removed.
In case of 1) we usually want either all memory managed by ZONE_MOVABLE or
none. In case of 2) we usually want to have as many units as possible
managed by ZONE_MOVABLE. We want a single unit to be of the same type.
For now, memory groups are an internal concept that is not exposed to user
space; we might want to change that in the future, though.
add_memory() users can specify a mgid instead of a nid when passing the
MHP_NID_IS_MGID flag.
Link: https://lkml.kernel.org/r/20210806124715.17090-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When onlining without specifying a zone (using "online" instead of
"online_kernel" or "online_movable"), we currently select a zone such that
existing zones are kept contiguous. This online policy made sense in the
past, where contiguous zones where required.
We'd like to implement smarter policies, however:
* User space has little insight. As one example, it has no idea which
memory blocks logically belong together (e.g., to a DIMM or to a
virtio-mem device).
* Drivers that add memory in separate memory blocks, especially
virtio-mem, want memory to get onlined right from the kernel when
adding.
So we really want to have onlining to differing zones managed in the
kernel, configured by user space.
We see more and more cases where we might eventually hotplug a lot of
memory in the future (e.g., eventually grow a 2 GiB VM to 64 GiB),
however:
* Resizing happens dynamically, in smaller steps in both directions
(e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...)
* We still want as much flexibility as possible, especially,
hotunplugging as much memory as possible later.
We can really only use "online_movable" if we know that the amount of
memory we are going to hotplug upfront, and we know that it won't result
in a zone imbalance. So in our example, a 2 GiB VM that could grow to 64
GiB could currently not use "online_movable", and instead, "online_kernel"
would have to be used, resulting in worse (no) memory hotunplug
reliability.
Let's add a new "auto-movable" online policy that considers the current
zone ratios (global, per-node) to determine, whether we a memory block can
be onlined to ZONE_MOVABLE:
MOVABLE : KERNEL
However, internally we'll only consider the following ratio for now:
MOVABLE : KERNEL_EARLY
For now, we don't allow for hotplugged KERNEL memory to allow for more
MOVABLE memory, because there is no coordination across memory devices.
In follow-up patches, we will allow for more KERNEL memory within a memory
device to allow for more MOVABLE memory within the same memory device --
which only makes sense for special memory device types.
We base our calculation on "present pages", see the code comments for
details. Hotplugged memory will get online to ZONE_MOVABLE if the
configured ratio allows for it. Depending on the setup, this can result
in fragmented zones, which can make compaction slower and dynamic
allocation of gigantic pages when not using CMA less reliable (... which
is already pretty unreliable).
The old policy will be the default and called "contig-zones". In
follow-up patches, our new policy will use additional information, such as
memory groups, to make even smarter decisions across memory blocks.
Configuration:
* memory_hotplug.online_policy is used to switch between both polices
and defaults to "contig-zones".
* memory_hotplug.auto_movable_ratio defines the maximum ratio is in
percent and defaults to "301" -- allowing e.g., most 8 GiB machines to
grow to 32 GiB and have all hotplugged memory in ZONE_MOVABLE. The
additional percent accounts for a handful of lost present pages (e.g.,
firmware allocations). User space is expected to adjust this ratio when
enabling the new "auto-movable" policy, though.
* memory_hotplug.auto_movable_numa_aware considers numa node stats in
addition to global stats, and defaults to "true".
Note: just like the old policy, the new policy won't take things like
unmovable huge pages or memory ballooning that doesn't support balloon
compaction into account. User space has to configure onlining
accordingly.
Link: https://lkml.kernel.org/r/20210806124715.17090-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3.
I. Goal
The goal of this series is improving in-kernel auto-online support. It
tackles the fundamental problems that:
1) We can create zone imbalances when onlining all memory blindly to
ZONE_MOVABLE, in the worst case crashing the system. We have to know
upfront how much memory we are going to hotplug such that we can
safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
via "online_movable". This is far from practical and only applicable in
limited setups -- like inside VMs under the RHV/oVirt hypervisor which
will never hotplug more than 3 times the boot memory (and the
limitation is only in place due to the Linux limitation).
2) We see more setups that implement dynamic VM resizing, hot(un)plugging
memory to resize VM memory. In these setups, we might hotplug a lot of
memory, but it might happen in various small steps in both directions
(e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
primary driver of this upstream right now, performing such dynamic
resizing NUMA-aware via multiple virtio-mem devices.
Onlining all hotplugged memory to ZONE_NORMAL means we basically have
no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
easily run into zone imbalances when growing a VM. We want a mixture,
and we want as much memory as reasonable/configured in ZONE_MOVABLE.
Details regarding zone imbalances can be found at [1].
3) Memory devices consist of 1..X memory block devices, however, the
kernel doesn't really track the relationship. Consequently, also user
space has no idea. We want to make per-device decisions.
As one example, for memory hotunplug it doesn't make sense to use a
mixture of zones within a single DIMM: we want all MOVABLE if
possible, otherwise all !MOVABLE, because any !MOVABLE part will easily
block the whole DIMM from getting hotunplugged.
As another example, virtio-mem operates on individual units that span
1..X memory blocks. Similar to a DIMM, we want a unit to either be all
MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however,
all units of a virtio-mem device logically belong together and are
managed (added/removed) by a single driver. We want as much memory of
a virtio-mem device to be MOVABLE as possible.
4) We want memory onlining to be done right from the kernel while adding
memory, not triggered by user space via udev rules; for example, this
is reqired for fast memory hotplug for drivers that add individual
memory blocks, like virito-mem. We want a way to configure a policy in
the kernel and avoid implementing advanced policies in user space.
The auto-onlining support we have in the kernel is not sufficient. All we
have is a) online everything MOVABLE (online_movable) b) online everything
!MOVABLE (online_kernel) c) keep zones contiguous (online). This series
allows configuring c) to mean instead "online movable if possible
according to the coniguration, driven by a maximum MOVABLE:KERNEL ratio"
-- a new onlining policy.
II. Approach
This series does 3 things:
1) Introduces the "auto-movable" online policy that initially operates on
individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
to make a decision whether a memory block will be onlined to
ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
memory does not allow for more MOVABLE memory (details in the
patches). CMA memory is treated like MOVABLE memory.
2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
groups and uses group information to make decisions in the
"auto-movable" online policy across memory blocks of a single memory
device (modeled as memory group). More details can be found in patch
#3 or in the DIMM example below.
3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
allowing ZONE_NORMAL memory within a dynamic memory group to allow for
more ZONE_MOVABLE memory within the same memory group. The target use
case is dynamic VM resizing using virtio-mem. See the virtio-mem
example below.
I remember that the basic idea of using a ratio to implement a policy in
the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I
lost the pointer to that discussion).
For me, the main use case is using it along with virtio-mem (and DIMMs /
ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the
amount of memory we can hotunplug reliably again if we might eventually
hotplug a lot of memory to a VM.
III. Target Usage
The target usage will be:
1) Linux boots with "mhp_default_online_type=offline"
2) User space (e.g., systemd unit) configures memory onlining (according
to a config file and system properties), for example:
* Setting memory_hotplug.online_policy=auto-movable
* Setting memory_hotplug.auto_movable_ratio=301
* Setting memory_hotplug.auto_movable_numa_aware=true
3) User space enabled auto onlining via "echo online >
/sys/devices/system/memory/auto_online_blocks"
4) User space triggers manual onlining of all already-offline memory
blocks (go over offline memory blocks and set them to "online")
IV. Example
For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
301% results in the following layout:
Memory block 0-15: DMA32 (early)
Memory block 32-47: Normal (early)
Memory block 48-79: Movable (DIMM 0)
Memory block 80-111: Movable (DIMM 1)
Memory block 112-143: Movable (DIMM 2)
Memory block 144-275: Normal (DIMM 3)
Memory block 176-207: Normal (DIMM 4)
... all Normal
(-> hotplugged Normal memory does not allow for more Movable memory)
For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
will result in the following layout:
Memory block 0-15: DMA32 (early)
Memory block 32-47: Normal (early)
Memory block 48-143: Movable (virtio-mem, first 12 GiB)
Memory block 144: Normal (virtio-mem, next 128 MiB)
Memory block 145-147: Movable (virtio-mem, next 384 MiB)
Memory block 148: Normal (virtio-mem, next 128 MiB)
Memory block 149-151: Movable (virtio-mem, next 384 MiB)
... Normal/Movable mixture as above
(-> hotplugged Normal memory allows for more Movable memory within
the same device)
Which gives us maximum flexibility when dynamically growing/shrinking a
VM in smaller steps.
V. Doc Update
I'll update the memory-hotplug.rst documentation, once the overhaul [1] is
usptream. Until then, details can be found in patch #2.
VI. Future Work
1) Use memory groups for ppc64 dlpar
2) Being able to specify a portion of (early) kernel memory that will be
excluded from the ratio. Like "128 MiB globally/per node" are excluded.
This might be helpful when starting VMs with extremely small memory
footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting
the first hotplugged units getting onlined to ZONE_MOVABLE. One
alternative would be a trigger to not consider ZONE_DMA memory
in the ratio. We'll have to see if this is really rrequired.
3) Indicate to user space that MOVABLE might be a bad idea -- especially
relevant when memory ballooning without support for balloon compaction
is active.
This patch (of 9):
For implementing a new memory onlining policy, which determines when to
online memory blocks to ZONE_MOVABLE semi-automatically, we need the
number of present early (boot) pages -- present pages excluding hotplugged
pages. Let's track these pages per zone.
Pass a page instead of the zone to adjust_present_page_count(), similar as
adjust_managed_page_count() and derive the zone from the page.
It's worth noting that a memory block to be offlined/onlined is either
completely "early" or "not early". add_memory() and friends can only add
complete memory blocks and we only online/offline complete (individual)
memory blocks.
Link: https://lkml.kernel.org/r/20210806124715.17090-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210806124715.17090-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE".
After recent updates to freeing unused parts of the memory map, no
architecture can have holes in the memory map within a pageblock. This
makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration
option redundant.
The first patch removes them both in a mechanical way and the second patch
simplifies memory_hotplug::test_pages_in_a_zone() that had
pfn_valid_within() surrounded by more logic than simple if.
This patch (of 2):
After recent changes in freeing of the unused parts of the memory map and
rework of pfn_valid() in arm and arm64 there are no architectures that can
have holes in the memory map within a pageblock and so nothing can enable
CONFIG_HOLES_IN_ZONE which guards non trivial implementation of
pfn_valid_within().
With that, pfn_valid_within() is always hardwired to 1 and can be
completely removed.
Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE.
Link: https://lkml.kernel.org/r/20210713080035.7464-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20210713080035.7464-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
timespec64_ns() prevents multiplication overflows by comparing the seconds
value of the timespec to KTIME_SEC_MAX. If the value is greater or equal it
returns KTIME_MAX.
But that check casts the signed seconds value to unsigned which makes the
comparision true for all negative values and therefore return wrongly
KTIME_MAX.
Negative second values are perfectly valid and required in some places,
e.g. ptp_clock_adjtime().
Remove the cast and add a check for the negative boundary which is required
to prevent undefined behaviour due to multiplication underflow.
Fixes: cb47755725 ("time: Prevent undefined behaviour in timespec64_to_ns()")'
Signed-off-by: Lukas Hannen <lukas.hannen@opensource.tttech-industrial.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/AM6PR01MB541637BD6F336B8FFB72AF80EEC69@AM6PR01MB5416.eurprd01.prod.exchangelabs.com
* acpi-pm:
ACPI: PM: s2idle: Run both AMD and Microsoft methods if both are supported
* acpi-docs:
Documentation: ACPI: Align the SSDT overlays file with the code
* pm-cpufreq:
Revert "cpufreq: intel_pstate: Process HWP Guaranteed change notification"
cpufreq: mediatek-hw: Add support for CPUFREQ HW
cpufreq: Add of_perf_domain_get_sharing_cpumask
dt-bindings: cpufreq: add bindings for MediaTek cpufreq HW
cpufreq: Remove ready() callback
cpufreq: sh: Remove sh_cpufreq_cpu_ready()
cpufreq: acpi: Remove acpi_cpufreq_cpu_ready()
cpufreq: qcom-hw: Set dvfs_possible_from_any_cpu cpufreq driver flag
cpufreq: blocklist more Qualcomm platforms in cpufreq-dt-platdev
cpufreq: qcom-cpufreq-hw: Add dcvs interrupt support
cpufreq: scmi: Use .register_em() to register with energy model
cpufreq: vexpress: Use .register_em() to register with energy model
cpufreq: scpi: Use .register_em() to register with energy model
cpufreq: qcom-cpufreq-hw: Use .register_em() to register with energy model
cpufreq: omap: Use .register_em() to register with energy model
cpufreq: mediatek: Use .register_em() to register with energy model
cpufreq: imx6q: Use .register_em() to register with energy model
cpufreq: dt: Use .register_em() to register with energy model
cpufreq: Add callback to register with energy model
cpufreq: vexpress: Set CPUFREQ_IS_COOLING_DEV flag
WARNING: CPU: 0 PID: 10392 at fs/io_uring.c:1151 req_ref_put_and_test
fs/io_uring.c:1151 [inline]
WARNING: CPU: 0 PID: 10392 at fs/io_uring.c:1151 req_ref_put_and_test
fs/io_uring.c:1146 [inline]
WARNING: CPU: 0 PID: 10392 at fs/io_uring.c:1151
io_req_complete_post+0xf5b/0x1190 fs/io_uring.c:1794
Modules linked in:
Call Trace:
tctx_task_work+0x1e5/0x570 fs/io_uring.c:2158
task_work_run+0xe0/0x1a0 kernel/task_work.c:164
tracehook_notify_signal include/linux/tracehook.h:212 [inline]
handle_signal_work kernel/entry/common.c:146 [inline]
exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
exit_to_user_mode_prepare+0x232/0x2a0 kernel/entry/common.c:209
__syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:302
do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x44/0xae
When io_wqe_enqueue() -> io_wqe_create_worker() fails, we can't just
call io_run_cancel() to clean up the request, it's already enqueued via
io_wqe_insert_work() and will be executed either by some other worker
during cancellation (e.g. in io_wq_put_and_exit()).
Reported-by: Hao Sun <sunhao.th@gmail.com>
Fixes: 3146cba99a ("io-wq: make worker creation resilient against signals")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/93b9de0fcf657affab0acfd675d4abcd273ee863.1631092071.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The address of ftrace_graph_caller is already virtual.
Using __va() to translate the address is wrong.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Many comments above functions start with a kernel doc indicator, but
the comments are not using kernel doc style. Get rid of the warnings
by simply removing the indicator.
E.g.:
drivers/s390/crypto/zcrypt_msgtype6.c:111: warning:
This comment starts with '/**', but isn't a kernel-doc comment.
Reviewed-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
A couple of function names don't match what the kernel doc comments
indicate.
Acked-by: Benjamin Block <bblock@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Add __nonstring annotation, since the missing string termination for
id member of sclp_trace_entry is intended. This way we get rid of this
warning:
drivers/s390/char/sclp.c:84:9: warning: ‘strncpy’ output truncated before terminating nul copying 4 bytes from a string of the same length [-Wstringop-truncation]
84 | strncpy(e.id, id, sizeof(e.id));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
As noted in the "Deprecated Interfaces, Language Features, Attributes, and
Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing. This could lead to
values wrapping around and a smaller allocation being made than the caller
was expecting. Using those allocations could lead to linear overflows of
heap memory and other misbehaviors.
In this case this is not actually dynamic sizes: both sides of the
multiplication are constant values. However it is best to refactor this
anyway, just to keep the open-coded math idiom out of code.
So, use the purpose specific kcalloc() function instead of the argument
size * count in the kzalloc() function.
Also, remove the unnecessary initialization of the sqp_tbl variable since
it is set a few lines later.
[1] https://www.kernel.org/doc/html/v5.14/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments
Link: https://lore.kernel.org/r/20210905081812.17113-1-len.baker@gmx.com
Signed-off-by: Len Baker <len.baker@gmx.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>> drivers/infiniband/hw/qib/qib_sysfs.c:411:1: warning: performing pointer subtraction with a null pointer has undefined behavior
+[-Wnull-pointer-subtraction]
QIB_DIAGC_ATTR(rc_resends);
^~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/infiniband/hw/qib/qib_sysfs.c:408:51: note: expanded from macro 'QIB_DIAGC_ATTR'
.counter = &((struct qib_ibport *)0)->rvp.n_##N - (u64 *)0, \
Use offsetof and accomplish the type check using static_assert.
Fixes: 4a7aaf88c8 ("RDMA/qib: Use attributes for the port sysfs")
Link: https://lore.kernel.org/r/0-v1-43ae3c759177+65-qib_type_jgg@nvidia.com
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The XLT chunk alignment depends on ent_size not sizeof(ent_size) aka
sizeof(size_t). The incoming ent_size is either 8 or 16, so the
miscalculation when 16 is required is only an over-alignment and
functional harmless.
Fixes: 8010d74b99 ("RDMA/mlx5: Split the WR setup out of mlx5_ib_update_xlt()")
Link: https://lore.kernel.org/r/20210908081849.7948-2-schnelle@linux.ibm.com
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
In commit 8010d74b99 ("RDMA/mlx5: Split the WR setup out of
mlx5_ib_update_xlt()") the allocation logic was split out of
mlx5_ib_update_xlt() and the logic was changed to enable better OOM
handling. Sadly this change introduced a miscalculation of the number of
entries that were actually allocated when under memory pressure where it
can actually become 0 which on s390 lets dma_map_single() fail.
It can also lead to corruption of the free pages list when the wrong
number of entries is used in the calculation of sg->length which is used
as argument for free_pages().
Fix this by using the allocation size instead of misusing get_order(size).
Cc: stable@vger.kernel.org
Fixes: 8010d74b99 ("RDMA/mlx5: Split the WR setup out of mlx5_ib_update_xlt()")
Link: https://lore.kernel.org/r/20210908081849.7948-1-schnelle@linux.ibm.com
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
The cross-product of the kernel's supported toolchains, architectures,
and configuration options is large. So large, that it's generally
accepted to be infeasible to enumerate and build+test them all
(many compile-testers rely on randomly generated configs).
Without the possibility to enumerate all possible combinations of
toolchains, architectures, and configuration options, it is inevitable
that compiler warnings in this space exist.
With -Werror, this means that an innumerable set of kernels are now
broken, yet had been perfectly usable before (confused compilers, code
with warnings unused, or luck).
Distributors will necessarily pick a point in the toolchain X arch X
config space, and if unlucky, will have a broken build. Granted, those
will likely disable CONFIG_WERROR and move on.
The kernel's default configuration is unlikely to be suitable for all
users, but it's inappropriate to force many users to set CONFIG_WERROR=n.
This also holds for CI systems which are focused on runtime testing,
where the odd warning in some subsystem will disrupt testing of the rest
of the kernel. Many of those runtime-focused CI systems run tests or
fuzz the kernel using runtime debugging tools. Runtime testing of
different subsystems can proceed in parallel, and potentially uncover
serious bugs; halting runtime testing of the entire kernel because of
the odd warning (now error) in a subsystem or driver is simply
inappropriate.
Therefore, runtime-focused CI systems will likely choose CONFIG_WERROR=n
as well.
The appropriate usecase for -Werror is therefore compile-test focused
builds (often done by developers or CI systems).
Reflect this in the Kconfig option by making the default value of WERROR
match COMPILE_TEST.
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Reviwed-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are two bugs in this code. First, if the kzalloc() fails it leads
to a NULL dereference of "ep" on the next line. Second, if the
alloc_event_probe() function returns an error then it leads to an
error pointer dereference in the caller.
Link: https://lkml.kernel.org/r/20210824115150.GI31143@kili
Fixes: 7491e2c442 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>