mirror of
https://github.com/torvalds/linux.git
synced 2024-11-24 21:21:41 +00:00
c91c6062d6
Commit 79365026f8
("crash: add a new kexec flag for hotplug support")
generalizes the crash hotplug support to allow architectures to update
multiple kexec segments on CPU/Memory hotplug and not just elfcorehdr.
Therefore, update the relevant kernel documentation to reflect the same.
No functional change.
Link: https://lkml.kernel.org/r/20240812041651.703156-1-sourabhjain@linux.ibm.com
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Reviewed-by: Petr Tesarik <ptesarik@suse.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Petr Tesarik <petr@tesarici.cz>
Cc: Sourabh Jain <sourabhjain@linux.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
697 lines
28 KiB
ReStructuredText
697 lines
28 KiB
ReStructuredText
==================
|
|
Memory Hot(Un)Plug
|
|
==================
|
|
|
|
This document describes generic Linux support for memory hot(un)plug with
|
|
a focus on System RAM, including ZONE_MOVABLE support.
|
|
|
|
.. contents:: :local:
|
|
|
|
Introduction
|
|
============
|
|
|
|
Memory hot(un)plug allows for increasing and decreasing the size of physical
|
|
memory available to a machine at runtime. In the simplest case, it consists of
|
|
physically plugging or unplugging a DIMM at runtime, coordinated with the
|
|
operating system.
|
|
|
|
Memory hot(un)plug is used for various purposes:
|
|
|
|
- The physical memory available to a machine can be adjusted at runtime, up- or
|
|
downgrading the memory capacity. This dynamic memory resizing, sometimes
|
|
referred to as "capacity on demand", is frequently used with virtual machines
|
|
and logical partitions.
|
|
|
|
- Replacing hardware, such as DIMMs or whole NUMA nodes, without downtime. One
|
|
example is replacing failing memory modules.
|
|
|
|
- Reducing energy consumption either by physically unplugging memory modules or
|
|
by logically unplugging (parts of) memory modules from Linux.
|
|
|
|
Further, the basic memory hot(un)plug infrastructure in Linux is nowadays also
|
|
used to expose persistent memory, other performance-differentiated memory and
|
|
reserved memory regions as ordinary system RAM to Linux.
|
|
|
|
Linux only supports memory hot(un)plug on selected 64 bit architectures, such as
|
|
x86_64, arm64, ppc64 and s390x.
|
|
|
|
Memory Hot(Un)Plug Granularity
|
|
------------------------------
|
|
|
|
Memory hot(un)plug in Linux uses the SPARSEMEM memory model, which divides the
|
|
physical memory address space into chunks of the same size: memory sections. The
|
|
size of a memory section is architecture dependent. For example, x86_64 uses
|
|
128 MiB and ppc64 uses 16 MiB.
|
|
|
|
Memory sections are combined into chunks referred to as "memory blocks". The
|
|
size of a memory block is architecture dependent and corresponds to the smallest
|
|
granularity that can be hot(un)plugged. The default size of a memory block is
|
|
the same as memory section size, unless an architecture specifies otherwise.
|
|
|
|
All memory blocks have the same size.
|
|
|
|
Phases of Memory Hotplug
|
|
------------------------
|
|
|
|
Memory hotplug consists of two phases:
|
|
|
|
(1) Adding the memory to Linux
|
|
(2) Onlining memory blocks
|
|
|
|
In the first phase, metadata, such as the memory map ("memmap") and page tables
|
|
for the direct mapping, is allocated and initialized, and memory blocks are
|
|
created; the latter also creates sysfs files for managing newly created memory
|
|
blocks.
|
|
|
|
In the second phase, added memory is exposed to the page allocator. After this
|
|
phase, the memory is visible in memory statistics, such as free and total
|
|
memory, of the system.
|
|
|
|
Phases of Memory Hotunplug
|
|
--------------------------
|
|
|
|
Memory hotunplug consists of two phases:
|
|
|
|
(1) Offlining memory blocks
|
|
(2) Removing the memory from Linux
|
|
|
|
In the first phase, memory is "hidden" from the page allocator again, for
|
|
example, by migrating busy memory to other memory locations and removing all
|
|
relevant free pages from the page allocator After this phase, the memory is no
|
|
longer visible in memory statistics of the system.
|
|
|
|
In the second phase, the memory blocks are removed and metadata is freed.
|
|
|
|
Memory Hotplug Notifications
|
|
============================
|
|
|
|
There are various ways how Linux is notified about memory hotplug events such
|
|
that it can start adding hotplugged memory. This description is limited to
|
|
systems that support ACPI; mechanisms specific to other firmware interfaces or
|
|
virtual machines are not described.
|
|
|
|
ACPI Notifications
|
|
------------------
|
|
|
|
Platforms that support ACPI, such as x86_64, can support memory hotplug
|
|
notifications via ACPI.
|
|
|
|
In general, a firmware supporting memory hotplug defines a memory class object
|
|
HID "PNP0C80". When notified about hotplug of a new memory device, the ACPI
|
|
driver will hotplug the memory to Linux.
|
|
|
|
If the firmware supports hotplug of NUMA nodes, it defines an object _HID
|
|
"ACPI0004", "PNP0A05", or "PNP0A06". When notified about an hotplug event, all
|
|
assigned memory devices are added to Linux by the ACPI driver.
|
|
|
|
Similarly, Linux can be notified about requests to hotunplug a memory device or
|
|
a NUMA node via ACPI. The ACPI driver will try offlining all relevant memory
|
|
blocks, and, if successful, hotunplug the memory from Linux.
|
|
|
|
Manual Probing
|
|
--------------
|
|
|
|
On some architectures, the firmware may not be able to notify the operating
|
|
system about a memory hotplug event. Instead, the memory has to be manually
|
|
probed from user space.
|
|
|
|
The probe interface is located at::
|
|
|
|
/sys/devices/system/memory/probe
|
|
|
|
Only complete memory blocks can be probed. Individual memory blocks are probed
|
|
by providing the physical start address of the memory block::
|
|
|
|
% echo addr > /sys/devices/system/memory/probe
|
|
|
|
Which results in a memory block for the range [addr, addr + memory_block_size)
|
|
being created.
|
|
|
|
.. note::
|
|
|
|
Using the probe interface is discouraged as it is easy to crash the kernel,
|
|
because Linux cannot validate user input; this interface might be removed in
|
|
the future.
|
|
|
|
Onlining and Offlining Memory Blocks
|
|
====================================
|
|
|
|
After a memory block has been created, Linux has to be instructed to actually
|
|
make use of that memory: the memory block has to be "online".
|
|
|
|
Before a memory block can be removed, Linux has to stop using any memory part of
|
|
the memory block: the memory block has to be "offlined".
|
|
|
|
The Linux kernel can be configured to automatically online added memory blocks
|
|
and drivers automatically trigger offlining of memory blocks when trying
|
|
hotunplug of memory. Memory blocks can only be removed once offlining succeeded
|
|
and drivers may trigger offlining of memory blocks when attempting hotunplug of
|
|
memory.
|
|
|
|
Onlining Memory Blocks Manually
|
|
-------------------------------
|
|
|
|
If auto-onlining of memory blocks isn't enabled, user-space has to manually
|
|
trigger onlining of memory blocks. Often, udev rules are used to automate this
|
|
task in user space.
|
|
|
|
Onlining of a memory block can be triggered via::
|
|
|
|
% echo online > /sys/devices/system/memory/memoryXXX/state
|
|
|
|
Or alternatively::
|
|
|
|
% echo 1 > /sys/devices/system/memory/memoryXXX/online
|
|
|
|
The kernel will select the target zone automatically, depending on the
|
|
configured ``online_policy``.
|
|
|
|
One can explicitly request to associate an offline memory block with
|
|
ZONE_MOVABLE by::
|
|
|
|
% echo online_movable > /sys/devices/system/memory/memoryXXX/state
|
|
|
|
Or one can explicitly request a kernel zone (usually ZONE_NORMAL) by::
|
|
|
|
% echo online_kernel > /sys/devices/system/memory/memoryXXX/state
|
|
|
|
In any case, if onlining succeeds, the state of the memory block is changed to
|
|
be "online". If it fails, the state of the memory block will remain unchanged
|
|
and the above commands will fail.
|
|
|
|
Onlining Memory Blocks Automatically
|
|
------------------------------------
|
|
|
|
The kernel can be configured to try auto-onlining of newly added memory blocks.
|
|
If this feature is disabled, the memory blocks will stay offline until
|
|
explicitly onlined from user space.
|
|
|
|
The configured auto-online behavior can be observed via::
|
|
|
|
% cat /sys/devices/system/memory/auto_online_blocks
|
|
|
|
Auto-onlining can be enabled by writing ``online``, ``online_kernel`` or
|
|
``online_movable`` to that file, like::
|
|
|
|
% echo online > /sys/devices/system/memory/auto_online_blocks
|
|
|
|
Similarly to manual onlining, with ``online`` the kernel will select the
|
|
target zone automatically, depending on the configured ``online_policy``.
|
|
|
|
Modifying the auto-online behavior will only affect all subsequently added
|
|
memory blocks only.
|
|
|
|
.. note::
|
|
|
|
In corner cases, auto-onlining can fail. The kernel won't retry. Note that
|
|
auto-onlining is not expected to fail in default configurations.
|
|
|
|
.. note::
|
|
|
|
DLPAR on ppc64 ignores the ``offline`` setting and will still online added
|
|
memory blocks; if onlining fails, memory blocks are removed again.
|
|
|
|
Offlining Memory Blocks
|
|
-----------------------
|
|
|
|
In the current implementation, Linux's memory offlining will try migrating all
|
|
movable pages off the affected memory block. As most kernel allocations, such as
|
|
page tables, are unmovable, page migration can fail and, therefore, inhibit
|
|
memory offlining from succeeding.
|
|
|
|
Having the memory provided by memory block managed by ZONE_MOVABLE significantly
|
|
increases memory offlining reliability; still, memory offlining can fail in
|
|
some corner cases.
|
|
|
|
Further, memory offlining might retry for a long time (or even forever), until
|
|
aborted by the user.
|
|
|
|
Offlining of a memory block can be triggered via::
|
|
|
|
% echo offline > /sys/devices/system/memory/memoryXXX/state
|
|
|
|
Or alternatively::
|
|
|
|
% echo 0 > /sys/devices/system/memory/memoryXXX/online
|
|
|
|
If offlining succeeds, the state of the memory block is changed to be "offline".
|
|
If it fails, the state of the memory block will remain unchanged and the above
|
|
commands will fail, for example, via::
|
|
|
|
bash: echo: write error: Device or resource busy
|
|
|
|
or via::
|
|
|
|
bash: echo: write error: Invalid argument
|
|
|
|
Observing the State of Memory Blocks
|
|
------------------------------------
|
|
|
|
The state (online/offline/going-offline) of a memory block can be observed
|
|
either via::
|
|
|
|
% cat /sys/devices/system/memory/memoryXXX/state
|
|
|
|
Or alternatively (1/0) via::
|
|
|
|
% cat /sys/devices/system/memory/memoryXXX/online
|
|
|
|
For an online memory block, the managing zone can be observed via::
|
|
|
|
% cat /sys/devices/system/memory/memoryXXX/valid_zones
|
|
|
|
Configuring Memory Hot(Un)Plug
|
|
==============================
|
|
|
|
There are various ways how system administrators can configure memory
|
|
hot(un)plug and interact with memory blocks, especially, to online them.
|
|
|
|
Memory Hot(Un)Plug Configuration via Sysfs
|
|
------------------------------------------
|
|
|
|
Some memory hot(un)plug properties can be configured or inspected via sysfs in::
|
|
|
|
/sys/devices/system/memory/
|
|
|
|
The following files are currently defined:
|
|
|
|
====================== =========================================================
|
|
``auto_online_blocks`` read-write: set or get the default state of new memory
|
|
blocks; configure auto-onlining.
|
|
|
|
The default value depends on the
|
|
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration
|
|
option.
|
|
|
|
See the ``state`` property of memory blocks for details.
|
|
``block_size_bytes`` read-only: the size in bytes of a memory block.
|
|
``probe`` write-only: add (probe) selected memory blocks manually
|
|
from user space by supplying the physical start address.
|
|
|
|
Availability depends on the CONFIG_ARCH_MEMORY_PROBE
|
|
kernel configuration option.
|
|
``uevent`` read-write: generic udev file for device subsystems.
|
|
``crash_hotplug`` read-only: when changes to the system memory map
|
|
occur due to hot un/plug of memory, this file contains
|
|
'1' if the kernel updates the kdump capture kernel memory
|
|
map itself (via elfcorehdr and other relevant kexec
|
|
segments), or '0' if userspace must update the kdump
|
|
capture kernel memory map.
|
|
|
|
Availability depends on the CONFIG_MEMORY_HOTPLUG kernel
|
|
configuration option.
|
|
====================== =========================================================
|
|
|
|
.. note::
|
|
|
|
When the CONFIG_MEMORY_FAILURE kernel configuration option is enabled, two
|
|
additional files ``hard_offline_page`` and ``soft_offline_page`` are available
|
|
to trigger hwpoisoning of pages, for example, for testing purposes. Note that
|
|
this functionality is not really related to memory hot(un)plug or actual
|
|
offlining of memory blocks.
|
|
|
|
Memory Block Configuration via Sysfs
|
|
------------------------------------
|
|
|
|
Each memory block is represented as a memory block device that can be
|
|
onlined or offlined. All memory blocks have their device information located in
|
|
sysfs. Each present memory block is listed under
|
|
``/sys/devices/system/memory`` as::
|
|
|
|
/sys/devices/system/memory/memoryXXX
|
|
|
|
where XXX is the memory block id; the number of digits is variable.
|
|
|
|
A present memory block indicates that some memory in the range is present;
|
|
however, a memory block might span memory holes. A memory block spanning memory
|
|
holes cannot be offlined.
|
|
|
|
For example, assume 1 GiB memory block size. A device for a memory starting at
|
|
0x100000000 is ``/sys/devices/system/memory/memory4``::
|
|
|
|
(0x100000000 / 1Gib = 4)
|
|
|
|
This device covers address range [0x100000000 ... 0x140000000)
|
|
|
|
The following files are currently defined:
|
|
|
|
=================== ============================================================
|
|
``online`` read-write: simplified interface to trigger onlining /
|
|
offlining and to observe the state of a memory block.
|
|
When onlining, the zone is selected automatically.
|
|
``phys_device`` read-only: legacy interface only ever used on s390x to
|
|
expose the covered storage increment.
|
|
``phys_index`` read-only: the memory block id (XXX).
|
|
``removable`` read-only: legacy interface that indicated whether a memory
|
|
block was likely to be offlineable or not. Nowadays, the
|
|
kernel return ``1`` if and only if it supports memory
|
|
offlining.
|
|
``state`` read-write: advanced interface to trigger onlining /
|
|
offlining and to observe the state of a memory block.
|
|
|
|
When writing, ``online``, ``offline``, ``online_kernel`` and
|
|
``online_movable`` are supported.
|
|
|
|
``online_movable`` specifies onlining to ZONE_MOVABLE.
|
|
``online_kernel`` specifies onlining to the default kernel
|
|
zone for the memory block, such as ZONE_NORMAL.
|
|
``online`` let's the kernel select the zone automatically.
|
|
|
|
When reading, ``online``, ``offline`` and ``going-offline``
|
|
may be returned.
|
|
``uevent`` read-write: generic uevent file for devices.
|
|
``valid_zones`` read-only: when a block is online, shows the zone it
|
|
belongs to; when a block is offline, shows what zone will
|
|
manage it when the block will be onlined.
|
|
|
|
For online memory blocks, ``DMA``, ``DMA32``, ``Normal``,
|
|
``Movable`` and ``none`` may be returned. ``none`` indicates
|
|
that memory provided by a memory block is managed by
|
|
multiple zones or spans multiple nodes; such memory blocks
|
|
cannot be offlined. ``Movable`` indicates ZONE_MOVABLE.
|
|
Other values indicate a kernel zone.
|
|
|
|
For offline memory blocks, the first column shows the
|
|
zone the kernel would select when onlining the memory block
|
|
right now without further specifying a zone.
|
|
|
|
Availability depends on the CONFIG_MEMORY_HOTREMOVE
|
|
kernel configuration option.
|
|
=================== ============================================================
|
|
|
|
.. note::
|
|
|
|
If the CONFIG_NUMA kernel configuration option is enabled, the memoryXXX/
|
|
directories can also be accessed via symbolic links located in the
|
|
``/sys/devices/system/node/node*`` directories.
|
|
|
|
For example::
|
|
|
|
/sys/devices/system/node/node0/memory9 -> ../../memory/memory9
|
|
|
|
A backlink will also be created::
|
|
|
|
/sys/devices/system/memory/memory9/node0 -> ../../node/node0
|
|
|
|
Command Line Parameters
|
|
-----------------------
|
|
|
|
Some command line parameters affect memory hot(un)plug handling. The following
|
|
command line parameters are relevant:
|
|
|
|
======================== =======================================================
|
|
``memhp_default_state`` configure auto-onlining by essentially setting
|
|
``/sys/devices/system/memory/auto_online_blocks``.
|
|
``movable_node`` configure automatic zone selection in the kernel when
|
|
using the ``contig-zones`` online policy. When
|
|
set, the kernel will default to ZONE_MOVABLE when
|
|
onlining a memory block, unless other zones can be kept
|
|
contiguous.
|
|
======================== =======================================================
|
|
|
|
See Documentation/admin-guide/kernel-parameters.txt for a more generic
|
|
description of these command line parameters.
|
|
|
|
Module Parameters
|
|
------------------
|
|
|
|
Instead of additional command line parameters or sysfs files, the
|
|
``memory_hotplug`` subsystem now provides a dedicated namespace for module
|
|
parameters. Module parameters can be set via the command line by predicating
|
|
them with ``memory_hotplug.`` such as::
|
|
|
|
memory_hotplug.memmap_on_memory=1
|
|
|
|
and they can be observed (and some even modified at runtime) via::
|
|
|
|
/sys/module/memory_hotplug/parameters/
|
|
|
|
The following module parameters are currently defined:
|
|
|
|
================================ ===============================================
|
|
``memmap_on_memory`` read-write: Allocate memory for the memmap from
|
|
the added memory block itself. Even if enabled,
|
|
actual support depends on various other system
|
|
properties and should only be regarded as a
|
|
hint whether the behavior would be desired.
|
|
|
|
While allocating the memmap from the memory
|
|
block itself makes memory hotplug less likely
|
|
to fail and keeps the memmap on the same NUMA
|
|
node in any case, it can fragment physical
|
|
memory in a way that huge pages in bigger
|
|
granularity cannot be formed on hotplugged
|
|
memory.
|
|
|
|
With value "force" it could result in memory
|
|
wastage due to memmap size limitations. For
|
|
example, if the memmap for a memory block
|
|
requires 1 MiB, but the pageblock size is 2
|
|
MiB, 1 MiB of hotplugged memory will be wasted.
|
|
Note that there are still cases where the
|
|
feature cannot be enforced: for example, if the
|
|
memmap is smaller than a single page, or if the
|
|
architecture does not support the forced mode
|
|
in all configurations.
|
|
|
|
``online_policy`` read-write: Set the basic policy used for
|
|
automatic zone selection when onlining memory
|
|
blocks without specifying a target zone.
|
|
``contig-zones`` has been the kernel default
|
|
before this parameter was added. After an
|
|
online policy was configured and memory was
|
|
online, the policy should not be changed
|
|
anymore.
|
|
|
|
When set to ``contig-zones``, the kernel will
|
|
try keeping zones contiguous. If a memory block
|
|
intersects multiple zones or no zone, the
|
|
behavior depends on the ``movable_node`` kernel
|
|
command line parameter: default to ZONE_MOVABLE
|
|
if set, default to the applicable kernel zone
|
|
(usually ZONE_NORMAL) if not set.
|
|
|
|
When set to ``auto-movable``, the kernel will
|
|
try onlining memory blocks to ZONE_MOVABLE if
|
|
possible according to the configuration and
|
|
memory device details. With this policy, one
|
|
can avoid zone imbalances when eventually
|
|
hotplugging a lot of memory later and still
|
|
wanting to be able to hotunplug as much as
|
|
possible reliably, very desirable in
|
|
virtualized environments. This policy ignores
|
|
the ``movable_node`` kernel command line
|
|
parameter and isn't really applicable in
|
|
environments that require it (e.g., bare metal
|
|
with hotunpluggable nodes) where hotplugged
|
|
memory might be exposed via the
|
|
firmware-provided memory map early during boot
|
|
to the system instead of getting detected,
|
|
added and onlined later during boot (such as
|
|
done by virtio-mem or by some hypervisors
|
|
implementing emulated DIMMs). As one example, a
|
|
hotplugged DIMM will be onlined either
|
|
completely to ZONE_MOVABLE or completely to
|
|
ZONE_NORMAL, not a mixture.
|
|
As another example, as many memory blocks
|
|
belonging to a virtio-mem device will be
|
|
onlined to ZONE_MOVABLE as possible,
|
|
special-casing units of memory blocks that can
|
|
only get hotunplugged together. *This policy
|
|
does not protect from setups that are
|
|
problematic with ZONE_MOVABLE and does not
|
|
change the zone of memory blocks dynamically
|
|
after they were onlined.*
|
|
``auto_movable_ratio`` read-write: Set the maximum MOVABLE:KERNEL
|
|
memory ratio in % for the ``auto-movable``
|
|
online policy. Whether the ratio applies only
|
|
for the system across all NUMA nodes or also
|
|
per NUMA nodes depends on the
|
|
``auto_movable_numa_aware`` configuration.
|
|
|
|
All accounting is based on present memory pages
|
|
in the zones combined with accounting per
|
|
memory device. Memory dedicated to the CMA
|
|
allocator is accounted as MOVABLE, although
|
|
residing on one of the kernel zones. The
|
|
possible ratio depends on the actual workload.
|
|
The kernel default is "301" %, for example,
|
|
allowing for hotplugging 24 GiB to a 8 GiB VM
|
|
and automatically onlining all hotplugged
|
|
memory to ZONE_MOVABLE in many setups. The
|
|
additional 1% deals with some pages being not
|
|
present, for example, because of some firmware
|
|
allocations.
|
|
|
|
Note that ZONE_NORMAL memory provided by one
|
|
memory device does not allow for more
|
|
ZONE_MOVABLE memory for a different memory
|
|
device. As one example, onlining memory of a
|
|
hotplugged DIMM to ZONE_NORMAL will not allow
|
|
for another hotplugged DIMM to get onlined to
|
|
ZONE_MOVABLE automatically. In contrast, memory
|
|
hotplugged by a virtio-mem device that got
|
|
onlined to ZONE_NORMAL will allow for more
|
|
ZONE_MOVABLE memory within *the same*
|
|
virtio-mem device.
|
|
``auto_movable_numa_aware`` read-write: Configure whether the
|
|
``auto_movable_ratio`` in the ``auto-movable``
|
|
online policy also applies per NUMA
|
|
node in addition to the whole system across all
|
|
NUMA nodes. The kernel default is "Y".
|
|
|
|
Disabling NUMA awareness can be helpful when
|
|
dealing with NUMA nodes that should be
|
|
completely hotunpluggable, onlining the memory
|
|
completely to ZONE_MOVABLE automatically if
|
|
possible.
|
|
|
|
Parameter availability depends on CONFIG_NUMA.
|
|
================================ ===============================================
|
|
|
|
ZONE_MOVABLE
|
|
============
|
|
|
|
ZONE_MOVABLE is an important mechanism for more reliable memory offlining.
|
|
Further, having system RAM managed by ZONE_MOVABLE instead of one of the
|
|
kernel zones can increase the number of possible transparent huge pages and
|
|
dynamically allocated huge pages.
|
|
|
|
Most kernel allocations are unmovable. Important examples include the memory
|
|
map (usually 1/64ths of memory), page tables, and kmalloc(). Such allocations
|
|
can only be served from the kernel zones.
|
|
|
|
Most user space pages, such as anonymous memory, and page cache pages are
|
|
movable. Such allocations can be served from ZONE_MOVABLE and the kernel zones.
|
|
|
|
Only movable allocations are served from ZONE_MOVABLE, resulting in unmovable
|
|
allocations being limited to the kernel zones. Without ZONE_MOVABLE, there is
|
|
absolutely no guarantee whether a memory block can be offlined successfully.
|
|
|
|
Zone Imbalances
|
|
---------------
|
|
|
|
Having too much system RAM managed by ZONE_MOVABLE is called a zone imbalance,
|
|
which can harm the system or degrade performance. As one example, the kernel
|
|
might crash because it runs out of free memory for unmovable allocations,
|
|
although there is still plenty of free memory left in ZONE_MOVABLE.
|
|
|
|
Usually, MOVABLE:KERNEL ratios of up to 3:1 or even 4:1 are fine. Ratios of 63:1
|
|
are definitely impossible due to the overhead for the memory map.
|
|
|
|
Actual safe zone ratios depend on the workload. Extreme cases, like excessive
|
|
long-term pinning of pages, might not be able to deal with ZONE_MOVABLE at all.
|
|
|
|
.. note::
|
|
|
|
CMA memory part of a kernel zone essentially behaves like memory in
|
|
ZONE_MOVABLE and similar considerations apply, especially when combining
|
|
CMA with ZONE_MOVABLE.
|
|
|
|
ZONE_MOVABLE Sizing Considerations
|
|
----------------------------------
|
|
|
|
We usually expect that a large portion of available system RAM will actually
|
|
be consumed by user space, either directly or indirectly via the page cache. In
|
|
the normal case, ZONE_MOVABLE can be used when allocating such pages just fine.
|
|
|
|
With that in mind, it makes sense that we can have a big portion of system RAM
|
|
managed by ZONE_MOVABLE. However, there are some things to consider when using
|
|
ZONE_MOVABLE, especially when fine-tuning zone ratios:
|
|
|
|
- Having a lot of offline memory blocks. Even offline memory blocks consume
|
|
memory for metadata and page tables in the direct map; having a lot of offline
|
|
memory blocks is not a typical case, though.
|
|
|
|
- Memory ballooning without balloon compaction is incompatible with
|
|
ZONE_MOVABLE. Only some implementations, such as virtio-balloon and
|
|
pseries CMM, fully support balloon compaction.
|
|
|
|
Further, the CONFIG_BALLOON_COMPACTION kernel configuration option might be
|
|
disabled. In that case, balloon inflation will only perform unmovable
|
|
allocations and silently create a zone imbalance, usually triggered by
|
|
inflation requests from the hypervisor.
|
|
|
|
- Gigantic pages are unmovable, resulting in user space consuming a
|
|
lot of unmovable memory.
|
|
|
|
- Huge pages are unmovable when an architectures does not support huge
|
|
page migration, resulting in a similar issue as with gigantic pages.
|
|
|
|
- Page tables are unmovable. Excessive swapping, mapping extremely large
|
|
files or ZONE_DEVICE memory can be problematic, although only really relevant
|
|
in corner cases. When we manage a lot of user space memory that has been
|
|
swapped out or is served from a file/persistent memory/... we still need a lot
|
|
of page tables to manage that memory once user space accessed that memory.
|
|
|
|
- In certain DAX configurations the memory map for the device memory will be
|
|
allocated from the kernel zones.
|
|
|
|
- KASAN can have a significant memory overhead, for example, consuming 1/8th of
|
|
the total system memory size as (unmovable) tracking metadata.
|
|
|
|
- Long-term pinning of pages. Techniques that rely on long-term pinnings
|
|
(especially, RDMA and vfio/mdev) are fundamentally problematic with
|
|
ZONE_MOVABLE, and therefore, memory offlining. Pinned pages cannot reside
|
|
on ZONE_MOVABLE as that would turn these pages unmovable. Therefore, they
|
|
have to be migrated off that zone while pinning. Pinning a page can fail
|
|
even if there is plenty of free memory in ZONE_MOVABLE.
|
|
|
|
In addition, using ZONE_MOVABLE might make page pinning more expensive,
|
|
because of the page migration overhead.
|
|
|
|
By default, all the memory configured at boot time is managed by the kernel
|
|
zones and ZONE_MOVABLE is not used.
|
|
|
|
To enable ZONE_MOVABLE to include the memory present at boot and to control the
|
|
ratio between movable and kernel zones there are two command line options:
|
|
``kernelcore=`` and ``movablecore=``. See
|
|
Documentation/admin-guide/kernel-parameters.rst for their description.
|
|
|
|
Memory Offlining and ZONE_MOVABLE
|
|
---------------------------------
|
|
|
|
Even with ZONE_MOVABLE, there are some corner cases where offlining a memory
|
|
block might fail:
|
|
|
|
- Memory blocks with memory holes; this applies to memory blocks present during
|
|
boot and can apply to memory blocks hotplugged via the XEN balloon and the
|
|
Hyper-V balloon.
|
|
|
|
- Mixed NUMA nodes and mixed zones within a single memory block prevent memory
|
|
offlining; this applies to memory blocks present during boot only.
|
|
|
|
- Special memory blocks prevented by the system from getting offlined. Examples
|
|
include any memory available during boot on arm64 or memory blocks spanning
|
|
the crashkernel area on s390x; this usually applies to memory blocks present
|
|
during boot only.
|
|
|
|
- Memory blocks overlapping with CMA areas cannot be offlined, this applies to
|
|
memory blocks present during boot only.
|
|
|
|
- Concurrent activity that operates on the same physical memory area, such as
|
|
allocating gigantic pages, can result in temporary offlining failures.
|
|
|
|
- Out of memory when dissolving huge pages, especially when HugeTLB Vmemmap
|
|
Optimization (HVO) is enabled.
|
|
|
|
Offlining code may be able to migrate huge page contents, but may not be able
|
|
to dissolve the source huge page because it fails allocating (unmovable) pages
|
|
for the vmemmap, because the system might not have free memory in the kernel
|
|
zones left.
|
|
|
|
Users that depend on memory offlining to succeed for movable zones should
|
|
carefully consider whether the memory savings gained from this feature are
|
|
worth the risk of possibly not being able to offline memory in certain
|
|
situations.
|
|
|
|
Further, when running into out of memory situations while migrating pages, or
|
|
when still encountering permanently unmovable pages within ZONE_MOVABLE
|
|
(-> BUG), memory offlining will keep retrying until it eventually succeeds.
|
|
|
|
When offlining is triggered from user space, the offlining context can be
|
|
terminated by sending a signal. A timeout based offlining can easily be
|
|
implemented via::
|
|
|
|
% timeout $TIMEOUT offline_block | failure_handling
|