linux/drivers/base
David Hildenbrand 395f6081ba drivers/base/memory: determine and store zone for single-zone memory blocks
test_pages_in_a_zone() is just another nasty PFN walker that can easily
stumble over ZONE_DEVICE memory ranges falling into the same memory block
as ordinary system RAM: the memmap of parts of these ranges might possibly
be uninitialized.  In fact, we observed (on an older kernel) with UBSAN:

  UBSAN: Undefined behaviour in ./include/linux/mm.h:1133:50
  index 7 is out of range for type 'zone [5]'
  CPU: 121 PID: 35603 Comm: read_all Kdump: loaded Tainted: [...]
  Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.12.2 11/15/2019
  Call Trace:
   dump_stack+0x9a/0xf0
   ubsan_epilogue+0x9/0x7a
   __ubsan_handle_out_of_bounds+0x13a/0x181
   test_pages_in_a_zone+0x3c4/0x500
   show_valid_zones+0x1fa/0x380
   dev_attr_show+0x43/0xb0
   sysfs_kf_seq_show+0x1c5/0x440
   seq_read+0x49d/0x1190
   vfs_read+0xff/0x300
   ksys_read+0xb8/0x170
   do_syscall_64+0xa5/0x4b0
   entry_SYSCALL_64_after_hwframe+0x6a/0xdf
  RIP: 0033:0x7f01f4439b52

We seem to stumble over a memmap that contains a garbage zone id.  While
we could try inserting pfn_to_online_page() calls, it will just make
memory offlining slower, because we use test_pages_in_a_zone() to make
sure we're offlining pages that all belong to the same zone.

Let's just get rid of this PFN walker and determine the single zone of a
memory block -- if any -- for early memory blocks during boot.  For memory
onlining, we know the single zone already.  Let's avoid any additional
memmap scanning and just rely on the zone information available during
boot.

For memory hot(un)plug, we only really care about memory blocks that:
* span a single zone (and, thereby, a single node)
* are completely System RAM (IOW, no holes, no ZONE_DEVICE)
If one of these conditions is not met, we reject memory offlining.
Hotplugged memory blocks (starting out offline), always meet both
conditions.

There are three scenarios to handle:

(1) Memory hot(un)plug

A memory block with zone == NULL cannot be offlined, corresponding to
our previous test_pages_in_a_zone() check.

After successful memory onlining/offlining, we simply set the zone
accordingly.
* Memory onlining: set the zone we just used for onlining
* Memory offlining: set zone = NULL

So a hotplugged memory block starts with zone = NULL. Once memory
onlining is done, we set the proper zone.

(2) Boot memory with !CONFIG_NUMA

We know that there is just a single pgdat, so we simply scan all zones
of that pgdat for an intersection with our memory block PFN range when
adding the memory block. If more than one zone intersects (e.g., DMA and
DMA32 on x86 for the first memory block) we set zone = NULL and
consequently mimic what test_pages_in_a_zone() used to do.

(3) Boot memory with CONFIG_NUMA

At the point in time we create the memory block devices during boot, we
don't know yet which nodes *actually* span a memory block. While we could
scan all zones of all nodes for intersections, overlapping nodes complicate
the situation and scanning all nodes is possibly expensive. But that
problem has already been solved by the code that sets the node of a memory
block and creates the link in the sysfs --
do_register_memory_block_under_node().

So, we hook into the code that sets the node id for a memory block. If
we already have a different node id set for the memory block, we know
that multiple nodes *actually* have PFNs falling into our memory block:
we set zone = NULL and consequently mimic what test_pages_in_a_zone() used
to do. If there is no node id set, we do the same as (2) for the given
node.

Note that the call order in driver_init() is:
-> memory_dev_init(): create memory block devices
-> node_dev_init(): link memory block devices to the node and set the
		    node id

So in summary, we detect if there is a single zone responsible for this
memory block and we consequently store the zone in that case in the
memory block, updating it during memory onlining/offlining.

Link: https://lkml.kernel.org/r/20220210184359.235565-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Rafael Parra <rparrazo@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rafael Parra <rparrazo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:10 -07:00
..
firmware_loader firmware_loader: move firmware sysctl to its own files 2022-01-22 08:33:35 +02:00
power PM: s2idle: ACPI: Fix wakeup interrupts handling 2022-02-07 21:02:31 +01:00
regmap regmap-irq: Update interrupt clear register for proper reset 2022-02-17 15:33:15 +00:00
test driver core: Simplify async probe test code by using ktime_ms_delta() 2021-12-29 10:57:22 +01:00
arch_numa.c mm: percpu: add generic pcpu_populate_pte() function 2022-01-20 08:52:52 +02:00
arch_topology.c arch_topology: Remove unused topology_set_thermal_pressure() and related 2021-11-23 15:10:26 +05:30
attribute_container.c driver core: attribute_container: fix W=1 warnings 2021-05-14 13:37:10 +02:00
auxiliary.c Documentation/auxiliary_bus: Move the text into the code 2021-12-03 16:41:50 +01:00
base.h software nodes: Split software_node_notify() 2021-07-16 19:17:05 +02:00
bus.c kobject: remove kset from struct kset_uevent_ops callbacks 2021-12-28 11:26:18 +01:00
cacheinfo.c cacheinfo: clear cache_leaves(cpu) in free_cache_attributes() 2021-07-21 17:29:40 +02:00
class.c drivers: base: fix some kernel-doc markups 2020-11-09 18:56:49 +01:00
component.c component: do not leave master devres group open after bind 2021-10-21 13:01:56 +02:00
container.c
core.c Rework of the MSI interrupt infrastructure: 2022-01-13 09:05:29 -08:00
cpu.c driver: base: Prefer unsigned int to bare use of unsigned 2021-07-21 17:30:09 +02:00
dd.c driver core: Free DMA range map when device is released 2022-02-22 08:40:02 +01:00
devcoredump.c devcoredump: remove contact information 2021-06-04 15:05:44 +02:00
devres.c devres: Enable trace events 2021-06-15 17:14:36 +02:00
devtmpfs.c devtmpfs regression fix: reconfigure on each mount 2022-01-17 09:40:29 +02:00
driver.c drivers: base: Convert to printk alias functions 2020-07-10 14:16:44 +02:00
firmware.c
hypervisor.c
init.c drivers/base/node: consolidate node device subsystem initialization in node_dev_init() 2022-03-22 15:57:10 -07:00
isa.c bus: Make remove callback return void 2021-07-21 11:53:42 +02:00
Kconfig devtmpfs: mount with noexec and nosuid 2021-12-30 13:54:42 +01:00
Makefile mm/memory_hotplug: remove CONFIG_MEMORY_HOTPLUG_SPARSE 2021-11-06 13:30:42 -07:00
map.c driver: base: Prefer unsigned int to bare use of unsigned 2021-07-21 17:30:09 +02:00
memory.c drivers/base/memory: determine and store zone for single-zone memory blocks 2022-03-22 15:57:10 -07:00
module.c
node.c drivers/base/memory: determine and store zone for single-zone memory blocks 2022-03-22 15:57:10 -07:00
pinctrl.c
platform-msi.c platform-msi: Simplify platform device MSI code 2021-12-16 22:22:19 +01:00
platform.c driver core: platform: document registration-failure requirement 2021-12-22 13:59:33 +01:00
property.c Char/Misc and other driver changes for 5.17-rc1 2022-01-14 16:02:28 +01:00
soc.c soc: fix comment for freeing soc_dev_attr 2020-12-09 19:46:31 +01:00
swnode.c software node: fix wrong node passed to find nargs_prop 2021-12-22 18:26:18 +01:00
syscore.c syscore: Use pm_pr_dbg() for syscore_{suspend,resume}() 2020-09-08 13:32:06 +02:00
topology.c topology/sysfs: rework book and drawer topology ifdefery 2021-12-03 15:58:27 +01:00
trace.c devres: Enable trace events 2021-06-15 17:14:36 +02:00
trace.h devres: Enable trace events 2021-06-15 17:14:36 +02:00
transport_class.c scsi: drivers: base: Propagate errors through the transport component 2020-01-15 22:55:37 -05:00