mainlining shenanigans
Go to file
Dave Hansen 79c28a4167 mm/numa: automatically generate node migration order
Patch series "Migrate Pages in lieu of discard", v11.

We're starting to see systems with more and more kinds of memory such as
Intel's implementation of persistent memory.

Let's say you have a system with some DRAM and some persistent memory.
Today, once DRAM fills up, reclaim will start and some of the DRAM
contents will be thrown out.  Allocations will, at some point, start
falling over to the slower persistent memory.

That has two nasty properties.  First, the newer allocations can end up in
the slower persistent memory.  Second, reclaimed data in DRAM are just
discarded even if there are gobs of space in persistent memory that could
be used.

This patchset implements a solution to these problems.  At the end of the
reclaim process in shrink_page_list() just before the last page refcount
is dropped, the page is migrated to persistent memory instead of being
dropped.

While I've talked about a DRAM/PMEM pairing, this approach would function
in any environment where memory tiers exist.

This is not perfect.  It "strands" pages in slower memory and never brings
them back to fast DRAM.  Huang Ying has follow-on work which repurposes
NUMA balancing to promote hot pages back to DRAM.

This is also all based on an upstream mechanism that allows persistent
memory to be onlined and used as if it were volatile:

	http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com

With that, the DRAM and PMEM in each socket will be represented as 2
separate NUMA nodes, with the CPUs sit in the DRAM node.  So the
general inter-NUMA demotion mechanism introduced in the patchset can
migrate the cold DRAM pages to the PMEM node.

We have tested the patchset with the postgresql and pgbench.  On a
2-socket server machine with DRAM and PMEM, the kernel with the patchset
can improve the score of pgbench up to 22.1% compared with that of the
DRAM only + disk case.  This comes from the reduced disk read throughput
(which reduces up to 70.8%).

== Open Issues ==

 * Memory policies and cpusets that, for instance, restrict allocations
   to DRAM can be demoted to PMEM whenever they opt in to this
   new mechanism.  A cgroup-level API to opt-in or opt-out of
   these migrations will likely be required as a follow-on.
 * Could be more aggressive about where anon LRU scanning occurs
   since it no longer necessarily involves I/O.  get_scan_count()
   for instance says: "If we have no swap space, do not bother
   scanning anon pages"

This patch (of 9):

Prepare for the kernel to auto-migrate pages to other memory nodes with a
node migration table.  This allows creating single migration target for
each NUMA node to enable the kernel to do NUMA page migrations instead of
simply discarding colder pages.  A node with no target is a "terminal
node", so reclaim acts normally there.  The migration target does not
fundamentally _need_ to be a single node, but this implementation starts
there to limit complexity.

When memory fills up on a node, memory contents can be automatically
migrated to another node.  The biggest problems are knowing when to
migrate and to where the migration should be targeted.

The most straightforward way to generate the "to where" list would be to
follow the page allocator fallback lists.  Those lists already tell us if
memory is full where to look next.  It would also be logical to move
memory in that order.

But, the allocator fallback lists have a fatal flaw: most nodes appear in
all the lists.  This would potentially lead to migration cycles (A->B,
B->A, A->B, ...).

Instead of using the allocator fallback lists directly, keep a separate
node migration ordering.  But, reuse the same data used to generate page
allocator fallback in the first place: find_next_best_node().

This means that the firmware data used to populate node distances
essentially dictates the ordering for now.  It should also be
architecture-neutral since all NUMA architectures have a working
find_next_best_node().

RCU is used to allow lock-less read of node_demotion[] and prevent
demotion cycles been observed.  If multiple reads of node_demotion[] are
performed, a single rcu_read_lock() must be held over all reads to ensure
no cycles are observed.  Details are as follows.

=== What does RCU provide? ===

Imagine a simple loop which walks down the demotion path looking
for the last node:

        terminal_node = start_node;
        while (node_demotion[terminal_node] != NUMA_NO_NODE) {
                terminal_node = node_demotion[terminal_node];
        }

The initial values are:

        node_demotion[0] = 1;
        node_demotion[1] = NUMA_NO_NODE;

and are updated to:

        node_demotion[0] = NUMA_NO_NODE;
        node_demotion[1] = 0;

What guarantees that the cycle is not observed:

        node_demotion[0] = 1;
        node_demotion[1] = 0;

and would loop forever?

With RCU, a rcu_read_lock/unlock() can be placed around the loop.  Since
the write side does a synchronize_rcu(), the loop that observed the old
contents is known to be complete before the synchronize_rcu() has
completed.

RCU, combined with disable_all_migrate_targets(), ensures that the old
migration state is not visible by the time __set_migration_target_nodes()
is called.

=== What does READ_ONCE() provide? ===

READ_ONCE() forbids the compiler from merging or reordering successive
reads of node_demotion[].  This ensures that any updates are *eventually*
observed.

Consider the above loop again.  The compiler could theoretically read the
entirety of node_demotion[] into local storage (registers) and never go
back to memory, and *permanently* observe bad values for node_demotion[].

Note: RCU does not provide any universal compiler-ordering
guarantees:

	https://lore.kernel.org/lkml/20150921204327.GH4029@linux.vnet.ibm.com/

This code is unused for now.  It will be called later in the
series.

Link: https://lkml.kernel.org/r/20210721063926.3024591-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20210715055145.195411-2-ying.huang@intel.com
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Wei Xu <weixugc@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-03 09:58:16 -07:00
arch microblaze: simplify pte_alloc_one_kernel() 2021-09-03 09:58:15 -07:00
block mm: remove flush_kernel_dcache_page 2021-09-03 09:58:13 -07:00
certs Kbuild updates for v5.13 (2nd) 2021-05-08 10:00:11 -07:00
crypto crypto: drbg - select SHA512 2021-07-16 15:49:31 +08:00
Documentation doc: hwpoison: correct the support for hugepage 2021-09-03 09:58:15 -07:00
drivers mm: sparse: pass section_nr to find_memory_block 2021-09-03 09:58:14 -07:00
fs userfaultfd: prevent concurrent API initialization 2021-09-03 09:58:16 -07:00
include userfaultfd: change mmap_changing to atomic 2021-09-03 09:58:16 -07:00
init init: Suppress wrong warning for bootconfig cmdline parameter 2021-08-12 13:35:57 -04:00
ipc memcg: enable accounting of ipc resources 2021-09-03 09:58:12 -07:00
kernel memcg: enable accounting for posix_timers_cache slab 2021-09-03 09:58:13 -07:00
lib kasan: test: avoid corrupting memory in kasan_rcu_uaf 2021-09-03 09:58:15 -07:00
LICENSES LICENSES/dual/CC-BY-4.0: Git rid of "smart quotes" 2021-07-15 06:31:24 -06:00
mm mm/numa: automatically generate node migration order 2021-09-03 09:58:16 -07:00
net This is a one-liner fix for a serious bug that can cause the server to 2021-08-26 13:26:40 -07:00
samples Networking fixes for 5.14-rc2, including fixes from bpf and netfilter. 2021-07-14 09:24:32 -07:00
scripts Kbuild fixes for v5.14 (2nd) 2021-08-07 10:03:02 -07:00
security mm/pagemap: add mmap_assert_locked() annotations to find_vma*() 2021-09-03 09:58:13 -07:00
sound another sound-fixes for 5.14-rc7 2021-08-20 12:31:10 -07:00
tools selftests/vm/userfaultfd: wake after copy failure 2021-09-03 09:58:16 -07:00
usr .gitignore: prefix local generated files with a slash 2021-05-02 00:43:35 +09:00
virt KVM: Do not leak memory for duplicate debugfs directories 2021-08-04 06:02:03 -04:00
.clang-format clang-format: Update with the latest for_each macro list 2021-05-12 23:32:39 +02:00
.cocciconfig
.get_maintainer.ignore Opt out of scripts/get_maintainer.pl 2019-05-16 10:53:40 -07:00
.gitattributes .gitattributes: use 'dts' diff driver for dts files 2019-12-04 19:44:11 -08:00
.gitignore .gitignore: ignore only top-level modules.builtin 2021-05-02 00:43:35 +09:00
.mailmap m68k updates for v5.14 2021-06-28 14:01:03 -07:00
COPYING COPYING: state that all contributions really are covered by this file 2020-02-10 13:32:20 -08:00
CREDITS MAINTAINERS: move Murali Karicheri to credits 2021-04-29 15:47:30 -07:00
Kbuild kbuild: rename hostprogs-y/always to hostprogs/always-y 2020-02-04 01:53:07 +09:00
Kconfig kbuild: ensure full rebuild when the compiler is updated 2020-05-12 13:28:33 +09:00
MAINTAINERS MAINTAINERS: exfat: update my email address 2021-08-25 12:25:12 -07:00
Makefile Linux 5.14 2021-08-29 15:04:50 -07:00
README Drop all 00-INDEX files from Documentation/ 2018-09-09 15:08:58 -06:00

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.