linux/mm
Hugh Dickins 4146d2d673 ksm: make !merge_across_nodes migration safe
The new KSM NUMA merge_across_nodes knob introduces a problem, when it's
set to non-default 0: if a KSM page is migrated to a different NUMA node,
how do we migrate its stable node to the right tree?  And what if that
collides with an existing stable node?

ksm_migrate_page() can do no more than it's already doing, updating
stable_node->kpfn: the stable tree itself cannot be manipulated without
holding ksm_thread_mutex.  So accept that a stable tree may temporarily
indicate a page belonging to the wrong NUMA node, leave updating until the
next pass of ksmd, just be careful not to merge other pages on to a
misplaced page.  Note nid of holding tree in stable_node, and recognize
that it will not always match nid of kpfn.

A misplaced KSM page is discovered, either when ksm_do_scan() next comes
around to one of its rmap_items (we now have to go to cmp_and_merge_page
even on pages in a stable tree), or when stable_tree_search() arrives at a
matching node for another page, and this node page is found misplaced.

In each case, move the misplaced stable_node to a list of migrate_nodes
(and use the address of migrate_nodes as magic by which to identify them):
we don't need them in a tree.  If stable_tree_search() finds no match for
a page, but it's currently exiled to this list, then slot its stable_node
right there into the tree, bringing all of its mappings with it; otherwise
they get migrated one by one to the original page of the colliding node.
stable_tree_search() is now modelled more like stable_tree_insert(), in
order to handle these insertions of migrated nodes.

remove_node_from_stable_tree(), remove_all_stable_nodes() and
ksm_check_stable_tree() have to handle the migrate_nodes list as well as
the stable tree itself.  Less obviously, we do need to prune the list of
stale entries from time to time (scan_get_next_rmap_item() does it once
each full scan): whereas stale nodes in the stable tree get naturally
pruned as searches try to brush past them, these migrate_nodes may get
forgotten and accumulate.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:19 -08:00
..
backing-dev.c bdi: allow block devices to say that they require stable page writes 2013-02-21 17:22:19 -08:00
balloon_compaction.c mm: introduce a common interface for balloon pages mobility 2012-12-11 17:22:26 -08:00
bootmem.c mm: Add alloc_bootmem_low_pages_nopanic() 2013-01-29 19:32:59 -08:00
bounce.c block: optionally snapshot page contents to provide stable pages during write 2013-02-21 17:22:20 -08:00
cleancache.c ->encode_fh() API change 2012-05-29 23:28:33 -04:00
compaction.c mm: remove MIGRATE_ISOLATE check in hotpath 2013-02-23 17:50:15 -08:00
debug-pagealloc.c mm, x86: Remove debug_pagealloc_enabled 2011-12-06 09:24:07 +01:00
dmapool.c dmapool: make DMAPOOL_DEBUG detect corruption of free marker 2012-12-11 17:22:24 -08:00
fadvise.c switch simple cases of fget_light to fdget 2012-09-26 22:20:08 -04:00
failslab.c switch debugfs to umode_t 2012-01-03 22:54:56 -05:00
filemap_xip.c mm: move all mmu notifier invocations to be done outside the PT lock 2012-10-09 16:22:58 +09:00
filemap.c mm: only enforce stable page writes if the backing device requires it 2013-02-21 17:22:19 -08:00
fremap.c mm: introduce VM_POPULATE flag to better deal with racy userspace programs 2013-02-23 17:50:11 -08:00
frontswap.c frontswap: support exclusive gets if tmem backend is capable 2012-09-21 10:38:12 -04:00
highmem.c Some nice cleanups, and even a patch my wife did as a "live" demo for 2012-12-20 08:37:05 -08:00
huge_memory.c mm: rename page struct field helpers 2013-02-23 17:50:18 -08:00
hugetlb_cgroup.c mm/hugetlb: create hugetlb cgroup file in hugetlb_init 2012-12-18 15:02:15 -08:00
hugetlb.c mm/hugetlb.c: convert to pr_foo() 2013-02-23 17:50:09 -08:00
hwpoison-inject.c memcg: rename config variables 2012-07-31 18:42:43 -07:00
init-mm.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
internal.h mm: directly use __mlock_vma_pages_range() in find_extend_vma() 2013-02-23 17:50:11 -08:00
interval_tree.c mm: add CONFIG_DEBUG_VM_RB build option 2012-10-09 16:22:42 +09:00
Kconfig memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap 2013-02-23 17:50:12 -08:00
Kconfig.debug mm: more intensive memory corruption debugging 2012-01-10 16:30:42 -08:00
kmemcheck.c
kmemleak-test.c
kmemleak.c mm/kmemleak.c: remove obsolete simple_strtoul 2012-12-18 15:02:15 -08:00
ksm.c ksm: make !merge_across_nodes migration safe 2013-02-23 17:50:19 -08:00
maccess.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
madvise.c mm: make madvise(MADV_WILLNEED) support swap file prefetch 2013-02-23 17:50:10 -08:00
Makefile mm: introduce a common interface for balloon pages mobility 2012-12-11 17:22:26 -08:00
memblock.c mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in memblock_overlaps_region(). 2013-02-23 17:50:14 -08:00
memcontrol.c memcg: avoid dangling reference count in creation failure. 2013-02-23 17:50:18 -08:00
memory_hotplug.c mm: increase totalram_pages when free pages allocated by bootmem allocator 2013-02-23 17:50:15 -08:00
memory-failure.c mm/memory-failure.c: fix wrong num_poisoned_pages in handling memory error on thp 2013-02-23 17:50:15 -08:00
memory.c ksm: remove old stable nodes more thoroughly 2013-02-23 17:50:19 -08:00
mempolicy.c mm: rename page struct field helpers 2013-02-23 17:50:18 -08:00
mempool.c mempool: add @gfp_mask to mempool_create_node() 2012-06-25 11:53:47 +02:00
migrate.c ksm: make KSM page migration possible 2013-02-23 17:50:19 -08:00
mincore.c swap: make each swap partition have one address_space 2013-02-23 17:50:17 -08:00
mlock.c mm: introduce VM_POPULATE flag to better deal with racy userspace programs 2013-02-23 17:50:11 -08:00
mm_init.c mm: init: report on last-nid information stored in page->flags 2013-02-23 17:50:18 -08:00
mmap.c mm/rmap: rename anon_vma_unlock() => anon_vma_unlock_write() 2013-02-23 17:50:17 -08:00
mmu_context.c mm, counters: remove task argument to sync_mm_rss() and __sync_task_rss_stat() 2012-03-21 17:54:59 -07:00
mmu_notifier.c mm/mmu_notifier: allocate mmu_notifier in advance 2012-10-25 14:37:53 -07:00
mmzone.c mm: rename page struct field helpers 2013-02-23 17:50:18 -08:00
mprotect.c mm/mprotect.c: coding-style cleanups 2012-12-18 15:02:15 -08:00
mremap.c mm/rmap: rename anon_vma_unlock() => anon_vma_unlock_write() 2013-02-23 17:50:17 -08:00
msync.c
nobootmem.c mm: Add alloc_bootmem_low_pages_nopanic() 2013-01-29 19:32:59 -08:00
nommu.c swap: add per-partition lock for swapfile 2013-02-23 17:50:17 -08:00
oom_kill.c memcg, oom: provide more precise dump info while memcg oom happening 2013-02-23 17:50:08 -08:00
page_alloc.c mm: rename page struct field helpers 2013-02-23 17:50:18 -08:00
page_cgroup.c memcontrol: use N_MEMORY instead N_HIGH_MEMORY 2012-12-12 17:38:32 -08:00
page_io.c mm: add support for direct_IO to highmem pages 2012-07-31 18:42:47 -07:00
page_isolation.c mm: fix zone_watermark_ok_safe() accounting of isolated pages 2013-01-04 16:11:46 -08:00
page-writeback.c page-writeback.c: subtract min_free_kbytes from dirtyable memory 2013-02-23 17:50:17 -08:00
pagewalk.c thp: change split_huge_page_pmd() interface 2012-12-12 17:38:31 -08:00
percpu-km.c
percpu-vm.c mm: fix kernel-doc warnings 2012-06-20 14:39:36 -07:00
percpu.c mm, percpu: Make sure percpu_alloc early parameter has an argument 2012-12-02 06:23:04 -08:00
pgtable-generic.c mm: Only flush the TLB when clearing an accessible pte 2012-12-11 14:28:34 +00:00
process_vm_access.c aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector() 2012-05-31 17:49:32 -07:00
quicklist.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
readahead.c switch simple cases of fget_light to fdget 2012-09-26 22:20:08 -04:00
rmap.c mm/rmap: rename anon_vma_unlock() => anon_vma_unlock_write() 2013-02-23 17:50:17 -08:00
shmem.c mempolicy: remove arg from mpol_parse_str, mpol_to_str 2013-01-02 09:27:10 -08:00
slab_common.c slab: propagate tunable values 2012-12-18 15:02:14 -08:00
slab.c memcg: add comments clarifying aspects of cache attribute propagation 2012-12-18 15:02:15 -08:00
slab.h slab: propagate tunable values 2012-12-18 15:02:14 -08:00
slob.c mm: rename page struct field helpers 2013-02-23 17:50:18 -08:00
slub.c mm: rename page struct field helpers 2013-02-23 17:50:18 -08:00
sparse-vmemmap.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
sparse.c memory-failure: use num_poisoned_pages instead of mce_bad_pages 2013-02-23 17:50:15 -08:00
swap_state.c swap: add per-partition lock for swapfile 2013-02-23 17:50:17 -08:00
swap.c swap: make each swap partition have one address_space 2013-02-23 17:50:17 -08:00
swapfile.c swap: add per-partition lock for swapfile 2013-02-23 17:50:17 -08:00
truncate.c mm: drop vmtruncate 2012-12-20 18:46:29 -05:00
util.c swap: make each swap partition have one address_space 2013-02-23 17:50:17 -08:00
vmalloc.c mm: use IS_ENABLED(CONFIG_NUMA) instead of NUMA_BUILD 2012-12-11 17:22:22 -08:00
vmscan.c swap: add per-partition lock for swapfile 2013-02-23 17:50:17 -08:00
vmstat.c mm: don't wait on congested zones in balance_pgdat() 2013-02-23 17:50:15 -08:00