linux/mm
Huang Ying a3aea839e4 mm, THP, swap: support to clear swap cache flag for THP swapped out
Patch series "mm, THP, swap: Delay splitting THP after swapped out", v3.

This is the second step of THP (Transparent Huge Page) swap
optimization.  In the first step, the splitting huge page is delayed
from almost the first step of swapping out to after allocating the swap
space for the THP and adding the THP into the swap cache.  In the second
step, the splitting is delayed further to after the swapping out
finished.  The plan is to delay splitting THP step by step, finally
avoid splitting THP for the THP swapping out and swap out/in the THP as
a whole.

In the patchset, more operations for the anonymous THP reclaiming, such
as TLB flushing, writing the THP to the swap device, removing the THP
from the swap cache are batched.  So that the performance of anonymous
THP swapping out are improved.

During the development, the following scenarios/code paths have been
checked,

 - swap out/in
 - swap off
 - write protect page fault
 - madvise_free
 - process exit
 - split huge page

With the patchset, the swap out throughput improves 42% (from about
5.81GB/s to about 8.25GB/s) in the vm-scalability swap-w-seq test case
with 16 processes.  At the same time, the IPI (reflect TLB flushing)
reduced about 78.9%.  The test is done on a Xeon E5 v3 system.  The swap
device used is a RAM simulated PMEM (persistent memory) device.  To test
the sequential swapping out, the test case creates 8 processes, which
sequentially allocate and write to the anonymous pages until the RAM and
part of the swap device is used up.

Below is the part of the cover letter for the first step patchset of THP
swap optimization which applies to all steps.

=========================

Recently, the performance of the storage devices improved so fast that
we cannot saturate the disk bandwidth with single logical CPU when do
page swap out even on a high-end server machine.  Because the
performance of the storage device improved faster than that of single
logical CPU.  And it seems that the trend will not change in the near
future.  On the other hand, the THP becomes more and more popular
because of increased memory size.  So it becomes necessary to optimize
THP swap performance.

The advantages of the THP swap support include:

 - Batch the swap operations for the THP to reduce TLB flushing and lock
   acquiring/releasing, including allocating/freeing the swap space,
   adding/deleting to/from the swap cache, and writing/reading the swap
   space, etc. This will help improve the performance of the THP swap.

 - The THP swap space read/write will be 2M sequential IO. It is
   particularly helpful for the swap read, which are usually 4k random
   IO. This will improve the performance of the THP swap too.

 - It will help the memory fragmentation, especially when the THP is
   heavily used by the applications. The 2M continuous pages will be
   free up after THP swapping out.

 - It will improve the THP utilization on the system with the swap
   turned on. Because the speed for khugepaged to collapse the normal
   pages into the THP is quite slow. After the THP is split during the
   swapping out, it will take quite long time for the normal pages to
   collapse back into the THP after being swapped in. The high THP
   utilization helps the efficiency of the page based memory management
   too.

There are some concerns regarding THP swap in, mainly because possible
enlarged read/write IO size (for swap in/out) may put more overhead on
the storage device.  To deal with that, the THP swap in should be turned
on only when necessary.

For example, it can be selected via "always/never/madvise" logic, to be
turned on globally, turned off globally, or turned on only for VMA with
MADV_HUGEPAGE, etc.

This patch (of 12):

Previously, swapcache_free_cluster() is used only in the error path of
shrink_page_list() to free the swap cluster just allocated if the THP
(Transparent Huge Page) is failed to be split.  In this patch, it is
enhanced to clear the swap cache flag (SWAP_HAS_CACHE) for the swap
cluster that holds the contents of THP swapped out.

This will be used in delaying splitting THP after swapping out support.
Because there is no THP swapping in as a whole support yet, after
clearing the swap cache flag, the swap cluster backing the THP swapped
out will be split.  So that the swap slots in the swap cluster can be
swapped in as normal pages later.

Link: http://lkml.kernel.org/r/20170724051840.2309-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
Cc: Vishal L Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:27 -07:00
..
kasan Merge branch 'linus' into locking/core, to pick up fixes 2017-08-10 12:20:53 +02:00
backing-dev.c bdi: Drop 'parent' argument from bdi_register[_va]() 2017-04-20 12:09:55 -06:00
balloon_compaction.c mm/balloon_compaction.c: don't zero ballooned pages 2017-08-10 15:54:07 -07:00
bootmem.c mm/bootmem.c: cosmetic improvement of code readability 2017-02-22 16:41:29 -08:00
cleancache.c fs: switch ->s_uuid to uuid_t 2017-06-05 16:59:12 +02:00
cma_debug.c mm/cma_debug.c: fix stack corruption due to sprintf usage 2017-08-18 15:32:02 -07:00
cma.c cma: fix calculation of aligned offset 2017-07-10 16:32:32 -07:00
cma.h cma: Store a name in the cma structure 2017-04-18 20:41:12 +02:00
compaction.c mm, compaction: skip over holes in __reset_isolation_suitable 2017-07-06 16:24:32 -07:00
debug_page_ref.c
debug.c mm: make tlb_flush_pending global 2017-08-10 15:54:07 -07:00
dmapool.c lib/vsprintf.c: remove %Z support 2017-02-27 18:43:47 -08:00
early_ioremap.c x86/mm: Add support to access boot related data in the clear 2017-07-18 11:38:02 +02:00
fadvise.c mm: fadvise: avoid expensive remote LRU cache draining after FADV_DONTNEED 2016-12-20 09:48:46 -08:00
failslab.c
filemap.c mm: use find_get_pages_range() in filemap_range_has_page() 2017-09-06 17:27:27 -07:00
frame_vector.c treewide: use kv[mz]alloc* rather than opencoded variants 2017-05-08 17:15:13 -07:00
frontswap.c
gup.c mm/gup: make __gup_device_* require THP 2017-09-06 17:27:26 -07:00
highmem.c
huge_memory.c mm/huge_memory.c: constify attribute_group structures 2017-09-06 17:27:27 -07:00
hugetlb_cgroup.c
hugetlb.c mm/hugetlb.c: constify attribute_group structures 2017-09-06 17:27:27 -07:00
hwpoison-inject.c mm: hwpoison: call shake_page() unconditionally 2017-05-03 15:52:12 -07:00
init-mm.c
internal.h mm, memory_hotplug: drop zone from build_all_zonelists 2017-09-06 17:27:25 -07:00
interval_tree.c
Kconfig mm/kasan: add support for memory hotplug 2017-07-10 16:32:33 -07:00
Kconfig.debug mm: enable page poisoning early at boot 2017-05-03 15:52:10 -07:00
khugepaged.c mm: make PR_SET_THP_DISABLE immediately active 2017-07-10 16:32:31 -07:00
kmemcheck.c mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU 2017-04-18 11:42:36 -07:00
kmemleak-test.c
kmemleak.c mm: kmemleak: treat vm_struct as alternative reference to vmalloc'ed objects 2017-07-06 16:24:34 -07:00
ksm.c mm/ksm.c: constify attribute_group structures 2017-09-06 17:27:27 -07:00
list_lru.c mm/list_lru.c: fix list_lru_count_node() to be race free 2017-07-10 16:32:33 -07:00
maccess.c
madvise.c mm, madvise: ensure poisoned pages are removed from per-cpu lists 2017-08-31 16:33:15 -07:00
Makefile percpu: expose statistics about percpu memory via debugfs 2017-06-20 15:31:38 -04:00
memblock.c mm/memblock.c: reversed logic in memblock_discard() 2017-08-25 16:12:46 -07:00
memcontrol.c mm: memcontrol: use int for event/state parameter in several functions 2017-09-06 17:27:27 -07:00
memory_hotplug.c mm, memory_hotplug: get rid of zonelists_mutex 2017-09-06 17:27:26 -07:00
memory-failure.c x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages 2017-08-17 10:30:49 +02:00
memory.c mm: always flush VMA ranges affected by zap_page_range 2017-09-06 17:27:26 -07:00
mempolicy.c mm/mempolicy: fix use after free when calling get_mempolicy 2017-08-18 15:32:02 -07:00
mempool.c sched/wait: Rename wait_queue_t => wait_queue_entry_t 2017-06-20 12:18:27 +02:00
memtest.c
migrate.c Sanitize 'move_pages()' permission checks 2017-08-20 13:26:27 -07:00
mincore.c mm: remove shmem_mapping() shmem_zero_setup() duplicates 2017-02-24 17:46:56 -08:00
mlock.c mlock: fix mlock count can not decrease in race condition 2017-06-02 15:07:38 -07:00
mm_init.c
mmap.c mm: fix overflow check in expand_upwards() 2017-07-14 15:05:12 -07:00
mmu_context.c sched/headers: Prepare to move the task_lock()/unlock() APIs to <linux/sched/task.h> 2017-03-02 08:42:38 +01:00
mmu_notifier.c mm/mmu_notifier: kill invalidate_page 2017-08-31 16:13:00 -07:00
mmzone.c mm/mmzone.c: swap likely to unlikely as code logic is different for next_zones_zonelist() 2017-02-22 16:41:29 -08:00
mprotect.c mm: migrate: prevent racy access to tlb_flush_pending 2017-08-10 15:54:07 -07:00
mremap.c mm/mremap: fail map duplication attempts for private mappings 2017-09-06 17:27:26 -07:00
msync.c
nobootmem.c mm: discard memblock data later 2017-08-18 15:32:01 -07:00
nommu.c mm, vmalloc: use __GFP_HIGHMEM implicitly 2017-05-08 17:15:13 -07:00
oom_kill.c mm/oom_kill.c: add tracepoints for oom reaper-related events 2017-07-10 16:32:32 -07:00
page_alloc.c mm, memory_hotplug: get rid of zonelists_mutex 2017-09-06 17:27:26 -07:00
page_counter.c
page_ext.c mm, page_ext: periodically reschedule during page_ext_init() 2017-09-06 17:27:26 -07:00
page_idle.c mm/page_idle.c: constify attribute_group structures 2017-09-06 17:27:27 -07:00
page_io.c mm/page_io.c: fix oops during block io poll in swapin path 2017-08-02 17:16:11 -07:00
page_isolation.c mm: unify new_node_page and alloc_migrate_target 2017-07-10 16:32:31 -07:00
page_owner.c mm, page_owner: don't grab zone->lock for init_pages_in_zone() 2017-09-06 17:27:26 -07:00
page_poison.c mm: enable page poisoning early at boot 2017-05-03 15:52:10 -07:00
page_vma_mapped.c mm/hugetlb: add size parameter to huge_pte_offset() 2017-07-06 16:24:34 -07:00
page-writeback.c mm: memcontrol: fix NULL pointer crash in test_clear_page_writeback() 2017-08-18 15:32:01 -07:00
pagewalk.c mm/hugetlb: add size parameter to huge_pte_offset() 2017-07-06 16:24:34 -07:00
percpu-internal.h percpu: fix early calls for spinlock in pcpu_stats 2017-06-21 13:53:52 -04:00
percpu-km.c percpu: fix static checker warnings in pcpu_destroy_chunk 2017-06-29 11:23:38 -04:00
percpu-stats.c percpu: expose statistics about percpu memory via debugfs 2017-06-20 15:31:38 -04:00
percpu-vm.c percpu: fix static checker warnings in pcpu_destroy_chunk 2017-06-29 11:23:38 -04:00
percpu.c percpu: resolve err may not be initialized in pcpu_alloc 2017-06-21 12:00:45 -04:00
pgtable-generic.c mm: convert generic code to 5-level paging 2017-03-09 11:48:47 -08:00
process_vm_access.c sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h> 2017-03-02 08:42:28 +01:00
quicklist.c
readahead.c mm: don't cap request size based on read-ahead setting 2016-12-12 18:55:08 -08:00
rmap.c mm/rmap: update to new mmu_notifier semantic v2 2017-08-31 16:12:59 -07:00
rodata_test.c mm: remove rodata_test_data export, add pr_fmt 2017-05-03 15:52:09 -07:00
shmem.c mm, shmem: fix handling /sys/kernel/mm/transparent_hugepage/shmem_enabled 2017-08-25 16:12:46 -07:00
slab_common.c mm: allow slab_nomerge to be set at build time 2017-07-06 16:24:31 -07:00
slab.c mm: memcontrol: account slab stats per lruvec 2017-07-06 16:24:35 -07:00
slab.h locking/lockdep: Rework FS_RECLAIM annotation 2017-08-10 12:29:03 +02:00
slob.c locking/lockdep: Rework FS_RECLAIM annotation 2017-08-10 12:29:03 +02:00
slub.c mm/slub.c: constify attribute_group structures 2017-09-06 17:27:27 -07:00
sparse-vmemmap.c mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations 2017-09-06 17:27:26 -07:00
sparse.c mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations 2017-09-06 17:27:26 -07:00
swap_cgroup.c mm, THP, swap: delay splitting THP during swap out 2017-07-06 16:24:31 -07:00
swap_slots.c mm/swap_slots.c: don't disable preemption while taking the per-CPU cache 2017-07-10 16:32:32 -07:00
swap_state.c swap: add block io poll in swapin path 2017-07-10 16:32:30 -07:00
swap.c mm: remove nr_pages argument from pagevec_lookup{,_range}() 2017-09-06 17:27:27 -07:00
swapfile.c mm, THP, swap: support to clear swap cache flag for THP swapped out 2017-09-06 17:27:27 -07:00
truncate.c mm/truncate.c: fix THP handling in invalidate_mapping_pages() 2017-07-10 16:32:32 -07:00
usercopy.c mm/usercopy: Drop extra is_vmalloc_or_module() check 2017-04-05 12:30:18 -07:00
userfaultfd.c mm: convert generic code to 5-level paging 2017-03-09 11:48:47 -08:00
util.c mm: fix global NR_SLAB_.*CLAIMABLE counter reads 2017-08-10 15:54:06 -07:00
vmacache.c sched/headers: Prepare to move 'init_task' and 'init_thread_union' from <linux/sched.h> to <linux/sched/task.h> 2017-03-02 08:42:38 +01:00
vmalloc.c mm/vmalloc.c: don't unconditonally use __GFP_HIGHMEM 2017-08-18 15:32:02 -07:00
vmpressure.c mm, vmpressure: pass-through notification support 2017-07-10 16:32:31 -07:00
vmscan.c mm, vmscan: do not loop on too_many_isolated for ever 2017-09-06 17:27:26 -07:00
vmstat.c mm: avoid taking zone lock in pagetypeinfo_showmixed() 2017-07-10 16:32:32 -07:00
workingset.c mm: memcontrol: per-lruvec stats infrastructure 2017-07-06 16:24:35 -07:00
z3fold.c z3fold: fix page locking in z3fold_alloc() 2017-04-13 18:24:20 -07:00
zbud.c
zpool.c
zsmalloc.c zsmalloc: zs_page_migrate: skip unnecessary loops but not return -EBUSY if zspage is not inuse 2017-09-06 17:27:26 -07:00
zswap.c mm/zswap.c: delete an error message for a failed memory allocation in zswap_dstmem_prepare() 2017-07-06 16:24:35 -07:00